Single image depth estimation (abbreviated as SIDE hereafter) has applications in augmented reality, robotics and artistic image enhancement, such as bokeh rendering [28]. However, the problem is highly under-constrained: a given 2D image can be mapped to many distinct 3D scenes in the real world. Recently, with the advent of deep learning, SIDE has witnessed significant progress [6, 5, 21, 15, 29, 18, 1, 19, 7, 31, 10, 8, 30, 20, 22, 16, 17]. Many SIDE methods train an encoder-decoder style network using a pixel-wise regression loss [2]. However, it is challenging for the network to regress the true depth of a scene: with focal length adjustments, two different cameras placed at different distances from a target scene can capture identical 2D images [32, 4].
Inspired by [1], Fu et al. [7] recently proposed an ordinal-regression-based approach called DORN, which outperformed other SIDE methods by a significant margin. However, DORN is trained using an ordinal classification loss, while for inference the authors apply a naïve threshold strategy on the classification output to determine the per-pixel depth label. The depth maps generated with this strategy do not obey smoothness and boundary constraints and have severe discretization artifacts (see Fig. 1c). Consequently, the depth maps are not suitable for practical applications. In this work, we solve this important problem while also advancing the state of the art on challenging benchmarks.
2 Related Work
Eigen et al. [6, 5] were the first to attempt deep-learning-based SIDE. They also proposed a scale-invariant loss term for training a robust depth estimation network. [21] integrated conditional random fields (CRFs) into a CNN to learn the unary and pairwise potentials of the CRF. [15] proposed a residual-learning-based depth estimation model with faster upconvolutions. [18] proposed a two-streamed network to estimate depth values and depth gradients separately. [1] presented a pioneering approach by formulating depth estimation as a classification problem, which outperformed all previous methods. [19] proposed a deep attention-based classification network that re-weights channels in skip connections to handle varying depth ranges. Recently, [7] proposed an ordinal-classification-based approach which outperformed all existing methods. However, depth estimation in [7] is not performed in an end-to-end fashion, leading to sub-optimal results and depth artifacts. [31] proposed to estimate a coarse-scale relative depth map which serves as a global scene prior for estimating true depth. [10] advocated the use of large rectangular convolution kernels based on observations of depth variation along the vertical and horizontal directions. [8] and [30] used attention mechanisms to fuse multiscale feature maps. [20] proposed a piecewise planar depth estimation network to perform the plane segmentation task. [16] utilized a Fourier-transform-based approach to combine multiple depth estimates.
Existing methods primarily focus on improving pixel-wise accuracy, which does not usually correlate with qualitative aspects such as depth consistency, edge accuracy and smooth depth variations [14]. As a result, many current state-of-the-art methods generate depth maps which are not suited for practical applications. To summarize, the major limitations of existing methods are as follows: (a) many methods adopt a pixel-wise regression approach, which is a difficult learning task; (b) classification-based SIDE approaches do not utilize the output probability distribution during training; (c) many methods achieve good quantitative scores, but their depth maps lack practical utility. In this work, we address these important limitations with several novel formulations in network design and loss function. The proposed approach targets quantitative as well as qualitative aspects of depth estimation. The proposed model, AcED, generates Accurate and Edge-consistent Depth and achieves state-of-the-art results on challenging benchmarks. AcED also has practical utility, as it enables a challenging single-camera bokeh application.
The major contributions of this work are as follows: (a) a novel two-stage SIDE approach comprising ordinal classification and pixel-wise regression; (b) a novel fully differentiable variant of ordinal regression for end-to-end training; (c) a novel confidence map computation technique derived from the proposed fully differentiable ordinal regression; (d) extensive experiments and ablation studies to demonstrate the advantages of the algorithmic choices; (e) a demonstration of the utility of the proposed model in a challenging real-life application.
3 Proposed Approach
3.1 Architecture Overview
Fig. 2 shows the detailed architecture of AcED. It can be conceptually divided into three subnetworks:
3.1.1 Dense Feature Extraction
SIDE is an ill-posed problem and it requires a high degree of scene understanding. Existing methods adopt a CNN pre-trained on a scene recognition task for dense feature extraction. Popular options include VGGNet [26], ResNet [9], DenseNet [12] and SENet [11]. In this work, we adopt SENet-154 as the backbone encoder network because of its superior performance on the image classification task.
3.1.2 Depth Estimation
This is the coarse-scale depth estimation subnetwork, which is trained using the proposed ordinal regression loss. It estimates a coarse-scale depth map along with a confidence map (see Section 3.3). This subnetwork (comprising the green blocks and the fully differentiable ordinal regression block in Fig. 2) upsamples the high-level feature maps using low-level information via skip connections.
3.1.3 Depth Refinement
This is the depth refinement subnetwork (see Section 3.4). It takes the coarse-scale depth map, the confidence map and multiscale low-level feature maps as input to correct the low-confidence areas and generate a depth map with improved structural information.
3.2 Depth Discretization
In order to formulate depth estimation as a classification problem, the depth map is discretized into multiple classes, where each class corresponds to a unique depth value. Similar to [7], we adopt spacing-increasing discretization. If the depth range of a given training dataset is $[\alpha, \beta]$ and $K$ is the desired number of discretization levels, a spacing-increasing discretization can be achieved by uniformly discretizing the depth range in logarithmic space. Mathematically, the depth discretization threshold $t_k$ is computed as follows:
$$t_k = \exp\left(\log(\alpha) + \frac{\log(\beta / \alpha)\, k}{K}\right), \quad k \in \{0, 1, \ldots, K\} \tag{1}$$
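The discretization described above can be sketched in a few lines of Python. This is a minimal illustration, assuming thresholds spaced uniformly in log-depth between the range limits; the function names and the example range (1 m to 16 m, 4 levels) are hypothetical and are not taken from the experiments.

```python
import math


def sid_thresholds(alpha: float, beta: float, num_levels: int) -> list:
    """Return thresholds t_0..t_K spaced uniformly in log-depth over [alpha, beta]."""
    if not 0 < alpha < beta or num_levels < 1:
        raise ValueError("expected 0 < alpha < beta and num_levels >= 1")
    log_alpha = math.log(alpha)
    log_ratio = math.log(beta / alpha)
    return [math.exp(log_alpha + log_ratio * k / num_levels)
            for k in range(num_levels + 1)]


def depth_to_label(depth: float, thresholds: list) -> int:
    """Count interior thresholds below `depth`; the result lies in [0, K - 1]."""
    interior = thresholds[1:-1]
    return sum(1 for t in interior if depth > t)


# Illustrative values only: a 1 m to 16 m range split into 4 levels.
thresholds = sid_thresholds(alpha=1.0, beta=16.0, num_levels=4)
print([round(t, 3) for t in thresholds])
print(depth_to_label(3.0, thresholds))
```

With this toy range the bins are roughly 1, 2, 4 and 8 m wide, which shows why the discretization is called spacing-increasing: far depths are quantized more coarsely than near depths.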
3.3 Fully Differentiable Ordinal Regression
First, we explain the ordinal classification technique. As described in Section 3.2, the ground-truth depth maps are discretized into $K$ levels. $K$ binary classifiers are employed to train the depth estimation subnetwork, where the $k$-th classifier learns to predict whether the depth value of a given pixel is greater than the depth value belonging to label $k$. To train the classifiers, a $K$-size ground-truth rank vector is created for every pixel. As an example, if the actual depth value of a given pixel belongs to $(t_l, t_{l+1}]$, the ground-truth rank vector for the pixel is encoded as $[1, \ldots, 1, 0, \ldots, 0]$, such that the first $l$ values are set to 1 and the remaining values are set to 0. Fig. 3a shows the graphical representation of a sample rank vector. The depth estimation subnetwork outputs $2K$ feature maps, where every two consecutive feature maps correspond to the output of a binary classifier. The pixel-wise ordinal classification loss on this $2K$-channel output is computed as follows:
$$L_{ord} = -\frac{1}{W H} \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} \left[ \sum_{k=0}^{l_{(w,h)}-1} \log P^{k}_{(w,h)} + \sum_{k=l_{(w,h)}}^{K-1} \log\left(1 - P^{k}_{(w,h)}\right) \right] \tag{2}$$
In Eq. 2, $l_{(w,h)}$ is the ground-truth depth label. This loss is computed over all the pixels indexed using the width and height tuple $(w, h)$. Here, $P^{k}_{(w,h)}$ is computed by a softmax operation over the $2k$-th and $(2k+1)$-th channels, where $k \in [0, K)$.
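The rank-vector encoding and the ordinal classification loss can be illustrated with a small, framework-free sketch. It assumes that each binary classifier owns two consecutive channels and that the second channel of each pair scores the "depth is greater than this label" outcome; that channel convention, the averaging over pixels and all function names are assumptions made for illustration.

```python
import math


def rank_vector(label: int, num_levels: int) -> list:
    """Encode a depth label as a K-size vector: first `label` entries 1, rest 0."""
    return [1 if k < label else 0 for k in range(num_levels)]


def classifier_probs(logits: list) -> list:
    """Two-way softmax per classifier over channels 2k and 2k+1.

    Assumption: channel 2k+1 is the 'greater than label k' score.
    """
    probs = []
    for k in range(len(logits) // 2):
        diff = logits[2 * k + 1] - logits[2 * k]
        # A two-way softmax equals a sigmoid of the logit difference;
        # the branch keeps the exponent non-positive for stability.
        if diff >= 0:
            probs.append(1.0 / (1.0 + math.exp(-diff)))
        else:
            e = math.exp(diff)
            probs.append(e / (1.0 + e))
    return probs


def ordinal_loss(pixel_logits: list, pixel_labels: list, eps: float = 1e-12) -> float:
    """Average ordinal classification loss over a list of pixels."""
    total = 0.0
    for logits, label in zip(pixel_logits, pixel_labels):
        probs = classifier_probs(logits)
        targets = rank_vector(label, len(probs))
        for p, t in zip(probs, targets):
            total -= math.log(p + eps) if t == 1 else math.log(1.0 - p + eps)
    return total / len(pixel_labels)


# One pixel, K = 4 levels, ground-truth label 2.
good = [[-10.0, 10.0, -10.0, 10.0, 10.0, -10.0, 10.0, -10.0]]
uninformed = [[0.0] * 8]
print(rank_vector(2, 4))
print(ordinal_loss(good, [2]), ordinal_loss(uninformed, [2]))
```

A network that outputs equal scores for every channel pays a loss of $K \log 2$ per pixel, while logits that agree with the rank vector drive the loss towards zero.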
To generate the depth map from the classification output at inference time, Fu et al. [7] employ a naïve threshold technique and convert the estimated probability distribution of every pixel to a binary rank vector. Finally, the depth value of a pixel is read from the depth discretization thresholds $t$ (see Section 3.2) at the index given by the count of 1s in the binarized rank vector. The depth map inferred in this manner does not follow boundary and smoothness constraints. Moreover, as a result of this hard inference technique, the network cannot be trained in an end-to-end manner, leading to suboptimal results.
In this work, we first analyze the true and estimated probability distribution of the rank vector of a pixel (see Fig. 3). Mathematically, the area under the true distribution curve in Fig. 3a corresponds to the true depth label of the pixel. Similarly, in the estimated distribution in Fig. 3b, the area under the distribution curve corresponds to the expected depth label of the pixel. Hence, the expected label of a pixel can be computed from its estimated rank vector as follows:
$$\hat{l}_{(w,h)} = \sum_{k=0}^{K-1} P^{k}_{(w,h)} \tag{3}$$
This computation is fully differentiable and allows us to train the network in a completely end-to-end fashion. It also enables continuous and smooth depth transitions. The expected depth labels (treated as $k$ in Eq. 1) obtained for all pixels using Eq. 3 are converted to approximate true depths (coarse depth in Fig. 2) using Eq. 1, where the depth range $[\alpha, \beta]$ is taken to be the same as that of the training dataset.
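The difference between hard thresholding and the expected-label computation can be seen on a single pixel. The sketch below assumes the expected label is the sum of the per-classifier probabilities and that a fractional label is plugged into the log-uniform threshold formula of Section 3.2; the 0.5 cutoff for the hard variant, the probabilities and the depth range are illustrative assumptions.

```python
import math


def hard_label(probs: list, cutoff: float = 0.5) -> int:
    """Threshold-style decoding: count classifiers that fire (non-differentiable)."""
    return sum(1 for p in probs if p >= cutoff)


def expected_label(probs: list) -> float:
    """Soft decoding: area under the estimated rank-vector distribution."""
    return sum(probs)


def label_to_depth(label: float, alpha: float, beta: float, num_levels: int) -> float:
    """Map a (possibly fractional) label to depth with log-uniform spacing."""
    return math.exp(math.log(alpha) + math.log(beta / alpha) * label / num_levels)


# Illustrative per-classifier probabilities for one pixel with K = 4.
probs = [0.9, 0.8, 0.6, 0.3]
soft = expected_label(probs)
hard = hard_label(probs)
soft_depth = label_to_depth(soft, alpha=1.0, beta=16.0, num_levels=4)
hard_depth = label_to_depth(hard, alpha=1.0, beta=16.0, num_levels=4)
print(soft, hard)
print(soft_depth, hard_depth)
```

The hard label can only land on one of the discrete thresholds, whereas the expected label moves continuously with the probabilities, which is what removes the staircase artifacts and keeps gradients flowing to every classifier.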
Additionally, we propose to measure the confidence associated with the coarse depth map estimation. The confidence measure for the estimated depth of a given pixel $(w, h)$ can be defined as its variance from the expected depth label. Ideally, the estimated rank vector should have probabilities closer to 1 before the expected depth label and probabilities closer to 0 after the expected depth label. Hence, the confidence value of a pixel can be computed as follows:
Here, $\hat{l}_{(w,h)}$ is the expected depth label for a pixel $(w, h)$.
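As an illustration of the idea only, one way to turn this description into a number is to compare the estimated rank vector with the ideal step-shaped vector implied by its own expected label. The specific form below (one minus the mean squared deviation, with a linearly interpolated step) is an assumption made for this sketch and should not be read as the paper's confidence formula.

```python
def confidence(probs: list) -> float:
    """Hypothetical confidence score in [0, 1] for one pixel's rank vector.

    Assumption: the ideal vector is 1 before the expected label and 0 after
    it, with a fractional entry at the transition so both vectors share the
    same sum.
    """
    num_levels = len(probs)
    expected = sum(probs)
    ideal = [min(1.0, max(0.0, expected - k)) for k in range(num_levels)]
    mean_sq_dev = sum((p - q) ** 2 for p, q in zip(probs, ideal)) / num_levels
    return 1.0 - mean_sq_dev


sharp = confidence([1.0, 1.0, 0.0, 0.0])   # matches the ideal step exactly
flat = confidence([0.5, 0.5, 0.5, 0.5])    # classifiers are undecided
print(sharp, flat)
```

Under this assumed form a crisp rank vector scores 1, and the score drops as the probabilities drift away from the step shape, which is the behaviour expected near occlusions and thin structures.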
3.4 Structure Refinement Module
We add a structure refinement module to refine the coarse scale depth map. This is a residual block with two x convolutions which takes the coarse scale depth map, confidence map and output of multiscale feature fusion module as input to generate a refined depth map. Fig. 4 shows the design of multiscale feature fusion module which takes low-level feature maps from the encoder as input and upsamples them to a desired common scale. The upsampled low-level feature maps are then processed by different residual blocks with two convolution layers and finally all the feature maps are concatenated and merged using x convolution.
3.5 Pixel-wise Depth Regression Losses
The depth refinement subnetwork is trained using pixel-wise regression losses. We use natural logarithm of the absolute difference between the estimated and ground-truth depth and their gradients as our loss function. The weights of these two loss terms are determined empirically using the validation dataset. In Eq. 5a and 5b, and refer to estimated and ground-truth depth maps respectively.
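A log-of-absolute-difference loss on depth and on gradients can be sketched as follows. The additive offset inside the logarithm (needed so that a zero error gives a finite value), the forward-difference gradients and the equal example weights are all assumptions for illustration; the weights actually used are determined empirically, as stated above.

```python
import math


def log_abs_mean(values: list, offset: float = 1.0) -> float:
    """Mean of ln(|v| + offset); the offset is an assumed stabilizer."""
    return sum(math.log(abs(v) + offset) for v in values) / len(values)


def depth_loss(pred: list, gt: list) -> float:
    """Log-absolute depth error averaged over all pixels of 2D depth maps."""
    errors = [p - g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr)]
    return log_abs_mean(errors)


def gradient_loss(pred: list, gt: list) -> float:
    """Log-absolute error of horizontal and vertical forward differences.

    Requires maps with at least 2 rows and 2 columns.
    """
    err = [[p - g for p, g in zip(pr, gr)] for pr, gr in zip(pred, gt)]
    rows, cols = len(err), len(err[0])
    dx = [err[y][x + 1] - err[y][x] for y in range(rows) for x in range(cols - 1)]
    dy = [err[y + 1][x] - err[y][x] for y in range(rows - 1) for x in range(cols)]
    return log_abs_mean(dx) + log_abs_mean(dy)


gt = [[1.0, 2.0], [3.0, 4.0]]
shifted = [[2.0, 3.0], [4.0, 5.0]]   # constant offset: depth error, no edge error
spiked = [[1.0, 2.0], [3.0, 5.0]]    # single wrong pixel: creates gradient error
w_depth, w_grad = 1.0, 1.0           # hypothetical weights
total = w_depth * depth_loss(spiked, gt) + w_grad * gradient_loss(spiked, gt)
print(depth_loss(shifted, gt), gradient_loss(shifted, gt), total)
```

The toy inputs show the complementary roles of the two terms: a uniform depth shift is penalized only by the depth term, while a localized error also raises the gradient term, which is what encourages sharp and consistent edges.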
4 Experiments and Analysis
4.1.1 NYU Depth V2
The NYU Depth V2 [25] dataset contains indoor scenes captured with Microsoft Kinect. Like existing methods, we train our model on predefined scenes and evaluate on test images. To reduce training time, we sampled training images from scenes which is x lesser than DORN [7]. For training, we resize input images from x to x and randomly crop regions of size x. Similarly, test images are resized to x and the estimated depth map is upsampled to original resolution of x for comparison against ground-truth. We adopt the evaluation procedure of recent methods [3, 10, 16, 17] which use center crops of size x from estimated and ground-truth depth maps.
4.1.2 iBims-1
iBims-1 [14] is a new benchmark which aims at evaluating depth estimation methods on important qualitative aspects, such as depth boundary errors, depth consistency and robustness to depth range. This dataset contains images with dense depth ground-truth and is only used for evaluation purposes. Thus, this dataset is also useful for testing the generalization of SIDE methods.
4.2 Implementation Details
The PyTorch framework was used for implementation. Adam optimization [13] was used with initial learning rate x and momentum term . A polynomial learning rate decay policy was applied with power term . The proposed model was trained on the NYU Depth V2 dataset, while the iBims-1 dataset was used only for evaluation. The model was trained for epochs using batch-size on NVIDIA P40 GPUs. Data augmentation in the form of random crop, brightness, contrast and color shift was performed on the fly.
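A polynomial learning rate decay policy is commonly written as the base rate multiplied by $(1 - \text{step}/\text{max\_steps})^{\text{power}}$. The sketch below assumes that common form; the base rate, step count and power shown are placeholder values chosen for illustration, not the settings used in training.

```python
def poly_lr(base_lr: float, step: int, max_steps: int, power: float) -> float:
    """Polynomially decayed learning rate at a given training step."""
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power


# Placeholder hyperparameters, for illustration only.
schedule = [poly_lr(base_lr=1e-4, step=s, max_steps=100, power=0.9)
            for s in (0, 25, 50, 75, 100)]
print(schedule)
```

The rate starts at the base value and decays smoothly to zero at the final step; a power below 1 keeps the rate higher for longer than a linear schedule would.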
|Eigen & Fergus ||0.158||-||0.639||77.1||95.0||98.8|
4.3 Ablation Study
In order to justify our algorithmic choices, we train and evaluate the following two models on NYU Depth V2 dataset: (a) Baseline: This model does not include confidence map computation and depth refinement submodule and it is trained using ordinal classification technique. (b) AcED: This model includes confidence map computation and depth refinement submodule and it is trained from scratch in end-to-end fashion with same settings as baseline model.
Table 1 shows the quantitative comparison between the baseline model and AcED. It can be seen that AcED achieves significant improvement in all the quantitative metrics, proving the benefit of the proposed network and loss function design. In Fig. 5, it can be observed that the confidence map displays low confidence values near small gaps and occlusion regions. It can be seen that the depth areas with low confidence values are corrected in the refined depth map.
4.4 Results and Discussions
Finally, we evaluate AcED against the state-of-the-art methods. Standard metrics, viz., mean absolute relative error (rel), mean $\log_{10}$ error, root mean squared error (rms) and accuracy under different thresholds ($\delta_i < 1.25^i$, where $i \in \{1, 2, 3\}$), are used for evaluation (for a detailed description, refer to [6]). Additionally, we use the new metrics proposed in [14] for evaluating qualitative aspects, such as depth boundary error (DBE), directed depth error (DDE) and planarity error (PE). DDE measures the accuracy of depth at a given plane, while PE and the plane orientation error (OE) together reflect the accuracy of object shapes.
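The standard metrics can be computed as in the following sketch. It assumes the usual conventions for these measures (relative error normalized by ground truth, base-10 logarithms, and threshold accuracy defined on the ratio max(pred/gt, gt/pred) with thresholds 1.25, 1.25² and 1.25³); inputs are flat lists of strictly positive depths, and the function name is hypothetical.

```python
import math


def depth_metrics(pred: list, gt: list) -> dict:
    """Compute rel, log10, rms and threshold accuracies for positive depths."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    n = len(gt)
    pairs = list(zip(pred, gt))
    metrics = {
        "rel": sum(abs(p - g) / g for p, g in pairs) / n,
        "log10": sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n,
        "rms": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
    }
    ratios = [max(p / g, g / p) for p, g in pairs]
    for i in (1, 2, 3):
        # Fraction of pixels whose ratio is within the i-th threshold.
        metrics[f"delta{i}"] = sum(1 for r in ratios if r < 1.25 ** i) / n
    return metrics


# Toy example: one pixel is off by a factor of two, the other is exact.
result = depth_metrics(pred=[2.0, 4.0], gt=[1.0, 4.0])
print(result)
```

Lower is better for rel, log10 and rms, while higher is better for the threshold accuracies; reported tables typically show the accuracies as percentages.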
Table 2 shows the quantitative comparison of AcED with state-of-the-art methods on the NYU Depth V2 dataset. AcED outperforms the recent state of the art on the majority of metrics. The accuracy of AcED is percentage points better than the second best score. The slightly inferior rms value of AcED can be attributed to spacing-increasing discretization coupled with the long-tailed depth distribution, which leads to increased error in far depth regions. In Fig. 6, the qualitative results show clear benefits of the proposed approach: the depth maps of AcED are visually closer to the ground truth and have smoother depth variations and sharper edges compared to DORN [7].
Table 3 and Fig. 7 respectively show the quantitative and qualitative results of AcED on the iBims-1 dataset [14]. Note that iBims-1 is used only for evaluation and its depth range is considerably different from that of NYU Depth V2. AcED tops the official leaderboard of the iBims-1 benchmark with significant improvement in several metrics. AcED scores better on metrics associated with qualitative aspects, such as DDE, PE and DBE. The DBE of AcED is 17.5% lower than the second best score, indicating high accuracy of depth boundaries. The lower PE and OE values of AcED reveal that it is able to preserve object shapes in the depth map much better than other methods.
It is important to note that [7, 16] perform multiple inferences or combine multiple depth estimates to generate the final depth map. In contrast, AcED performs depth estimation in one forward pass. Furthermore, DORN [7] uses depth discretization levels and training samples, whereas AcED is trained with discretization levels and only training samples. AcED still outperforms DORN [7], which can be attributed to the proposed novel formulations that enable end-to-end optimization of the network.
Finally, Fig. 8 demonstrates the practical utility of AcED in the challenging single-camera bokeh application. AcED was first trained using our in-house synthetic dataset containing realistic human-centric images with dense depth ground-truth. To reduce the computation load for this task, the lightweight MobileNet V2 [24] model was employed as the backbone encoder network. The depth maps generated by AcED on real-life images were combined with our human segmentation mask to apply a realistic bokeh effect with varying background blur. The challenging multi-person use-case in the first row of Fig. 8 shows an impressive bokeh result owing to the accurate and edge-consistent depth map generated by AcED.
5 Conclusion
A novel deep-learning-based two-stage depth estimation model was proposed. This is the first work in the literature to propose a two-stage approach comprising ordinal regression and pixel-wise regression for depth estimation. This work proposed a fully differentiable variant of ordinal regression for depth estimation. A novel confidence map computation method for depth refinement was also proposed. Systematic experiments were performed and the benefits of the proposed novel formulations were evaluated in the ablation study. The proposed model significantly outperformed the recent state-of-the-art methods on challenging benchmark datasets and also achieved top rank on one benchmark. The utility of the proposed model in a challenging practical application was also demonstrated.
References
[1] (2018) Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Techn.
[2] (2018) On regression losses for deep depth estimation. In ICIP.
[3] (2016) Depth from a single image by harmonizing overcomplete local network predictions. In NeurIPS.
[4] Single-image depth perception in the wild. In NeurIPS.
[5] (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV.
[6] (2014) Depth map prediction from a single image using a multi-scale deep network. In NeurIPS.
[7] (2018) Deep ordinal regression network for monocular depth estimation. In CVPR.
[8] (2018) Detail preserving depth estimation from a single image using attention guided networks. In International Conference on 3D Vision (3DV).
[9] (2016) Deep residual learning for image recognition. In CVPR.
[10] (2018) Monocular depth estimation using whole strip masking and reliability-based refinement. In ECCV.
[11] (2018) Squeeze-and-excitation networks. In CVPR.
[12] (2017) Densely connected convolutional networks. In CVPR.
[13] (2015) Adam: A method for stochastic optimization. In ICLR.
[14] (2018) Evaluation of CNN-based single-image depth estimation methods. In ECCV Workshops.
[15] (2016) Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV).
[16] (2018) Single-image depth estimation based on Fourier domain analysis. In CVPR.
[17] (2019) Monocular depth estimation using relative depth maps. In CVPR.
[18] (2017) A two-streamed network for estimating fine-scaled depth maps from single RGB images. In ICCV.
[19] (2018) Deep attention-based classification network for robust depth prediction. In ACCV.
[20] (2018) PlaneNet: piece-wise planar reconstruction from a single RGB image. In CVPR.
[21] (2015) Deep convolutional neural fields for depth estimation from a single image. In CVPR.
[22] GeoNet: geometric neural network for joint depth and surface normal estimation. In CVPR.
[23] (2019) SharpNet: fast and accurate recovery of occluding contours in monocular depth estimation. In ICCV Workshops.
[24] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR.
[25] (2012) Indoor segmentation and support inference from RGBD images. In ECCV.
[26] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
[27] (2019) DISCO: depth inference from stereo using context. In IEEE ICME.
[28] (2018) Synthetic depth-of-field with a single-camera mobile phone. ACM Trans. Graph.
[29] (2017) Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In CVPR.
[30] (2018) Structured attention guided convolutional neural fields for monocular depth estimation. In CVPR.
[31] (2018) LA-net: layout-aware dense network for monocular depth estimation. In ACM Multimedia.
[32] (2015) Learning ordinal relationships for mid-level vision. In ICCV.
[33] (2019) Mean-variance loss for monocular depth estimation. In ICIP.