Monocular depth estimation, which plays a crucial role in understanding 3D scene geometry, is an ill-posed problem. Recent methods have gained significant improvement by exploring image-level information and hierarchical features from deep convolutional neural networks (DCNNs). These methods model depth estimation as a regression problem and train the regression networks by minimizing mean squared error, which suffers from slow convergence and unsatisfactory local solutions. Besides, existing depth estimation networks employ repeated spatial pooling operations, resulting in undesirable low-resolution feature maps. To obtain high-resolution depth maps, skip connections or multi-layer deconvolution networks are required, which complicates network training and consumes much more computation. To eliminate or at least largely reduce these problems, we introduce a spacing-increasing discretization (SID) strategy to discretize depth and recast depth network learning as an ordinal regression problem. By training the network with an ordinal regression loss, our method achieves much higher accuracy and faster convergence simultaneously. Furthermore, we adopt a multi-scale network structure which avoids unnecessary spatial pooling and captures multi-scale information in parallel. The method described in this paper achieves state-of-the-art results on four challenging benchmarks, i.e., KITTI, ScanNet, Make3D, and NYU Depth v2, and won 1st prize in the Robust Vision Challenge 2018. Code has been made available at: https://github.com/hufu6371/DORN.
Estimating depth from 2D images is a crucial step of scene reconstruction and understanding tasks, such as 3D object recognition, segmentation, and detection. In this paper, we examine the problem of Monocular Depth Estimation from a single image (abbr. as MDE hereafter).
Compared to depth estimation from stereo images or video sequences, in which significant progress has been made [21, 31, 28, 46], progress on MDE has been slow. MDE is an ill-posed problem: a single 2D image may be produced by an infinite number of distinct 3D scenes. To overcome this inherent ambiguity, typical methods resort to exploiting statistically meaningful monocular cues or features, such as perspective and texture information, object sizes, object locations, and occlusions [51, 26, 34, 50, 28].
Recently, DCNN-based methods have achieved impressive results on MDE, demonstrating that deep features are superior to handcrafted features. These methods address the MDE problem by learning a DCNN to estimate the continuous depth map. Since this is a standard regression problem, mean squared error (MSE) in log-space or its variants is usually adopted as the loss function. Although optimizing a regression network can achieve a reasonable solution, we find that the convergence is rather slow and the final solution is far from satisfactory.
In addition, existing depth estimation networks usually apply standard DCNNs, designed initially for image classification, in a fully convolutional manner as feature extractors. In these networks, repeated spatial pooling quickly reduces the spatial resolution of feature maps (usually by a stride of 32), which is undesirable for depth estimation. Though high-resolution depth maps can be obtained by incorporating higher-resolution feature maps via multi-layer deconvolutional networks [35, 17, 33], multi-scale networks [40, 11] or skip connections, such processing not only requires additional computational and memory costs, but also complicates the network architecture and the training procedure.
In contrast to existing developments for MDE, we propose to discretize continuous depth into a number of intervals and cast depth network learning as an ordinal regression problem, and we present how to incorporate ordinal regression into a dense prediction task via DCNNs. More specifically, we perform the discretization using a spacing-increasing discretization (SID) strategy rather than the uniform discretization (UD) strategy. This is motivated by the fact that the uncertainty in depth prediction increases with the underlying ground-truth depth: it is better to allow a relatively larger error when predicting a larger depth value, so as to avoid an over-strengthened influence of large depth values on the training process. After obtaining the discrete depth values, we train the network with an ordinal regression loss, which takes into account the ordering of the discrete depth values.
To ease network training and save computational cost, we introduce a network architecture which avoids unnecessary subsampling and captures multi-scale information in a simpler way than skip connections. Inspired by recent advances in scene parsing [62, 4, 6, 64], we first remove subsampling in the last few pooling layers and apply dilated convolutions to obtain large receptive fields. Then, multi-scale information is extracted from the last pooling layer by applying dilated convolutions with multiple dilation rates. Finally, we develop a full-image encoder which captures image-level information at a significantly lower memory cost than fully-connected full-image encoders [2, 12, 11, 37, 30]. The whole network is trained in an end-to-end manner without stage-wise training or iterative refinement. Experiments on four challenging benchmarks, i.e., KITTI, ScanNet, Make3D [51, 50] and NYU Depth v2, demonstrate that the proposed method achieves state-of-the-art results and outperforms recent algorithms by a significant margin.
The remainder of this paper is organized as follows. After a brief review of related literature in Sec. 2, we present the proposed method in detail in Sec. 3. In Sec. 4, besides the qualitative and quantitative performance on the benchmarks, we also evaluate multiple basic instantiations of the proposed method to analyze the effects of its core factors. Finally, we conclude the paper in Sec. 5.
Depth Estimation is essential for understanding the 3D structure of scenes from 2D images. Early works focused on depth estimation from stereo images by developing geometry-based algorithms [52, 14, 13] that rely on point correspondences between images and triangulation to estimate depth. In a seminal work, Saxena et al. learned depth from monocular cues in 2D images via supervised learning. Since then, a variety of approaches have been proposed to exploit monocular cues using handcrafted representations [51, 26, 34, 38, 8, 32, 1, 55, 47, 16, 22, 61]. Since handcrafted features alone can only capture local information, probabilistic graphical models such as Markov Random Fields (MRFs) are often built on top of these features to incorporate long-range and global cues [51, 65, 41]. Another successful way to make use of global cues is the DepthTransfer method, which uses GIST global scene features to search a database of RGBD images for candidates that are “similar” to the input image.
Given the success of DCNNs in image understanding, many depth estimation networks have been proposed in recent years [20, 63, 37, 42, 54, 58, 48, 40, 29]. Thanks to multi-level contextual and structural information from powerful very deep networks (e.g., VGG and ResNet), depth estimation has been boosted to a new level of accuracy [11, 17, 33, 35, 59]. The main hurdle is that the repeated pooling operations in these deep feature extractors quickly decrease the spatial resolution of feature maps (usually by a stride of 32). Eigen et al. [12, 11] applied multi-scale networks which refine the estimated depth map stage-wise from low to high spatial resolution via independent networks. Xie et al. adopted a skip-connection strategy to fuse low-spatial-resolution depth maps in deeper layers with high-spatial-resolution depth maps in lower layers. More recent works [17, 33, 35] apply multi-layer deconvolutional networks to recover depth coarse-to-fine. Rather than relying solely on deep networks, some methods incorporate conditional random fields to further improve the quality of the estimated depth maps [57, 40]. To improve efficiency, Roy and Todorovic proposed the Neural Regression Forest method, which allows for parallelizable training of “shallow” CNNs.
Recently, unsupervised and semi-supervised learning have been introduced to train depth estimation networks [17, 33]. These methods design reconstruction losses to estimate the disparity map by reconstructing the right view from the left view. Also, some weakly-supervised methods considering pairwise ranking information have been proposed to roughly estimate and compare depth [66, 7].
Ordinal Regression [25, 23] aims to learn a rule to predict labels from an ordinal scale. Most works modify well-studied classification algorithms to address ordinal regression. For example, Shashua and Levin handled multiple thresholds by developing a new SVM. Crammer and Singer generalized online perceptron algorithms with multiple thresholds to perform ordinal regression. Another way is to formulate ordinal regression as a set of binary classification subproblems; for instance, Frank and Hall [44] converted an ordinal regression problem into a series of binary classification problems, each deciding whether the label exceeds a given rank.
This section first introduces the architecture of our deep ordinal regression network; then presents the SID strategy to divide continuous depth values into discrete values; and finally details how the network parameters can be learned in the ordinal regression framework.
As shown in Fig. 2, the devised network consists of two parts, i.e., a dense feature extractor and a scene understanding module, and outputs multi-channel dense ordinal labels given an image.
Existing depth estimation networks usually apply standard DCNNs originally designed for image recognition as the feature extractor. However, the repeated combination of max-pooling and striding significantly reduces the spatial resolution of the feature maps. To incorporate multi-scale information and reconstruct high-resolution depth maps, some partial remedies can be adopted, including stage-wise refinement [12, 11], skip connections, and multi-layer deconvolution networks [17, 33, 35], which nevertheless not only require additional computational and memory costs, but also complicate the network architecture and the training procedure. Following recent scene parsing networks [62, 4, 6, 64], we advocate removing the last few downsampling operators of DCNNs and inserting holes into the filters of subsequent layers (i.e., dilated convolution) to enlarge the field-of-view of the filters without decreasing the spatial resolution or increasing the number of parameters.
The scene understanding module consists of three parallel components, i.e., an atrous spatial pyramid pooling (ASPP) module [5, 6], a cross-channel learner, and a full-image encoder. ASPP extracts features from multiple large receptive fields via dilated convolutions with dilation rates of 6, 12 and 18, respectively. The pure convolutional branch learns complex cross-channel interactions. The full-image encoder captures global contextual information and can greatly clarify local confusions in depth estimation [57, 12, 11, 2].
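To make the parallel design concrete, the following is a minimal PyTorch sketch of the ASPP-style branch with the dilation rates named above (6, 12, 18); our implementation is in Caffe, and the channel widths here are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ASPPBranches(nn.Module):
    """Parallel dilated convolutions over a shared feature map.

    Dilation rates (6, 12, 18) follow the text; in_ch/out_ch are
    illustrative placeholders (assumptions).
    """
    def __init__(self, in_ch=512, out_ch=256):
        super().__init__()
        # padding == dilation keeps the spatial resolution unchanged
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in (6, 12, 18)
        ])

    def forward(self, x):
        # each branch covers a different receptive field; fuse by channel concat
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```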
Though previous methods have incorporated full-image encoders, our full-image encoder contains far fewer parameters. As shown in Fig. 3, to obtain a global feature from the dense feature map, a common fc-fashion method uses fully-connected layers, where each element of the global feature connects to all the image features, implying a global understanding of the entire image. However, this method contains a prohibitively large number of parameters, which makes it difficult to train and memory consuming. In contrast, we first use an average pooling layer with a small kernel size and stride to reduce the spatial dimensions, followed by a fc layer to obtain a global feature vector. Then, we treat the feature vector as channels of feature maps with a spatial dimension of 1×1, and add a conv layer with a kernel size of 1×1 as a cross-channel parametric pooling structure. Finally, we copy the resulting feature vector along the spatial dimensions so that each location of the output feature map shares the same understanding of the entire image.
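This structure can be sketched in a few lines. Below is a hedged PyTorch illustration (our actual implementation is in Caffe); the pooling kernel, stride, channel sizes, and the fixed 16×20 input are assumptions chosen only to make the example run.

```python
import torch
import torch.nn as nn

class FullImageEncoder(nn.Module):
    """Sketch of the pooled full-image encoder (cf. Fig. 3).

    Structure follows the text: average pool -> fc -> 1x1 conv ->
    broadcast over spatial locations. All sizes are assumptions and
    presume a fixed 16x20 input feature map.
    """
    def __init__(self, in_ch=512, fc_dim=512, pooled_hw=(4, 5)):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=4, stride=4)        # shrink spatial dims
        self.fc = nn.Linear(in_ch * pooled_hw[0] * pooled_hw[1], fc_dim)
        self.conv1x1 = nn.Conv2d(fc_dim, fc_dim, kernel_size=1)  # cross-channel pooling

    def forward(self, x):
        n, _, h, w = x.shape
        v = self.fc(self.pool(x).flatten(1))   # global feature vector
        v = self.conv1x1(v.view(n, -1, 1, 1))  # treat the vector as a 1x1 feature map
        return v.expand(-1, -1, h, w)          # copy it to every spatial location

# usage: a 512x16x20 feature map pools to 512x4x5 before the fc layer
out = FullImageEncoder()(torch.randn(1, 512, 16, 20))  # -> (1, 512, 16, 20)
```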
The features obtained from the aforementioned components are concatenated to achieve a comprehensive understanding of the input image. We then add two additional convolutional layers with a kernel size of 1×1, where the former reduces the feature dimension and learns complex cross-channel interactions, and the latter transforms the features into multi-channel dense ordinal labels; see the sketch below.
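A minimal sketch of this fusion head follows (PyTorch for illustration; the channel counts and K are placeholder assumptions, and `fused_ch` must equal the sum of the three branch widths):

```python
import torch
import torch.nn as nn

class OrdinalHead(nn.Module):
    """Concatenate the three parallel branches and emit 2K ordinal channels."""
    def __init__(self, fused_ch=1536, mid_ch=512, K=80):
        super().__init__()
        self.reduce = nn.Conv2d(fused_ch, mid_ch, kernel_size=1)  # dim reduction
        self.predict = nn.Conv2d(mid_ch, 2 * K, kernel_size=1)    # dense ordinal labels

    def forward(self, aspp_feat, conv_feat, full_image_feat):
        x = torch.cat([aspp_feat, conv_feat, full_image_feat], dim=1)
        return self.predict(self.reduce(x))
```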
To quantize a depth interval into a set of representative discrete values, a common way is uniform discretization (UD). However, as the depth value becomes larger, the information available for depth estimation becomes less rich, meaning that the estimation error for larger depth values is generally larger. Hence, the UD strategy would induce an over-strengthened loss for large depth values. Instead, we propose to perform the discretization using the SID strategy (as shown in Fig. 4), which uniformly discretizes a given depth interval in log space to down-weight the training losses in regions with large depth values, so that our network predicts relatively small and medium depths more accurately while estimating large depth values rationally. Assuming that a depth interval $[\alpha, \beta]$ needs to be discretized into $K$ sub-intervals, UD and SID can be formulated as:
$$\text{UD: } t_i = \alpha + (\beta - \alpha) \cdot i / K, \qquad \text{SID: } t_i = e^{\log(\alpha) + \frac{\log(\beta / \alpha) \cdot i}{K}},$$

where $t_i \in \{t_0, t_1, \ldots, t_K\}$ are the discretization thresholds. In our paper, we add a shift $\xi$ to both $\alpha$ and $\beta$ to obtain $\alpha^{*} = \alpha + \xi$ and $\beta^{*} = \beta + \xi$ so that $\alpha^{*} = 1.0$, and apply SID on $[\alpha^{*}, \beta^{*}]$.
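Both strategies can be written down directly from the formulas above. A minimal NumPy sketch follows; the range $[0, 80]$ and the shift $\xi = 1$ in the example are assumptions for illustration.

```python
import numpy as np

def discretization_thresholds(alpha, beta, K, mode="SID"):
    """Return the K+1 thresholds t_0..t_K partitioning [alpha, beta].

    UD spaces thresholds uniformly; SID spaces them uniformly in log
    space, so sub-intervals widen with depth (alpha must be positive).
    """
    i = np.arange(K + 1)
    if mode == "UD":
        return alpha + (beta - alpha) * i / K
    return np.exp(np.log(alpha) + np.log(beta / alpha) * i / K)

# example: shift a raw range [0, 80] by xi = 1 so that alpha* = 1.0
xi = 1.0
t = discretization_thresholds(0.0 + xi, 80.0 + xi, K=80, mode="SID")
```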
After obtaining the discrete depth values, it is straightforward to turn the standard regression problem into a multi-class classification problem and adopt a softmax regression loss to learn the parameters of our depth estimation network. However, typical multi-class classification losses ignore the ordered information among the discrete labels, while depth values have a strong ordinal correlation since they form a well-ordered set. Thus, we cast the depth estimation problem as an ordinal regression problem and develop an ordinal loss to learn our network parameters.
Let $\chi = \varphi(I, \Phi)$ denote the feature maps of size $W \times H \times C$ given an image $I$, where $\Phi$ denotes the parameters involved in the dense feature extractor and the scene understanding module. $Y = \psi(\chi, \Theta)$ of size $W \times H \times 2K$ denotes the ordinal outputs at each spatial location, where $\Theta = (\theta_0, \theta_1, \ldots, \theta_{2K-1})$ contains weight vectors, and $l_{(w,h)} \in \{0, 1, \ldots, K-1\}$ is the discrete label produced by SID at spatial location $(w, h)$. Our ordinal loss $\mathcal{L}(\chi, \Theta)$ is defined as the average of the pixelwise ordinal loss $\Psi(w, h, \chi, \Theta)$ over the entire image domain:
$$\mathcal{L}(\chi, \Theta) = -\frac{1}{\mathcal{N}} \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} \Psi(w, h, \chi, \Theta),$$

$$\Psi(w, h, \chi, \Theta) = \sum_{k=0}^{l_{(w,h)}-1} \log \mathcal{P}^k_{(w,h)} + \sum_{k=l_{(w,h)}}^{K-1} \log\left(1 - \mathcal{P}^k_{(w,h)}\right), \qquad \mathcal{P}^k_{(w,h)} = P\left(\hat{l}_{(w,h)} > k \mid \chi, \Theta\right),$$

where $\mathcal{N} = W \times H$, and $\hat{l}_{(w,h)}$ is the estimated discrete value decoded from $Y$. We choose the softmax function to compute $\mathcal{P}^k_{(w,h)}$ from the paired outputs $y_{(w,h,2k)}$ and $y_{(w,h,2k+1)}$ as follows:
$$\mathcal{P}^k_{(w,h)} = \frac{e^{y_{(w,h,2k+1)}}}{e^{y_{(w,h,2k)}} + e^{y_{(w,h,2k+1)}}},$$

where $y_{(w,h,2k)} = \theta_{2k}^{T} \chi_{(w,h)}$ and $y_{(w,h,2k+1)} = \theta_{2k+1}^{T} \chi_{(w,h)}$. Minimizing $\mathcal{L}(\chi, \Theta)$ ensures that predictions farther from the true label incur a greater penalty than those closer to the true label.
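For concreteness, the loss above can be written with standard tensor operations. A hedged PyTorch sketch follows, assuming the network outputs are laid out as $K$ consecutive $(2k, 2k+1)$ channel pairs:

```python
import torch
import torch.nn.functional as F

def ordinal_loss(y, labels):
    """Pixelwise ordinal regression loss (sketch of the equations above).

    y:      (N, 2K, H, W) ordinal outputs, channel pair (2k, 2k+1) scoring l > k
    labels: (N, H, W) integer SID labels in {0, ..., K-1}
    """
    n, two_k, h, w = y.shape
    k = two_k // 2
    # softmax over each (y_2k, y_2k+1) pair gives P(l > k)
    prob_gt = F.softmax(y.view(n, k, 2, h, w), dim=2)[:, :, 1]  # (N, K, H, W)
    # mask[k] = 1 exactly where the true label exceeds rank k
    ranks = torch.arange(k, device=y.device).view(1, k, 1, 1)
    mask = (labels.unsqueeze(1) > ranks).float()
    eps = 1e-8  # numerical safety for the logs
    nll = -(mask * torch.log(prob_gt + eps)
            + (1.0 - mask) * torch.log(1.0 - prob_gt + eps))
    return nll.sum(dim=1).mean()  # sum over ranks, average over pixels
```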
The minimization of $\mathcal{L}(\chi, \Theta)$ can be done via an iterative optimization algorithm. Taking the derivative with respect to $\chi_{(w,h)}$, the gradient takes the following form:

$$\frac{\partial \mathcal{L}(\chi, \Theta)}{\partial \chi_{(w,h)}} = \frac{1}{\mathcal{N}} \sum_{k=0}^{K-1} \left(\mathcal{P}^k_{(w,h)} - \eta\left(k < l_{(w,h)}\right)\right)\left(\theta_{2k+1} - \theta_{2k}\right),$$

where $\eta(\cdot)$ is an indicator function such that $\eta(\text{true}) = 1$ and $\eta(\text{false}) = 0$. We can then optimize our network via backpropagation.
In the inference phase, after obtaining the ordinal labels for each position of image $I$, the predicted depth value $\hat{d}_{(w,h)}$ is decoded as:

$$\hat{d}_{(w,h)} = \frac{t_{\hat{l}_{(w,h)}} + t_{\hat{l}_{(w,h)}+1}}{2} - \xi, \qquad \hat{l}_{(w,h)} = \sum_{k=0}^{K-1} \eta\left(\mathcal{P}^k_{(w,h)} \geq 0.5\right).$$
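The decoding rule is equally short. A sketch under the same layout assumptions, with `prob_gt` as produced by the loss sketch and `thresholds` from the SID sketch:

```python
import torch

def decode_depth(prob_gt, thresholds, xi=0.0):
    """Decode per-pixel depth: count P(l > k) >= 0.5, take the interval midpoint.

    prob_gt:    (N, K, H, W) probabilities P(l > k)
    thresholds: 1-D tensor of the K+1 SID thresholds on the shifted range
    """
    l_hat = (prob_gt >= 0.5).sum(dim=1)              # (N, H, W), values in {0..K}
    l_hat = l_hat.clamp(max=thresholds.numel() - 2)  # keep t_{l+1} in range
    return 0.5 * (thresholds[l_hat] + thresholds[l_hat + 1]) - xi
```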
[Tabs. 1 and 2: results on the online ScanNet and KITTI evaluation servers; the reported metrics are abs rel, imae, irmse, log mae, log rmse, mae, rmse, scale invariant, and sq. rel. The table body did not survive extraction.]
Tab. 3: depth estimation results on KITTI (δ accuracies: higher is better; error metrics: lower is better).

| Method | cap | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Abs Rel | Squa Rel | RMSE | RMSE log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Make3D | 0-80 m | 0.601 | 0.820 | 0.926 | 0.280 | 3.012 | 8.734 | 0.361 |
| Eigen et al. | 0-80 m | 0.692 | 0.899 | 0.967 | 0.190 | 1.515 | 7.156 | 0.270 |
| Liu et al. | 0-80 m | 0.647 | 0.882 | 0.961 | 0.217 | 1.841 | 6.986 | 0.289 |
| LRC (CS + K) | 0-80 m | 0.861 | 0.949 | 0.976 | 0.114 | 0.898 | 4.935 | 0.206 |
| Kuznietsov et al. | 0-80 m | 0.862 | 0.960 | 0.986 | 0.113 | 0.741 | 4.621 | 0.189 |
| DORN (VGG) | 0-80 m | 0.915 | 0.980 | 0.993 | 0.081 | 0.376 | 3.056 | 0.132 |
| DORN (ResNet) | 0-80 m | 0.932 | 0.984 | 0.994 | 0.072 | 0.307 | 2.727 | 0.120 |
| Garg et al. | 0-50 m | 0.740 | 0.904 | 0.962 | 0.169 | 1.080 | 5.104 | 0.273 |
| LRC (CS + K) | 0-50 m | 0.873 | 0.954 | 0.979 | 0.108 | 0.657 | 3.729 | 0.194 |
| Kuznietsov et al. | 0-50 m | 0.875 | 0.964 | 0.988 | 0.108 | 0.595 | 3.518 | 0.179 |
| DORN (VGG) | 0-50 m | 0.920 | 0.982 | 0.994 | 0.079 | 0.324 | 2.517 | 0.128 |
| DORN (ResNet) | 0-50 m | 0.936 | 0.985 | 0.995 | 0.071 | 0.268 | 2.271 | 0.116 |
Tab. 4: C1 and C2 errors on Make3D (Abs Rel, log10, RMS; lower is better).

| Method | C1 Abs Rel | C1 log10 | C1 RMS | C2 Abs Rel | C2 log10 | C2 RMS |
| --- | --- | --- | --- | --- | --- | --- |
| Liu et al. | - | - | - | 0.379 | 0.148 | - |
| Liu et al. | 0.335 | 0.137 | 9.49 | 0.338 | 0.134 | 12.60 |
| Li et al. | 0.278 | 0.092 | 7.12 | 0.279 | 0.102 | 10.27 |
| Liu et al. | 0.287 | 0.109 | 7.36 | 0.287 | 0.122 | 14.09 |
| Roy et al. | - | - | - | 0.260 | 0.119 | 12.40 |
| Laina et al. | 0.176 | 0.072 | 4.46 | - | - | - |
| Kuznietsov et al. | 0.421 | 0.190 | 8.24 | - | - | - |
To demonstrate the effectiveness of our depth estimator, we present a number of experiments examining different aspects of our approach. After introducing the implementation details, we evaluate our method on four challenging datasets, i.e., KITTI, Make3D [50, 51], NYU Depth v2, and ScanNet. The evaluation metrics follow previous works [12, 40]. Ablation studies based on KITTI are also discussed to give a more detailed analysis of our method.
Tab. 5: comparisons on NYU Depth v2 (δ accuracies: higher is better; rel, log10, rms: lower is better).

| Method | δ < 1.25 | δ < 1.25² | δ < 1.25³ | rel | log10 | rms |
| --- | --- | --- | --- | --- | --- | --- |
| Liu et al. | - | - | - | 0.335 | 0.127 | 1.06 |
| Ladicky et al. | 0.542 | 0.829 | 0.941 | - | - | - |
| Li et al. | 0.621 | 0.886 | 0.968 | 0.232 | 0.094 | 0.821 |
| Wang et al. | 0.605 | 0.890 | 0.970 | 0.220 | - | 0.824 |
| Roy et al. | - | - | - | 0.187 | - | 0.744 |
| Liu et al. | 0.650 | 0.906 | 0.976 | 0.213 | 0.087 | 0.759 |
| Eigen et al. | 0.769 | 0.950 | 0.988 | 0.158 | - | 0.641 |
| Chakrabarti et al. | 0.806 | 0.958 | 0.987 | 0.149 | - | 0.620 |
| Laina et al. | 0.629 | 0.889 | 0.971 | 0.194 | 0.083 | 0.790 |
| Li et al. | 0.789 | 0.955 | 0.988 | 0.152 | 0.064 | 0.611 |
| Laina et al. | 0.811 | 0.953 | 0.988 | 0.127 | 0.055 | 0.573 |
| Li et al. | 0.788 | 0.958 | 0.991 | 0.143 | 0.063 | 0.635 |
We implement our depth estimation network on the public deep learning platform Caffe. The learning strategy applies a polynomial decay of the base learning rate, with momentum and weight decay fixed throughout training. The iteration number is set to 300K for KITTI, 50K for Make3D, and 3M for NYU Depth v2, and the batch size is set to 3. We find that further increasing the iteration number only slightly improves performance. We adopt both VGG-16 and ResNet-101 as our feature extractors, and initialize their parameters via classification models pre-trained on ILSVRC. Since features in the first few layers only contain general low-level information, we fix the parameters of the conv1 and conv2 blocks in ResNet after initialization. Also, the batch normalization parameters of ResNet are directly initialized and fixed during training. Data augmentation strategies follow previous work. In the test phase, we split each image into overlapping windows according to the cropping method used in the training phase, and obtain the predicted depth values in overlapped regions by averaging the predictions.
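For reference, the polynomial decay schedule amounts to a one-liner. In the sketch below, the base learning rate of 1e-4 and power of 0.9 are placeholder assumptions, since the exact values were lost in extraction.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomial learning-rate decay: lr = base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1.0 - iteration / float(max_iter)) ** power

# e.g. halfway through 300K KITTI iterations with an assumed base_lr of 1e-4
lr = poly_lr(1e-4, iteration=150_000, max_iter=300_000)
```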
[Tab. 6: ablation results — columns: Variant, Iteration, accuracy (higher is better), and Abs Rel / Squa Rel / RMSE (lower is better). The table body did not survive extraction.]
KITTI The KITTI dataset contains outdoor scenes with images of resolution about 375×1241, captured by cameras and depth sensors mounted on a driving car. All 61 scenes from the “city”, “residential”, “road” and “campus” categories are used as our training/test sets. We test on 697 images from the 29 scenes split by Eigen et al., and train on about 23488 images from the remaining 32 scenes. We train our model on random crops. As for other details, we set the maximum ordinal label for KITTI to 80, and evaluate our results on a pre-defined center crop following previous work, with the depth ranging from 0 m to 80 m and from 0 m to 50 m. Note that a single model is trained on the full depth range and tested on data with different depth ranges.
Make3D The Make3D dataset [50, 51] contains 534 outdoor images, 400 for training and 134 for testing, with a resolution of 2272×1704, and provides ground-truth depth maps at a low resolution of 55×305. We downsample all images and train our model on random crops. Following previous works, we report the C1 (depth range up to 70 m) and C2 (depth range up to 80 m) errors on this dataset using three commonly used evaluation metrics [28, 40]. For the VGG model, we train our DORN over the full depth range starting from the ImageNet model, and evaluate the same model for both C1 and C2. For ResNet, however, we learn two separate models for C1 and C2, respectively.
NYU Depth v2 The NYU Depth v2 dataset contains 464 indoor video scenes captured with a Microsoft Kinect camera. We train our DORN using all images (about 120K) from the 249 training scenes, and test on the 694-image test set following previous works. To speed up training, all images are downsampled from the original resolution of 640×480, and the model is trained on random crops. We report our scores on the pre-defined center crop used by Eigen et al.
ScanNet The ScanNet dataset is also a challenging benchmark containing various indoor scenes. We train our model on the officially provided 24353 training and validation images with random crops, and evaluate our method on the ScanNet online test server.
Performance Tab. 3 and Tab. 4 give the results on two outdoor datasets, i.e., KITTI and Make3D. It can be seen that our DORN improves the accuracy by a significant margin in terms of all metrics compared with previous works in all settings. Some qualitative results are shown in Fig. 5 and Fig. 6. In Tab. 5, our DORN outperforms other methods on NYU Depth v2, which is one of the largest indoor benchmarks. The results suggest that our method is applicable to both indoor and outdoor data. We also evaluate our method on the online KITTI evaluation server and the online ScanNet evaluation server. As shown in Tabs. 1 and 2, our DORN significantly outperforms the officially provided baselines.
Depth discretization is critical to the performance improvement, because it allows us to apply classification and ordinal regression losses to optimize the network parameters. According to the scores in Tab. 6, training by regression on continuous depth converges to a poorer solution than the other two methods, and our ordinal regression network achieves the best performance. There is an obvious gap between the approaches in which depth is discretized by SID and by UD, respectively. Besides, when replacing our ordinal regression loss with an advanced regression loss (i.e., BerHu), our DORN still obtains much higher scores. Thus, we can conclude that: (i) SID is important and further improves the performance compared to UD; (ii) discretizing depth and training with a multi-class classification loss is better than training with regression losses; (iii) exploiting the ordinal correlation among depths drives depth estimation networks to converge to even better solutions.
Furthermore, we also train the network using MSE on the discrete depth values obtained by SID, and report the results in Tab. 6. We can see that MSE-SID performs slightly better than MSE, which demonstrates that quantization errors are nearly negligible in depth estimation. The benefits of discretization, realized through the ordinal regression loss, far exceed the cost of depth discretization.
[Tab. 7: ablation on the full-image encoder; the only recovered row reads — w/o full-image encoder: 0.906 / 0.092 / 0.143 with 0M extra parameters.]
From Tab. 7, a full-image encoder is important to further boost the performance. Our full-image encoder yields slightly higher scores than fc-fashion encoders [2, 12, 11, 37, 30], but significantly reduces the number of parameters. For the comparison in Fig. 3, the relevant dimensions are set to 512 (VGG feature channels), 512 (encoder output), 2048 (the fc dimension, following Eigen et al. [12, 11]), and a pooling kernel size of 4. Because of limited computational resources, when implementing the fc-fashion encoder, we downsampled the feature maps with a stride of 3 and upsampled the output to the required resolution. Counting the parameters of both designs under these settings, the fc-fashion encoder requires orders of magnitude more parameters than our encoder. From the experimental results and the parameter analysis, it can be seen that our full-image encoder performs better while requiring fewer computational resources.
To illustrate the sensitivity to the number of intervals, we discretize depth into various numbers of intervals via SID. As shown in Fig. 7, with the number of intervals ranging from 40 to 120, the accuracy and RMSE of our DORN remain stable, so the method is robust over a wide range of depth interval numbers. We can also see that neither too few nor too many depth intervals are rational for depth estimation: too few depth intervals cause large quantization errors, while too many depth intervals lose the advantage of discretization.
In this paper, we have developed a deep ordinal regression network (DORN) for monocular depth estimation (MDE) from a single image, consisting of a clean CNN architecture and effective strategies for network optimization. Our method is motivated by two observations: (i) to obtain high-resolution depth maps, previous depth estimation networks need to incorporate multi-scale features as well as full-image features in a complex architecture, which complicates network training and largely increases the computational cost; (ii) training a regression network for depth estimation suffers from slow convergence and unsatisfactory local solutions. To this end, we first introduced a simple depth estimation network which takes advantage of dilated convolutions and a novel full-image encoder to directly obtain a high-resolution depth map. Moreover, an effective depth discretization strategy and an ordinal regression training loss were integrated to improve the training of our network and thereby largely increase the estimation accuracy. The proposed method achieves state-of-the-art performance on the KITTI, ScanNet, Make3D and NYU Depth v2 datasets. In the future, we will investigate new approximations to depth and extend our framework to other dense prediction problems.
This research was supported by Australian Research Council Projects FL-170100117 and DP-180103424. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.