I Introduction
Estimating depth from 2D images is a key component of scene reconstruction and understanding tasks, such as 3D recognition, tracking, segmentation and detection [4, 5, 6, 7, 3, 8, 9, 10, 11, 12]. In this paper, we examine the problem of Monocular Depth Estimation (abbreviated as MDE hereafter), namely the estimation of the depth map from a single image.
Compared to depth estimation from stereo images or video sequences, in which significant progress has been made [13, 14, 15, 16, 10, 17, 18, 19], progress in MDE has been slow. MDE is fundamentally an ill-posed problem: a single 2D image may be produced from an infinite number of distinct 3D scenes. Fortunately, the 2D image and the depth map are correlated, suggesting that the depth can still be predicted with considerable accuracy.
To overcome this inherent ambiguity, typical methods resort to exploiting statistically meaningful monocular cues or features, such as perspective and texture information, object sizes, object locations, and occlusions. Previous methods used handcrafted features for depth estimation [3, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], but since the handcrafted features alone can only capture local information, probabilistic graphic models [30, 3] or depth transfer methods [18] have been introduced to incorporate long range global cues.
Buoyed by the success of deep convolutional neural networks (DCNNs) in object recognition and detection, several recent works have improved MDE performance by a large margin with DCNN-based models [31, 32, 33, 34, 35, 36], demonstrating that deep features are superior to handcrafted features. The main advantage of DCNNs is that the hierarchical representations in a DCNN capture both local and global information. A state-of-the-art method [34] exploits a multi-scale network which first learns to predict a coarse depth map using global information and then refines it with another network using local information to produce a fine depth map.
Existing methods address the MDE problem by learning a CNN to estimate the continuous depth map. Since this is a standard regression problem, existing methods usually adopt the root mean squared error (RMSE) in log space as the loss function. Although training with RMSE can achieve a reasonable solution when predicting a low-resolution depth map, we find that the optimization tends to be difficult when we train networks to predict high-resolution continuous maps. In this case, the stochastic gradient descent (SGD) optimization method usually produces a local solution with unsatisfactory training error.
We hypothesize that a compromise between spatial and depth resolution can make the optimization easier, which we refer to as the "compromise principle" in this paper. According to the compromise principle, we avoid directly estimating a high spatial resolution continuous depth map; instead, we first estimate depth maps with reduced spatial or depth resolution. To reduce the depth resolution, we propose to transform the regression problem into a classification problem by discretizing the depth values into intervals. Training the network with a classification loss achieves a lower RMSE on the training data than training with the RMSE loss itself. Likewise, a low spatial resolution continuous map can be learned with considerable accuracy. Based on this principle, we develop a regression-classification cascaded network (RCCN) which consists of two branches: 1) a regression branch that predicts a low spatial resolution continuous depth map from fully-connected layers, capable of capturing global scene information; and 2) a classification branch that predicts high spatial resolution discrete depth maps from convolutional layers, preserving finer spatial information. The two branches form a cascaded structure and are learned jointly in an end-to-end fashion, which allows the classification and regression branches to benefit from each other. After the RCCN learning stage, a refinement network and a fusion network are appended to refine the discrete depth map to a higher spatial resolution. Our network achieves competitive or state-of-the-art performance on NYU Depth V2 [1], KITTI [2], and Make3D [30, 3], three challenging benchmarks commonly used for MDE.
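The core idea of reducing depth resolution, turning depth regression into classification over intervals, can be sketched in a few lines. This is an illustration only, using uniform intervals and hypothetical function names; the paper's own discretization (SID) is introduced later.

```python
import numpy as np

def depth_to_class(depth, alpha, beta, m):
    """Quantize continuous depth in [alpha, beta] into m discrete labels
    (uniform intervals here, purely for illustration)."""
    idx = (depth - alpha) / (beta - alpha) * m
    return np.clip(idx.astype(int), 0, m - 1)

def class_to_depth(label, alpha, beta, m):
    """Decode a label back to the mid-point of its sub-interval."""
    return alpha + (label + 0.5) * (beta - alpha) / m

d = np.array([0.7, 3.2, 9.9])            # continuous depth targets (meters)
c = depth_to_class(d, 0.0, 10.0, 20)     # classification targets
d_hat = class_to_depth(c, 0.0, 10.0, 20)
# the reconstruction error is bounded by half the interval width,
# which is exactly the depth resolution given up by the compromise
assert np.all(np.abs(d_hat - d) <= 0.25)
```

The bounded decoding error makes the trade-off explicit: the classification branch gives up depth resolution (half an interval width) in exchange for an easier optimization problem.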
II Related Work
Depth estimation is an important part of understanding the 3D structure of scenes from 2D images. Most prior works focused on estimating depth from stereo images by developing geometry-based algorithms [37, 38] that rely on point correspondences between images and triangulation to estimate the depth. Given accurate point correspondences, depth can be estimated deterministically from stereo images. Thus, stereo depth estimation has benefited greatly from advances in local feature matching and dense optical flow estimation techniques.
However, geometry-based depth estimation algorithms for stereo images ignore the monocular cues in 2D images, which can also be used for depth estimation. In a seminal work [30], Saxena et al. learned depth from monocular cues in 2D images via supervised learning. Since then, a variety of approaches have been proposed to exploit monocular cues using handcrafted representations [3, 20, 21]. Since handcrafted features alone can only capture local information, probabilistic graphical models such as Markov Random Fields (MRFs) are often built on these features to incorporate long-range and global cues [3, 39, 40]. Another successful way to make use of global cues is the DepthTransfer method [18], which uses GIST global scene features [41] to search a database of RGBD images for candidates "similar" to the input image. A warping procedure based on SIFT Flow [42] is then applied to each candidate image and its corresponding depth map to align them with the input image.
Given the success of DCNNs in image understanding [43, 44, 45, 46], several DCNN-based depth prediction frameworks have been proposed in recent years [47, 32, 48, 36]. Xie et al. [49] predicted the disparity map by adopting multi-level convolutional features to recover a right view from a left view. Garg et al. [47] proposed an unsupervised framework to learn a deep depth-estimation neural network. Liu et al. [31] jointly explored the capacity of DCNNs and continuous CRFs in a unified deep structured network. Moreover, Wang et al. [32] captured depth values and semantic information in a scene with DCNNs, and integrated them in a two-layer hierarchical CRF to jointly predict depth values and semantic labels. To improve efficiency, Roy and Todorovic [33] proposed the Neural Regression Forest method, which allows parallelizable training of "shallow" CNNs. To further incorporate global information in DCNNs, Wang et al. [50] proposed a method for surface normal prediction in which two independent networks are learned to exploit global and local information respectively. Eigen et al. [51, 34] proposed a multi-scale network that first learns to predict depth at a coarse scale and then refines it with another network to produce fine-scale depth maps. Further, Laina et al. [52] adopted a deeper network to learn better image representations for depth estimation.
Most recently, motivated by the limited availability of labeled samples and the expense of human annotation, several impressive unsupervised and semi-supervised methods [47, 49, 48, 36, 53, 54] were developed by posing monocular depth estimation as an image reconstruction problem. For example, Xie et al. [49], Garg et al. [47] and Godard et al. [48] address the problem of novel view synthesis, and design reconstruction losses that estimate the disparity map by recovering a right view from a left view. Furthermore, Kuznietsov et al. [36] incorporated extra supervision via ground truth depth into the aforementioned unsupervised frameworks to improve training. Also, Zoran et al. [53] and Chen et al. [54] optimized depth estimation networks on patch pairs selected from images via pairwise ranking losses, where the ordinal annotation of a pair indicates only closer, farther, or equal.
Our RCCN method also explores global and local representations in the network for depth prediction in a supervised manner. However, our network architecture is motivated by exploiting the compromise between spatial and depth resolutions. Thus, instead of designing a stage-wise refinement procedure as in [51], we introduce a cascaded structure that, in an end-to-end fashion, learns a low spatial resolution but high depth resolution continuous depth map via regression and a high spatial resolution but low depth resolution discrete depth map via classification. In addition, to exploit large receptive fields and to alleviate the information loss caused by the downsampling operation, we introduce dilated convolution [55] into the discrete depth estimation branch. Finally, we employ the deconvolution technique [56] as a bridge between the different branches to balance the feature channels from each branch.
III Approach
Given an input (indoor or outdoor) image, we aim to predict its depth map by exploiting the compromise between spatial and depth resolution. The key component of our approach is the proposed regression-classification cascaded network (RCCN). RCCN explicitly models a high spatial resolution discrete depth map by discretizing the possible depth interval into a set of discrete values, and formulates its estimation as a multi-class classification problem, together with the regression of a low spatial resolution continuous depth map within the same deep architecture. The obtained discrete depth map is further refined to a higher spatial resolution via a refinement network. We describe the detailed architecture of these two networks in Fig. 1 to clearly show their structures and connections. Last but not least, the depth maps of all three scales are jointly considered within a fusion network, so as to produce the final continuous depth map. (In practice, if the resolution of the depth map output by the deep model is lower than that of the input, a classic interpolation method, e.g., linear interpolation, can be used to obtain the depth map at full resolution.) We present the three networks in detail below. (We introduce our approach based on the network setting used in our experiment on the NYU Depth V2 dataset, with the associated parameters provided in Tab. I, so as to facilitate the understanding of the whole approach. The network setting and parameters can be replaced by other appropriate ones.)

RCCN, shared layers and regression branch:

| layer |  |  |  |  |  |  |  |  | De |
| convs | 2 | 2 | 3 | 3 | 3 |  |  |  | 1 |
| chans | 64 | 128 | 256 | 512 | 512 | 2048 | 2048 | 1 | 512 |
| kernel | 3 | 3 | 3 | 3 | 3 |  |  |  | 4 |
| dilat |  |  |  |  |  |  |  |  |  |
| ratio | /2 | /4 | /8 | /16 | /32 |  |  | /16 | /8 |

RCCN, classification branch:

| layer |  |  |  |  |  | Co |  |  |  | De |
| convs | 2 | 2 | 3 | 3 | 3 |  | 1 | 1 | 1 | 1 |
| chans | 64 | 128 | 256 | 512 | 512 |  | 2048 | 2048 | M | 256 |
| kernel | 3 | 3 | 3 | 3 | 3 |  | 3 | 1 | 1 | 4 |
| dilat |  |  |  |  | 2 |  |  |  |  |  |
| ratio | /2 | /4 | /8 | /8 | /8 | /8 | /8 | /8 | /8 | /4 |

Refinement network:

| layer |  |  |  | Co |  |  |  |  |
| convs | 2 | 2 | 3 |  | 1 | 1 | 1 | 1 |
| chans | 64 | 128 | 256 |  | 1024 | 1024 | 1024 | M |
| kernel | 3 | 3 | 3 |  | 3 | 3 | 1 | 1 |
| ratio | /4 | /4 | /4 | /4 | /4 | /4 | /4 | /4 |

Parameters and neurons of the proposed network for the NYU Depth V2 dataset, based on VGG. The first rows correspond to the layers shared by the two branches in RCCN, followed by the layers of the continuous/discrete depth estimation branches and of the refinement network. Co: concatenation layer. De: deconvolutional layer. M: the number of predefined sub-intervals.

III-A Regression-Classification Cascaded Network
The Regression-Classification Cascaded Network (RCCN) is a joint regressor-classifier. It serves as a two-tier estimator that simultaneously predicts the initial continuous depth map and the discrete depth map. We adopt a two-stage cascaded network to model and implement the regression-classification network, aiming to exploit the compromise between low spatial resolution continuous depth and high spatial resolution discrete depth. We also incorporate global scene information from the entire image, as well as structural and contextual information in a large receptive field.
Regressing Continuous Depth: In this stage, the network aims to predict the low spatial resolution continuous depth map from a global understanding of the entire image, by abstracting a representative feature vector from the whole image field of view. From such a representation, we learn specific nonlinear functions for all pixels located at a predefined resolution. To this end, on top of the shared convolutional layers, additional convolutional layers and max-pooling layers with downsampling are used to obtain deeper convolutional features at a coarse resolution. Then, after the pass of two fully-connected layers, the feature vector contains high-level information about the whole input image. A third fully-connected layer follows, in which each output represents the depth value of a spatial location within the predefined resolution and connects to all elements of the preceding feature vector, implying a global understanding of the entire image. This stage is supervised by manually labeled continuous depth values over the whole input image at stride 8, via the root mean squared error in log space (RMSE log). More specifically, we reshape the output vector into a map to obtain the predicted depth map at this stage and compare it with the target depth map.

It should be noted that the RMSE-log loss uses a log function to down-weight the losses in regions with large depth values, and is commonly used as an evaluation metric as well. From a statistical view, consider observed depths generated from the true depth with uniformly distributed multiplicative noise. It is easy to observe that the noise variance of the observation is larger when the true depth is larger, implying that the observed depth value has a larger noise variance when its ground truth is larger. Hence, without the log transform, large depth values would exert an over-strengthened influence on the training process, which is not desirable. The same observation motivates the SID method (instead of uniform quantization) for classification below, which quantizes depth values with increasing intervals and whose advantage is quantitatively evaluated.

Categorizing Approximate Depth: This stage categorizes each pixel into one of the predefined discrete depths at a higher spatial resolution, taking the shared convolutional features and the previous-stage depth map as inputs.
Specifically, in order to better exploit the compromise between spatial and depth resolution as well as the geometric context and physical properties of the image, we adopt a cascaded structure in which the previous-stage depth map is fed into the classification network. The previous-stage depth map is spatially coarse (low spatial resolution) but provides a global-field-of-view understanding of the input image, while the shared convolutional features contain finer spatial information. Therefore, on top of the shared convolutional layers, additional convolutional layers are used to obtain local structural and contextual information. In contrast to the regression branch, we skip the subsampling operation in the max-pooling layers and employ the dilated convolution technique [55], which inserts zeros into the convolution kernel to enlarge its field of view and thereby exploits large-receptive-field information in the fine-resolution feature map, as shown in Fig. 2. The predicted continuous depth map is simultaneously deconvolved into multi-channel feature maps with the same spatial resolution; this deconvolution to multi-channel features is an important component for balancing the features from the different branches. After a concatenation layer, three extra convolutional layers are applied to learn a richer representation and to model the probabilities of the depth sub-intervals to which each pixel belongs.
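The effect of dilation is easiest to see in one dimension: the same three-tap kernel covers a wider span of the input without any pooling. A minimal sketch (a direct loop implementation, not the paper's actual layers):

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """'Valid' 1-D convolution with gaps of `dilation - 1` samples
    between kernel taps.

    A kernel of size k with dilation d covers a receptive field of
    d * (k - 1) + 1 input samples, while the output stays dense:
    no subsampling is involved, unlike enlarging the field via pooling.
    """
    k = len(w)
    span = dilation * (k - 1) + 1
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
y1 = dilated_conv1d(x, w, dilation=1)  # receptive field 3
y2 = dilated_conv1d(x, w, dilation=2)  # receptive field 5, same kernel size
```

With dilation 2 the kernel sees every other sample over a span of 5, which is how the classification branch keeps a fine-resolution feature map yet still aggregates context over a large neighborhood.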
This stage is supervised by the predefined discrete depths like semantic labels in segmentation tasks. We minimize the multinomial logistic loss to learn the network parameters.
As for discretization strategies, uniform discretization (UD) is a common way to obtain a set of representative values from a depth interval. However, considering that an interval of fixed width matters less as the depth ranges from small to large, we propose to use the following spacing-increasing discretization (SID) strategy, so that the learned model pays more attention to estimating relatively small depths:
(1) 
where M denotes the predefined number of sub-intervals.
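The exact form of Eq. (1) did not survive extraction, so the sketch below uses a common log-space instantiation of spacing-increasing discretization (an assumption, not necessarily the paper's exact formula), with alpha and beta the endpoints of the depth interval and m the number of sub-intervals:

```python
import numpy as np

def ud_thresholds(alpha, beta, m):
    """Uniform discretization: m equal-width sub-intervals of [alpha, beta]."""
    return alpha + (beta - alpha) * np.arange(m + 1) / m

def sid_thresholds(alpha, beta, m):
    """Spacing-increasing discretization (log-space form, assumed):
    sub-interval widths grow with depth, so small depths are
    quantized finely and large depths coarsely."""
    return np.exp(np.log(alpha) + np.log(beta / alpha) * np.arange(m + 1) / m)

t = sid_thresholds(1.0, 80.0, 8)
widths = np.diff(t)
# the widths strictly increase with depth, unlike UD
assert np.all(np.diff(widths) > 0)
```

Under this form the thresholds are uniform in log space, which matches the motivation above: the quantization error of SID grows with depth in the same way the observation noise does.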
| Method | δ<1.25 | δ<1.25² | δ<1.25³ | Abs Rel | Squa Rel | RMSE | RMSE log |
(first three columns: higher is better; last four: lower is better)
| Make3D [3] | 0.601 | 0.820 | 0.926 | 0.280 | 3.012 | 8.734 | 0.361 |
| Eigen et al. [51] | 0.692 | 0.899 | 0.967 | 0.190 | 1.515 | 7.156 | 0.270 |
| Liu et al. [57] | 0.647 | 0.882 | 0.961 | 0.217 | 1.841 | 6.986 | 0.289 |
| LRC (CS + K) [48] | 0.861 | 0.949 | 0.976 | 0.114 | 0.898 | 4.935 | 0.206 |
| Kuznietsov et al. [36] | 0.862 | 0.960 | 0.986 | 0.113 | 0.741 | 4.621 | 0.189 |
| RCCN-VGG | 0.870 | 0.970 | 0.993 | 0.110 | 0.620 | 4.029 | 0.160 |
| RCCN-VGG | 0.886 | 0.975 | 0.994 | 0.105 | 0.540 | 3.903 | 0.154 |
| RCCN-ResNet | 0.911 | 0.979 | 0.993 | 0.084 | 0.386 | 3.072 | 0.136 |
III-B Learning
Denote by the regression features the vector output by the fully-connected layers of the regression branch, which depend on the parameters shared across the convolutional blocks and the parameters specific to the regression layers. Denote by the classification features the maps output by the classification branch, which depend on the parameters of its convolutional layers and of the deconvolutional layer bridging the two branches. From the former, the regression stage predicts the continuous depth map; from the latter, the classification stage outputs category scores over the depth sub-intervals for each spatial location. The loss function for our RCCN takes the form:
(2) 
where the indicator function selects the ground-truth sub-interval at each pixel, M is the number of sub-intervals, and the ground-truth discrete depth value is defined at every spatial location. The softmax regression for the classification stage computes the sub-interval probabilities as:
(3) 
where , and .
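Since the paper's symbols did not survive extraction, the sketch below uses hypothetical names (scores, labels) for the per-pixel softmax of Eq. (3) and the multinomial logistic loss described above:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over one pixel's sub-interval scores."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def multinomial_logistic_loss(scores, labels):
    """Mean negative log-likelihood of the ground-truth depth sub-interval.

    scores: (num_pixels, M) classification outputs.
    labels: (num_pixels,) ground-truth sub-interval indices.
    """
    losses = [-np.log(softmax(s)[y]) for s, y in zip(scores, labels)]
    return float(np.mean(losses))

scores = np.array([[4.0, 1.0, 0.0],   # confident, correct pixel
                   [0.0, 0.0, 0.0]])  # maximally uncertain pixel
labels = np.array([0, 2])
loss = multinomial_logistic_loss(scores, labels)
```

The gradient of this loss with respect to the scores is the familiar "probability minus one-hot target" expression, which is what backpropagation carries down through the classification branch.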
To minimize the loss, we take its derivative with respect to the network parameters and obtain the gradient as:
(4) 
We compute the gradients of the regression and classification terms respectively as follows:
(5) 
where the remaining partial derivatives can be computed via the chain rule and backpropagation.

III-C Post Refinement
III-C1 Refinement Network
Taking the input RGB image and the features of the last classification branch of RCCN as inputs, the refinement network incorporates multi-scale features (implying different receptive field sizes) and refines the discrete depth map to a higher spatial resolution. Via its convolution blocks, we obtain features at a quarter of the resolution of the original image (a relatively small receptive field). Similar to the second stage of RCCN, the RCCN features are deconvolved to the same resolution with multi-channel outputs. Then, convolutional layers are applied to these two scales of features to obtain the refined discrete depth map. The supervision in this refinement network is the same as in the second stage of RCCN, except for the higher spatial resolution. In our experiments, we initialize part of the refinement network's parameters from the corresponding trained layers of RCCN; note that the two sets of parameters remain independent thereafter.
III-C2 Fusion Network
The depth map from the refinement network already incorporates multi-scale features in multi-scale receptive fields, and on average achieves a more accurate depth estimate than the two maps predicted by RCCN, but not for all individual pixels. There are two possible reasons for this phenomenon: (i) the refined depth map is still in discrete space; and (ii) depth estimation for some objects (especially large objects) depends more on information from large receptive fields, and features from small receptive fields may introduce noise. To address this, we integrate the depth maps of all three scales within the fusion network, which consists of just a few convolutional layers in our experiments. The supervision in the fusion network is the same as that in the first stage of RCCN, except for the higher spatial resolution.
| Metric | R | C | RRCN | RCCN | CCCN | RCCN |  |
| δ<1.25 | 0.732 | 0.806 | 0.762 | 0.852 | 0.835 | 0.741 | 0.750 |
| δ<1.25² | 0.905 | 0.947 | 0.928 | 0.963 | 0.950 | 0.913 | 0.919 |
| δ<1.25³ | 0.951 | 0.985 | 0.968 | 0.990 | 0.982 | 0.956 | 0.960 |
| Abs Rel | 0.172 | 0.142 | 0.162 | 0.123 | 0.132 | 0.168 | 0.162 |
| Squa Rel | 1.105 | 0.892 | 1.060 | 0.763 | 0.893 | 1.092 | 1.071 |
| RMSE | 5.829 | 4.711 | 5.105 | 4.235 | 5.102 | 5.476 | 5.265 |
| RMSE log | 0.282 | 0.198 | 0.235 | 0.174 | 0.208 | 0.270 | 0.247 |
IV Experiments
To validate the compromise principle and demonstrate the effectiveness of RCCN, we present a number of experiments examining different aspects of the approach. After introducing the common experimental settings, we evaluate our methods on three challenging datasets, i.e., NYU Depth V2 [1], KITTI [2], and Make3D [30, 3], via the error metrics used in previous works.
| Method | δ<1.25 | δ<1.25² | δ<1.25³ | Abs Rel | Squa Rel | RMSE | RMSE log |
(first three columns: higher is better; last four: lower is better)
| Make3D [3] | 0.447 | 0.745 | 0.897 | 0.349 | 0.492 | 1.214 | 0.409 |
| DepthTransfer [18] | 0.460 | 0.742 | 0.893 | 0.350 | 0.539 | 1.1 | 0.378 |
| Liu et al. [40] | 0.475 | 0.770 | 0.911 | 0.335 | 0.442 | 1.06 | 0.362 |
| Ladicky et al. [21] | 0.542 | 0.829 | 0.941 | – | – | – | – |
| Li et al. [58] | 0.621 | 0.886 | 0.968 | 0.232 | – | 0.821 | – |
| Wang et al. [32] | 0.605 | 0.890 | 0.970 | 0.220 | 0.210 | 0.745 | 0.262 |
| Roy et al. [33] | – | – | – | 0.187 | – | 0.744 | – |
| Liu et al. [57] | 0.650 | 0.906 | 0.976 | 0.213 | – | 0.759 | – |
| Eigen et al. [34] | 0.769 | 0.950 | 0.988 | 0.158 | 0.121 | 0.641 | 0.214 |
| Chakrabarti et al. [59] | 0.806 | 0.958 | 0.987 | 0.149 | 0.118 | 0.620 | 0.205 |
| Laina et al. [52] | 0.629 | 0.889 | 0.971 | 0.194 | – | 0.790 | – |
| Xu et al. [60] | 0.636 | 0.896 | 0.972 | 0.193 | – | 0.792 | – |
| Li et al. [61] | 0.789 | 0.955 | 0.988 | 0.152 | – | 0.611 | – |
| Laina et al. [52] | 0.811 | 0.953 | 0.988 | 0.127 | – | 0.573 | 0.195 |
| Li et al. [61] | 0.788 | 0.958 | 0.991 | 0.143 | – | 0.635 | – |
| Xu et al. [60] | 0.811 | 0.954 | 0.987 | 0.121 | – | 0.586 | – |
| RCCN-VGG | 0.753 | 0.937 | 0.983 | 0.165 | 0.138 | 0.607 | 0.213 |
| RCCN-VGG | 0.765 | 0.950 | 0.991 | 0.160 | 0.131 | 0.586 | 0.204 |
| RCCN-ResNet | 0.807 | 0.957 | 0.992 | 0.136 | 0.116 | 0.564 | 0.199 |
IV-A Experimental Setting
IV-A1 Implementation Details
In order to fairly compare the proposed method with current state-of-the-art methods, we adopt both VGG-16 [62] and ResNet-101 [63] as our backbones. We initialize the parameters of the base convolutional layers of RCCN from classification models pre-trained on ILSVRC [64]. The training procedures for RCCN-VGG and RCCN-ResNet differ slightly. For RCCN-VGG, we directly train the whole network with a learning rate following a polynomial decay schedule, with fixed power, momentum, and weight decay. For RCCN-ResNet, however, we find that directly optimizing the network with a large base learning rate results in unexpected divergence after some iterations, while a small base learning rate results in a slow convergence rate. To speed up training, we fix all the parameters of the base convolutional layers, first train the regression stage for a number of iterations, then train the two stages together for further iterations, and finally optimize all the parameters of RCCN-ResNet together with a small base learning rate. Furthermore, the networks of the post-refinement stages are trained independently with the parameters of RCCN fixed. The proposed method is implemented on the public deep learning platform Caffe [65], and trained on 4 TITAN X GPUs (12 GB of memory each) with a batch size of 4.

IV-A2 Data Augmentation
Following previous works [51, 52], we employ several data augmentation techniques during training to prevent overfitting and to learn a better model, including: (i) Random Cropping: we randomly crop rectangles of predefined sizes from the original image. (ii) Flipping: we randomly flip the original image horizontally. (iii) Scaling: we randomly resize the original image by a scale factor drawn from a predefined interval, and normalize the associated depth map by the corresponding scale. (iv) Rotation: we randomly rotate the input image by a small random angle.
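The scaling step is the only geometrically subtle one: enlarging the image makes the scene appear closer, so the depth map must be divided by the scale factor. A minimal sketch (function names are ours; nearest-neighbor resize is used purely for brevity):

```python
import numpy as np

def scale_augment(image, depth, s):
    """Scale augmentation for depth training.

    Resizing the image by factor s makes the scene appear s times
    closer, so the depth map is divided by s to stay consistent.
    Nearest-neighbor resize keeps the sketch dependency-free.
    """
    h, w = depth.shape
    rows = (np.arange(int(h * s)) / s).astype(int).clip(0, h - 1)
    cols = (np.arange(int(w * s)) / s).astype(int).clip(0, w - 1)
    return image[rows][:, cols], depth[rows][:, cols] / s

def flip_augment(image, depth):
    """Horizontal flip; the depth map is flipped identically."""
    return image[:, ::-1], depth[:, ::-1]
```

Omitting the depth normalization would teach the network an inconsistent scale-to-depth relationship, which is why flipping and cropping transform the depth map identically to the image while scaling also rescales its values.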
IV-A3 Evaluation Metrics
Below we list the depth error metrics on which the quantitative evaluation is based:
(6) 
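The metric definitions of Eq. (6) did not survive extraction; the sketch below assumes the standard definitions used in prior depth estimation work (threshold accuracies, Abs Rel, Squa Rel, RMSE, and RMSE in log space):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Commonly used depth error metrics (standard definitions assumed).

    pred, gt: arrays of predicted and ground-truth depths over the
    valid pixels. Returns the threshold accuracies (higher is better)
    and the error measures (lower is better).
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
        "abs_rel": np.mean(np.abs(pred - gt) / gt),
        "squa_rel": np.mean((pred - gt) ** 2 / gt),
        "rmse": np.sqrt(np.mean((pred - gt) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
    }
```

Note that the threshold accuracies use the symmetric ratio max(pred/gt, gt/pred), so over- and under-estimation by the same factor are penalized equally.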
| Algorithm | C1 Abs Rel | C1 Ave | C1 RMSE | C2 Abs Rel | C2 Ave | C2 RMSE |
| Make3D [3] | – | – | – | 0.370 | 0.187 | – |
| Liu et al. [66] | – | – | – | 0.379 | 0.148 | – |
| DepthTransfer [18] | 0.355 | 0.127 | 9.20 | 0.361 | 0.148 | 15.10 |
| Liu et al. [40] | 0.335 | 0.137 | 9.49 | 0.338 | 0.134 | 12.60 |
| Li et al. [58] | 0.278 | 0.092 | 7.12 | 0.279 | 0.102 | 10.27 |
| Liu et al. [57] | 0.287 | 0.109 | 7.36 | 0.287 | 0.122 | 14.09 |
| Roy et al. [33] | – | – | – | 0.260 | 0.119 | 12.40 |
| Laina et al. [52] | 0.176 | 0.072 | 4.46 | – | – | – |
| LRC-Deep3D [49] | 1.000 | 2.527 | 19.11 | – | – | – |
| LRC [48] | 0.443 | 0.156 | 11.513 | – | – | – |
| Kuznietsov et al. [36] | 0.421 | 0.190 | 8.24 | – | – | – |
| Xu et al. [60] | 0.184 | 0.065 | 4.38 | 0.198 | 4.53 | 8.56 |
| RCCN-VGG | 0.252 | 0.104 | 8.82 | 0.255 | 0.106 | 11.57 |
| RCCN-ResNet | 0.189 | 0.082 | 5.57 | 0.192 | 0.088 | 9.34 |
IV-B KITTI
The KITTI dataset [2] contains outdoor scenes captured by cameras and depth sensors mounted on a driving car. All 61 scenes from the "city", "residential", and "road" categories constitute our training/test sets. We test on 697 images from the 29 scenes split by [51], and train on 23,486 images from the remaining 32 scenes. All images are resized from the original resolution, and we train our model on random crops of a predefined size. At test time, we split each image into 3 overlapping windows and obtain the predicted depth values in overlapped regions by averaging the 3 predictions. The evaluation metrics are computed at the original resolution on the center crop predefined by Eigen et al. [51]. Note that, since ground truth depth is provided for only about 15% of the points within the bottom part of each image, some depth targets in the bottom parts of our training images are filled in using the colorization routine of the NYU Depth development kit [67], following [51].

Besides the representative qualitative results shown in Fig. 6, we summarize the quantitative evaluation in Tab. II, which demonstrates that the proposed approach significantly outperforms previous methods in all the considered error metrics.
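The test-time procedure of splitting an image into overlapping windows and averaging the predictions in overlapped regions can be sketched as follows (a generic accumulate-and-normalize implementation; window sizes and offsets are illustrative):

```python
import numpy as np

def merge_overlapping_windows(preds, offsets, full_width):
    """Average per-pixel predictions from overlapping horizontal windows.

    preds:      list of (H, w_i) depth predictions, one per window.
    offsets:    left coordinate of each window in the full image; the
                windows are assumed to jointly cover every column.
    full_width: width of the full image.
    """
    h = preds[0].shape[0]
    acc = np.zeros((h, full_width))  # sum of predictions per pixel
    cnt = np.zeros((h, full_width))  # number of windows covering each pixel
    for p, x0 in zip(preds, offsets):
        acc[:, x0:x0 + p.shape[1]] += p
        cnt[:, x0:x0 + p.shape[1]] += 1
    return acc / cnt

# two 4-wide windows on a 6-wide image, overlapping in columns 2-3
preds = [np.full((2, 4), 2.0), np.full((2, 4), 4.0)]
merged = merge_overlapping_windows(preds, [0, 2], 6)
```

Averaging in the overlap suppresses the border artifacts each window produces near its own edges, at the cost of running the network once per window.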
To further explore the compromise principle and demonstrate the effectiveness of the proposed RCCN, we train several related network variants: directly regressing continuous depth, directly estimating discrete depth via classification, jointly modeling continuous depth with our network architecture, and a classification-classification cascaded network. From the quantitative results shown in Tab. III and Fig. 3, we can conclude that: 1) when depth estimation is treated as a classification problem instead of regression, the network converges to a better local solution on average; 2) an image-receptive-field-of-view understanding, together with local high-level convolutional features, indeed helps deep networks better learn the depth distribution of a scene; 3) the compromise between spatial and depth resolutions simplifies network training; 4) the performances of RRCN and RCCN deteriorate once the low spatial resolution branch is removed (yielding R and C, respectively), demonstrating that the first stage benefits the second; 5) the comparison with RCCN shows that the second stage also benefits the first; and 6) from the relative ordering of RCCN, CCCN, and RRCN, it can be seen that high spatial resolution together with high depth resolution leads to worse results.
IV-C NYU Depth V2
The NYU Depth V2 dataset [1] contains 464 indoor video scenes captured with a Microsoft Kinect camera. We randomly sample half of the 120K images of the raw dataset according to the official training-scene split as our training set, and test on the 694-image test set. We train our model on random crops of a predefined size. On top of the qualitative results in Fig. 4, we report in Tab. IV the quantitative results under several common metrics used in previous works [34, 31]. The predictions of our model yield comparable or state-of-the-art results in comparison with previous works. Specifically, the estimated coarse continuous depth outperforms the "Coarse+Fine" prediction of Eigen et al. [51]. The predicted high spatial resolution discrete depth in particular obtains an impressive improvement, demonstrating that both the discrete depth formulation and the proposed RCCN framework are effective.
IV-D Make3D
The Make3D dataset [30, 3] contains 534 outdoor images, 400 for training and 134 for testing. Our model is trained on random crops of a predefined size. In the test phase, we split each test image into several sub-images, and use max pooling over the overlapping regions to obtain the final predictions. As shown in Tab. V, we report the C1 and C2 errors on this dataset [18]. We achieve state-of-the-art performance in all error metrics.
IV-E Generalization to Cityscapes
We also demonstrate the generalization ability of our model. Specifically, we test our model trained only on KITTI on images provided by Cityscapes [68], which is also a large benchmark for autonomous driving. As shown in Fig. 7, our KITTI model captures the general scene layout and objects such as cars, trees and pedestrians quite well in images from Cityscapes.
V Conclusion
In this paper, we have presented a deep CNN architecture for MDE. Based on the observation that training a network to estimate a high spatial resolution continuous depth map is difficult, we design network architectures according to the compromise principle: training a network to estimate a depth map with reduced spatial resolution or depth resolution is easier. Following this principle, we propose a regression-classification cascaded network that jointly models continuous depth and discrete depths in two branches. The proposed approach is validated on three widely-used and challenging datasets, where it achieves competitive or state-of-the-art results. Moreover, dedicated experiments demonstrate that our network is superior to its variants, which further validates the design of our approach. We will continue to investigate new methodologies for reducing the depth resolution, and to extend our framework to other challenging dense prediction problems.
References
 [1] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in ECCV, 2012.
 [2] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” IJRR, 2013.
 [3] A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Learning 3d scene structure from a single still image,” IEEE TPAMI, vol. 31, no. 5, pp. 824–840, 2009.
 [4] L. Sheng, J. Cai, T.-J. Cham, V. Pavlovic, and K. N. Ngan, "A generative model for depth-based robust 3d facial pose tracking," CVPR, 2017.
 [5] S. Savarese and L. Fei-Fei, "3d generic object categorization, localization and pose estimation," in ICCV, 2007.
 [6] D. Sun, E. B. Sudderth, and M. J. Black, “Layered image motion with explicit occlusions, temporal consistency, and depth ordering,” in NIPS, 2010.
 [7] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM TOG, vol. 24, no. 3, pp. 577–584, 2005.
 [8] W. Sun, G. Cheung, P. A. Chou, D. Florencio, C. Zhang, and O. C. Au, "Rate-constrained 3d surface estimation from noise-corrupted multiview depth videos," IEEE TIP, vol. 23, no. 7, pp. 3138–3151, 2014.
 [9] K. Müller, H. Schwarz, D. Marpe, C. Bartnik, S. Bosse, H. Brust, T. Hinz, H. Lakshman, P. Merkle, F. H. Rhee et al., "3d high-efficiency video coding for multiview video and depth data," IEEE TIP, vol. 22, no. 9, pp. 3366–3378, 2013.
 [10] T.-Y. Chung, J.-Y. Sim, and C.-S. Kim, "Bit-allocation algorithm with novel view synthesis distortion model for multiview video plus depth coding," IEEE TIP, vol. 23, no. 8, pp. 3254–3267, 2014.
 [11] S. Hou, Z. Wang, and F. Wu, "Deeply exploit depth information for object detection," in CVPRW, 2016.
 [12] M. Kiechle, S. Hawe, and M. Kleinsteuber, "A joint intensity and depth co-sparse analysis model for depth map super-resolution," in CVPR.
 [13] H. Ha, S. Im, J. Park, H.-G. Jeon, and I. S. Kweon, "High-quality depth from uncalibrated small motion clip," in CVPR, 2016.
 [14] N. Kong and M. J. Black, “Intrinsic depth: Improving depth transfer with intrinsic images,” in ICCV, 2015.
 [15] Y. Liu, X. Cao, Q. Dai, and W. Xu, “Continuous depth estimation for multiview stereo,” in CVPR, 2009.
 [16] M. Poostchi, H. Aliakbarpour, R. Viguier, F. Bunyak, K. Palaniappan, and G. Seetharaman, “Semantic depth map fusion for moving vehicle detection in aerial video,” in CVPRW.
 [17] B. Li, L.-Y. Duan, C.-W. Lin, T. Huang, and W. Gao, “Depth-preserving warping for stereo image retargeting,” IEEE TIP, vol. 24, no. 9, pp. 2811–2826, 2015.
 [18] K. Karsch, C. Liu, and S. B. Kang, “Depth transfer: Depth extraction from video using non-parametric sampling,” IEEE TPAMI, vol. 36, no. 11, pp. 2144–2158, 2014.
 [19] A. Rajagopalan, S. Chaudhuri, and U. Mudenagudi, “Depth estimation and image restoration using defocused stereo pairs,” IEEE TPAMI, vol. 26, no. 11, pp. 1521–1525, 2004.
 [20] D. Hoiem, A. A. Efros, and M. Hebert, “Recovering surface layout from an image,” IJCV, vol. 75, no. 1, pp. 151–172, 2007.
 [21] L. Ladicky, J. Shi, and M. Pollefeys, “Pulling things out of perspective,” in CVPR, 2014.
 [22] J. Konrad, M. Wang, and P. Ishwar, “2d-to-3d image conversion by learning depth from examples,” in CVPRW, 2012.
 [23] A. A. Alatan and L. Onural, “Estimation of depth fields suitable for video compression based on 3d structure and motion of objects,” IEEE TIP, vol. 7, no. 6, pp. 904–908, 1998.
 [24] R. Lai, Y. Shi, K. Scheibel, S. Fears, R. Woods, A. W. Toga, and T. F. Chan, “Metric-induced optimal embedding for intrinsic 3d shape analysis,” in CVPR, 2010.
 [25] Z. Yang, Z. Xiong, Y. Zhang, J. Wang, and F. Wu, “Depth acquisition from density modulated binary patterns,” in CVPR, 2013.
 [26] I. Tosic and K. Berkner, “Light field scale-depth space transform for dense depth estimation,” in CVPRW, 2014.
 [27] J. Lin, X. Ji, W. Xu, and Q. Dai, “Absolute depth estimation from a single defocused image,” IEEE TIP, vol. 22, no. 11, pp. 4545–4550, 2013.
 [28] W. Dong, G. Shi, X. Li, K. Peng, J. Wu, and Z. Guo, “Color-guided depth recovery via joint local structural and nonlocal low-rank regularization,” IEEE TMM, vol. 19, pp. 293–301, 2017.
 [29] H. Sheng, S. Zhang, X. Cao, Y. Fang, and Z. Xiong, “Geometric occlusion analysis in depth estimation using integral guided filter for light-field image,” IEEE TIP, vol. 26, pp. 5758–5771, 2017.
 [30] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning depth from single monocular images,” in NIPS, 2006.
 [31] F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for depth estimation from a single image,” in CVPR, 2015.
 [32] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille, “Towards unified depth and semantic prediction from a single image,” in CVPR, 2015.
 [33] A. Roy and S. Todorovic, “Monocular depth estimation using neural regression forest,” in CVPR, 2016.
 [34] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in ICCV, 2015.
 [35] S. Kim, K. Park, K. Sohn, and S. Lin, “Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields,” in ECCV, 2016.
 [36] Y. Kuznietsov, J. Stückler, and B. Leibe, “Semi-supervised deep learning for monocular depth map prediction,” CVPR, 2017.
 [37] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” IJCV, vol. 47, no. 1–3, pp. 7–42, 2002.
 [38] D. Forsyth and J. Ponce, Computer Vision: a Modern Approach. Prentice Hall, 2002.
 [39] W. Zhuo, M. Salzmann, X. He, and M. Liu, “Indoor scene structure analysis for single image depth estimation,” in CVPR, 2015.
 [40] M. Liu, M. Salzmann, and X. He, “Discrete-continuous depth estimation from a single image,” in CVPR, 2014.
 [41] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” IJCV, vol. 42, no. 3, pp. 145–175, 2001.
 [42] C. Liu, J. Yuen, and A. Torralba, “SIFT flow: Dense correspondence across scenes and its applications,” IEEE TPAMI, vol. 33, no. 5, pp. 978–994, 2011.
 [43] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” in CVPR, 2016.
 [44] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-resnet and the impact of residual connections on learning,” arXiv preprint arXiv:1602.07261, 2016.
 [45] Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in CVPR, 2013.
 [46] G.-J. Qi, “Hierarchically gated deep networks for semantic segmentation,” in CVPR, 2016.
 [47] R. Garg, G. Carneiro, and I. Reid, “Unsupervised CNN for single view depth estimation: Geometry to the rescue,” in ECCV, 2016.
 [48] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” CVPR, 2017.
 [49] J. Xie, R. Girshick, and A. Farhadi, “Deep3D: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks,” in ECCV, 2016.
 [50] X. Wang, D. Fouhey, and A. Gupta, “Designing deep networks for surface normal estimation,” in CVPR, 2015.
 [51] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in NIPS, 2014.
 [52] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 3DV, 2016.
 [53] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman, “Learning ordinal relationships for mid-level vision,” in ICCV, 2015.
 [54] W. Chen, Z. Fu, D. Yang, and J. Deng, “Single-image depth perception in the wild,” in NIPS, 2016.
 [55] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” arXiv:1606.00915, 2016.
 [56] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in CVPR, 2010.
 [57] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolutional neural fields,” IEEE TPAMI, vol. 38, no. 10, pp. 2024–2039, 2016.
 [58] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs,” in CVPR, 2015.
 [59] A. Chakrabarti, J. Shao, and G. Shakhnarovich, “Depth from a single image by harmonizing overcomplete local network predictions,” in NIPS, 2016.
 [60] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, “Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation,” CVPR, 2017.
 [61] J. Li, R. Klein, and A. Yao, “A two-streamed network for estimating fine-scaled depth maps from single RGB images,” in ICCV, 2017.
 [62] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
 [63] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.

 [64] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.
 [65] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
 [66] B. Liu, S. Gould, and D. Koller, “Single image depth estimation from predicted semantic labels,” in CVPR, 2010.
 [67] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,” ACM TOG, vol. 23, no. 3, pp. 689–694, 2004.

 [68] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016.