This paper addresses the problem of estimating the depth map of a scene given a single RGB image. We propose a fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps. In order to improve the output resolution, we present a novel way to efficiently learn feature map up-sampling within the network. For optimization, we introduce the reverse Huber loss, which is particularly suited for the task at hand and driven by the value distributions commonly present in depth maps. Our model is composed of a single architecture that is trained end-to-end and does not rely on post-processing techniques, such as CRFs or other additional refinement steps. As a result, it runs in real-time on images or videos. In the evaluation, we show that the proposed model contains fewer parameters and requires less training data than the current state of the art, while outperforming all approaches on depth estimation. Code and models are publicly available.
Depth estimation from a single view is a discipline as old as computer vision and encompasses several techniques that have been developed throughout the years. One of the most successful among these techniques is Structure-from-Motion (SfM); it leverages camera motion to estimate camera poses through different temporal intervals and, in turn, estimate depth via triangulation from pairs of consecutive views. As an alternative to motion, other working assumptions can be used to estimate depth, such as variations in illumination or focus.
In the absence of such environmental assumptions, depth estimation from a single image of a generic scene is an ill-posed problem, due to the inherent ambiguity of mapping an intensity or color measurement to a depth value. While the human brain is subject to the same limitation, depth perception can nevertheless emerge from monocular vision. Hence, developing a computer vision system capable of estimating depth maps by exploiting monocular cues is not only a challenging task, but also a necessary one in scenarios where direct depth sensing is not available or not possible. Moreover, the availability of reasonably accurate depth information is well known to improve many computer vision tasks over their RGB-only counterparts, for example reconstruction, recognition, semantic segmentation or human pose estimation.
For this reason, several works tackle the problem of monocular depth estimation. One of the first approaches assumed superpixels as planar and inferred depth through plane coefficients via Markov Random Fields (MRFs) . Superpixels have also been considered in [16, 20, 37], where Conditional Random Fields (CRFs) are deployed for the regularization of depth maps. Data-driven approaches, such as [10, 13], have proposed to carry out image matching based on hand-crafted features to retrieve the most similar candidates of the training set to a given query image. The corresponding depth candidates are then warped and merged in order to produce the final outcome.
Recently, Convolutional Neural Networks (CNNs) have been employed to learn an implicit relation between color pixels and depth [5, 6, 16, 19, 37]. CNN approaches have often been combined with CRF-based regularization, either as a post-processing step [16, 37] or via structured deep learning, as well as with random forests. These methods entail a higher complexity due to either the high number of parameters involved in a deep network [5, 6, 19] or the joint use of a CNN and a CRF [16, 37]. Nevertheless, deep learning has boosted the accuracy on standard benchmark datasets considerably, ranking these methods first in the state of the art.
In this work, we propose to learn the mapping between a single RGB image and its corresponding depth map using a CNN. The contribution of our work is as follows. First, we introduce a fully convolutional architecture for depth prediction, endowed with novel up-sampling blocks, that allows for dense output maps of higher resolution, while at the same time requiring fewer parameters and training on one order of magnitude less data than the state of the art, and outperforming all existing methods on standard benchmark datasets [23, 29]. We further propose a more efficient scheme for up-convolutions and combine it with the concept of residual learning to create up-projection blocks for the effective upsampling of feature maps. Last, we train the network by optimizing a loss based on the reverse Huber function (berHu) and demonstrate, both theoretically and experimentally, why it is beneficial and better suited for the task at hand. We thoroughly evaluate the influence of the network's depth, the loss function and the specific layers employed for up-sampling in order to analyze their benefits. Finally, to further assess the accuracy of our method, we employ the trained model within a 3D reconstruction scenario, in which we use a sequence of RGB frames and their predicted depth maps for Simultaneous Localization and Mapping (SLAM).
Depth estimation from image data has originally relied on stereo vision [22, 32], using image pairs of the same scene to reconstruct 3D shapes. In the single-view case, most approaches relied on motion (Structure-from-Motion ) or different shooting conditions (Shape-from-Shading , Shape-from-Defocus ). Despite the ambiguities that arise in lack of such information, but inspired by the analogy to human depth perception from monocular cues, depth map prediction from a single RGB image has also been investigated. Below, we focus on the related work for single RGB input, similar to our method.
Classic methods on monocular depth estimation have mainly relied on hand-crafted features and used probabilistic graphical models to tackle the problem [8, 17, 29, 30], usually making strong assumptions about scene geometry. One of the first works, by Saxena et al., uses an MRF to infer depth from local and global features extracted from the image, while superpixels are introduced in the MRF formulation in order to enforce neighboring constraints. Their work has been later extended to 3D scene reconstruction. Inspired by this work, Liu et al. combine the task of semantic segmentation with depth estimation, where predicted labels are used as additional constraints to facilitate the optimization task. Ladicky et al. instead jointly predict labels and depths in a classification approach.
A second cluster of related work comprises non-parametric approaches for depth transfer [10, 13, 18, 20], which typically perform feature-based matching (GIST, HOG) between a given RGB image and the images of an RGB-D repository in order to find the nearest neighbors; the retrieved depth counterparts are then warped and combined to produce the final depth map. Karsch et al. perform warping using SIFT Flow, followed by a global optimization scheme, whereas Konrad et al. compute a median over the retrieved depth maps followed by cross-bilateral filtering for smoothing. Instead of warping the candidates, Liu et al. formulate the optimization problem as a Conditional Random Field (CRF) with continuous and discrete variable potentials. Notably, these approaches rely on the assumption that similarities between regions in the RGB images imply also similar depth cues.
More recently, remarkable advances in the field of deep learning drove research towards the use of CNNs for depth estimation. Since the task is closely related to semantic labeling, most works have built upon the most successful architectures of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), often initializing their networks with AlexNet or the deeper VGG. Eigen et al. were the first to use CNNs for regressing dense depth maps from a single image in a two-scale architecture, where the first stage – based on AlexNet – produces a coarse output and the second stage refines the original prediction. Their work was later extended to additionally predict normals and labels with a deeper and more discriminative model – based on VGG – and a three-scale architecture for further refinement. Unlike the deep architectures of [5, 6], Roy and Todorovic propose combining CNNs with regression forests, using very shallow architectures at each tree node, thus limiting the need for big data.
Another direction for improving the quality of the predicted depth maps has been the combined use of CNNs and graphical models [16, 19, 37]. Liu  propose to learn the unary and pairwise potentials during CNN training in the form of a CRF loss, while Li  and Wang  use hierarchical CRFs to refine their patch-wise CNN predictions from superpixel down to pixel level.
Our method uses a CNN for depth estimation and differs from previous work in that it replaces the typical fully-connected layers, which are expensive with respect to the number of parameters, with a fully convolutional model incorporating efficient residual up-sampling blocks, which we refer to as up-projections and which prove to be more suitable when tackling high-dimensional regression problems.
In this section, we describe our model for depth prediction from a single RGB image. We first present the employed architecture, then analyze the new components proposed in this work. Subsequently, we propose a loss function suitable for the optimization of the given task.
Almost all current CNN architectures contain a contractive part that progressively decreases the input image resolution through a series of convolutions and pooling operations, giving higher-level neurons large receptive fields and thus capturing more global information. In regression problems in which the desired output is a high resolution image, some form of up-sampling is required in order to obtain a larger output map. Eigen et al. [5, 6] use fully-connected layers as in a typical classification network, yielding a full receptive field. The outcome is then reshaped to the output resolution.
We introduce a fully convolutional network for depth prediction. Here, the receptive field is an important aspect of the architectural design, as there are no explicit full connections. Specifically, assume we fix the input resolution and predict an output map at approximately half the input resolution. We investigate popular architectures (AlexNet, VGG-16) as the contractive part, since their pre-trained weights facilitate convergence. The receptive field at the last convolutional layer of AlexNet is relatively small, allowing only very low resolution input images when true global information (monocular cues) should be captured by the network without fully-connected layers. VGG-16 achieves a larger receptive field, but it still limits the input resolution. Eigen and Fergus show a substantial improvement when switching from AlexNet to VGG, but since both their models use fully-connected layers, this is due to the higher discriminative power of VGG.
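As an aside, the receptive field of a contractive stack can be computed from kernel sizes and strides alone; each layer adds (kernel − 1) input pixels multiplied by the cumulative stride of the layers before it. The sketch below is our own illustration with a hypothetical layer stack, not the exact AlexNet or VGG-16 configuration:

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv/pool layers.

    Each layer is a (kernel_size, stride) pair; the field grows by
    (kernel - 1) times the cumulative stride of all preceding layers.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# hypothetical stack: 7x7 conv stride 2, 3x3 pool stride 2, two 3x3 convs
print(receptive_field([(7, 2), (3, 2), (3, 1), (3, 1)]))  # 27
```

Striding early in the network multiplies the contribution of every later kernel, which is why very deep residual architectures reach receptive fields that span the whole input image.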
Recently, ResNet introduced skip connections that bypass two or more convolutions and are summed to their output, including batch normalization after every convolution (see Fig. 1). Following this design, it is possible to create much deeper networks without facing degradation or vanishing gradients. Another beneficial property of these extremely deep architectures is their large receptive field; ResNet-50 covers an area large enough to fully capture the input image even at higher resolutions. Given our input size and this architecture, the last convolutional layers result in 2048 low-resolution feature maps when removing the last pooling layer. As we show later, the proposed model, which uses residual up-convolutions, produces an output at roughly half the input resolution. If we instead added a fully-connected layer of the same size, it would introduce billions of parameters, worth several gigabytes of memory, rendering this approach infeasible on current hardware. This further motivates our proposal of a fully convolutional architecture with up-sampling blocks that contain fewer weights while improving the accuracy of the predicted depth maps.
Our proposed architecture can be seen in Fig. 1. The feature map sizes correspond to the network trained for the input size used with the NYU Depth v2 dataset. The first part of the network is based on ResNet-50 and initialized with pre-trained weights. The second part of our architecture guides the network into learning its upscaling through a sequence of unpooling and convolutional layers. Following the set of these up-sampling blocks, dropout is applied, succeeded by a final convolutional layer yielding the prediction.
Unpooling layers [4, 21, 38] perform the reverse operation of pooling, increasing the spatial resolution of feature maps. We adapt the approach described in prior work for the implementation of unpooling layers, doubling the size by mapping each entry into the top-left corner of a (zero-filled) kernel. Each such layer is followed by a convolution – so that it is applied to more than one non-zero element at each location – and successively by a ReLU activation. We refer to this block as up-convolution. Empirically, we stack four such up-convolutional blocks (i.e., 16x upscaling of the smallest feature map), resulting in the best trade-off between memory consumption and resolution. We found that performance did not increase when adding a fifth block.
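The unpooling step can be sketched in a few lines of NumPy (a simplified single-channel illustration, not the authors' implementation): each entry goes to the top-left corner of a 2x2 zero block, so both spatial dimensions double and 75% of the output entries are zero.

```python
import numpy as np

def unpool(x):
    """Double the spatial size of a 2D map by writing each entry to the
    top-left corner of a 2x2 block of zeros."""
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w), dtype=x.dtype)
    out[::2, ::2] = x
    return out

x = np.arange(1, 7, dtype=float).reshape(2, 3)
u = unpool(x)
print(u.shape)          # (4, 6)
print(np.mean(u == 0))  # 0.75
```

The 75% of guaranteed zeros is exactly what motivates the faster formulation described next.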
We further extend simple up-convolutions using a similar but inverse concept to  to create up-sampling res-blocks. The idea is to introduce a simple convolution after the up-convolution and to add a projection connection from the lower resolution feature map to the result, as shown in Fig. 2(c). Because of the different sizes, the small-sized map needs to be up-sampled using another up-convolution in the projection branch, but since the unpooling only needs to be applied once for both branches, we just apply the convolutions separately on the two branches. We call this new up-sampling block up-projection since it extends the idea of the projection connection  to up-convolutions. Chaining up-projection blocks allows high-level information to be more efficiently passed forward in the network while progressively increasing feature map sizes. This enables the construction of our coherent, fully convolutional network for depth prediction. Fig. 2 shows the differences between an up-convolutional block to up-projection block. It also shows the corresponding fast versions that will be described in the following section.
One further contribution of this work is to reformulate the up-convolution operation so as to make it more efficient, leading to a decrease in training time of the whole network of around 15%. This also applies to the newly introduced up-projection operation. The main intuition is as follows: after unpooling, 75% of the resulting feature map entries are zero, thus the following convolution mostly operates on zeros, which can be avoided in our modified formulation. This can be observed in Fig. 3. In the top left, the original feature map is unpooled (top middle) and then convolved by a filter. We observe that in an unpooled feature map, depending on the location (red, blue, purple, orange bounding boxes) of the filter, only certain weights are multiplied with potentially non-zero values. These weights fall into four non-overlapping groups, indicated by different colors and A, B, C, D in the figure. Based on the filter groups, we rearrange the original filter into four new, smaller filters (A–D). Exactly the same output as the original operation (unpooling and convolution) can now be achieved by interleaving the elements of the four resulting feature maps as in Fig. 3. The corresponding changes from a simple up-convolutional block to the proposed up-projection are shown in Fig. 2 (d).
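The decomposition can be verified on a 1D analog (our own sketch, not the paper's code): after unpooling, even output positions of a length-5 filter only ever touch its even-indexed taps, and odd positions only the odd-indexed taps, so unpool-then-convolve equals two smaller convolutions applied directly to the original map, interleaved.

```python
import numpy as np

def unpool1d(x):
    # 1D analog of unpooling: entries at even positions, zeros at odd ones
    u = np.zeros(2 * len(x))
    u[::2] = x
    return u

def naive_upconv1d(x, w):
    # unpool, then length-5 correlation with zero padding ('same' output)
    u = unpool1d(x)
    y = np.zeros_like(u)
    for j in range(len(u)):
        for k in range(5):
            m = j + k - 2
            if 0 <= m < len(u):
                y[j] += w[k] * u[m]
    return y

def fast_upconv1d(x, w):
    # even outputs see only taps w[0], w[2], w[4]; odd outputs only w[1], w[3]
    n = len(x)
    y = np.zeros(2 * n)
    for i in range(n):
        acc = w[2] * x[i]                      # even output position 2i
        if i - 1 >= 0: acc += w[0] * x[i - 1]
        if i + 1 < n:  acc += w[4] * x[i + 1]
        y[2 * i] = acc
        acc = w[1] * x[i]                      # odd output position 2i + 1
        if i + 1 < n:  acc += w[3] * x[i + 1]
        y[2 * i + 1] = acc
    return y

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.allclose(naive_upconv1d(x, w), fast_upconv1d(x, w)))  # True
```

In 2D the same argument yields the four filter groups A–D of Fig. 3; the saving comes from never multiplying weights with the 75% of entries that are known to be zero.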
A standard loss function for optimization in regression problems is the $\mathcal{L}_2$ loss, minimizing the squared Euclidean norm between predictions $\tilde{y}$ and ground truth $y$: $\mathcal{L}_2(\tilde{y} - y) = \|\tilde{y} - y\|_2^2$. Although this produces good results in our test cases, we found that using the reverse Huber (berHu) [25, 40] as loss function yields a better final error than $\mathcal{L}_2$.
$$\mathcal{B}(x) = \begin{cases} |x| & |x| \le c, \\ \frac{x^2 + c^2}{2c} & |x| > c. \end{cases}$$
The berHu loss is equal to the $\mathcal{L}_1(x) = |x|$ norm when $x \in [-c, c]$ and equal to $\mathcal{L}_2$ outside this range. The version used here is continuous and first order differentiable at the point $c$ where the switch from $\mathcal{L}_1$ to $\mathcal{L}_2$ occurs. In every gradient descent step, when we compute $\mathcal{B}(\tilde{y} - y)$ we set $c = \frac{1}{5} \max_i(|\tilde{y}_i - y_i|)$, where $i$ indexes all pixels over each image in the current batch, that is, 20% of the maximal per-batch error. Empirically, berHu shows a good balance between the two norms in the given problem; it puts high weight towards samples/pixels with a high residual because of the $\mathcal{L}_2$ term, contrary for example to a robust loss, such as Tukey's biweight function, that ignores samples with high residuals. At the same time, $\mathcal{L}_1$ accounts for a greater impact of smaller residuals' gradients than $\mathcal{L}_2$ would.
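A minimal NumPy sketch of the loss as described above, with the threshold set per batch to one fifth of the maximal absolute residual:

```python
import numpy as np

def berhu_loss(pred, target):
    """Reverse Huber (berHu) loss: L1 inside [-c, c], scaled L2 outside.

    The threshold c is set per batch to one fifth of the maximal
    absolute residual."""
    r = np.abs(pred - target)
    c = 0.2 * r.max()
    if c == 0:  # all residuals are zero
        return 0.0
    return float(np.where(r <= c, r, (r ** 2 + c ** 2) / (2 * c)).sum())
```

Note that the two branches agree at $|x| = c$ (both evaluate to $c$) and so do their derivatives ($\pm 1$ versus $x/c$), which gives the continuity and first-order differentiability stated above.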
We provide two further intuitions with respect to the difference between $\mathcal{L}_2$ and the berHu loss. In both datasets that we experimented with, we observe a heavy-tailed distribution of depth values, for which Zwald and Lambert-Lacroix show that the berHu loss function is more appropriate. This could also explain why [5, 6] experience better convergence when predicting the log of the depth values, effectively moving a log-normal distribution back to Gaussian. Secondly, we see the greater benefit of berHu in the small residuals during training, as there the $\mathcal{L}_1$ derivative is greater than $\mathcal{L}_2$'s. This manifests in the error measures rel. and $\delta_1$ (Sec. 4), which are more sensitive to small errors.
In this section, we provide a thorough analysis of our methods, evaluating the different components that comprise the down-sampling and up-sampling part of the CNN architecture. We also report the quantitative and qualitative results obtained by our model and compare to the state of the art in two standard benchmark datasets for depth prediction, NYU Depth v2  (indoor scenes) and Make3D  (outdoor scenes).
For the implementation of our network we use MatConvNet , and train on a single NVIDIA GeForce GTX TITAN with 12GB of GPU memory. Weight layers of the down-sampling part of the architecture are initialized by the corresponding models (AlexNet, VGG, ResNet) pre-trained on the ILSVRC 
data for image classification. Newly added layers of the up-sampling part are initialized as random filters sampled from a normal distribution with zero mean and 0.01 variance.
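As a side note, a variance of 0.01 corresponds to a standard deviation of $\sqrt{0.01} = 0.1$; a short sketch of such an initialization (the filter shape here is illustrative only, not the actual layer configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# zero-mean normal with variance 0.01, i.e. standard deviation 0.1
filters = rng.normal(loc=0.0, scale=0.1, size=(64, 32, 5, 5))
print(round(float(filters.std()), 2))  # ~0.1
```

Note that NumPy's `scale` parameter is the standard deviation, not the variance, a common source of off-by-a-square-root initialization bugs.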
The network is trained on RGB inputs to predict the corresponding depth maps. We use data augmentation to increase the number of training samples. The input images and corresponding ground truth are transformed using small rotations, scaling, color transformations and flips with a 0.5 chance, with values following Eigen . Finally, we model small translations by random crops of the augmented images down to the chosen input size of the network.
| NYU Depth v2 | rel | rms | rms(log) | log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|
| Roy and Todorovic | 0.187 | 0.744 | – | 0.078 | – | – | – |
| Eigen and Fergus | 0.158 | 0.641 | 0.214 | – | 0.769 | 0.950 | 0.988 |
First, we evaluate on one of the largest RGB-D datasets for indoor scene reconstruction, NYU Depth v2. The raw dataset consists of 464 scenes, captured with a Microsoft Kinect, with the official split consisting of 249 training and 215 test scenes. For training, however, our method only requires a small subset of the raw distribution. We sample equally-spaced frames out of each training sequence, resulting in approximately 12k unique images. After offline augmentation of the extracted frames, our dataset comprises approximately 95k pairs of RGB-D images. We point out that our dataset is radically smaller than that required to train the model in [5, 6], consisting of 120k unique images, as well as the 800k samples extracted in patch-wise approaches. Following prior work, the original frames are down-sampled and center-cropped to the chosen input size of the network. Finally, we train our model with a batch size of 16 for approximately 20 epochs. The starting learning rate is 0.01 for all layers, which we gradually reduce every 6-8 epochs, when we observe plateaus; momentum is 0.9.
For the quantitative evaluation of our methods and comparison to the state of the art on this dataset, we compute various error measures on the commonly used test subset of 654 images. The predictions' size depends on the specific model; in our configuration, which consists of four up-sampling stages, the output resolution differs between the AlexNet-, VGG- and ResNet-based models. The predictions are then up-sampled back to the original size using bilinear interpolation and compared against the provided ground truth with filled-in depth values for invalid pixels.
In Table 1 we compare different CNN variants of the proposed architecture, in order to study the effect of each component. First, we evaluate the influence of the depth of the architecture using the convolutional blocks of AlexNet, VGG-16 and ResNet-50. It becomes apparent that a fully convolutional architecture (UpConv) on AlexNet is outperformed by the typical network with full connections (FC). As detailed in Sec. 3.1, a reason for this is the relatively small receptive field of AlexNet, which is not enough to capture the global information that is needed when removing the fully-connected layers. Using VGG as the core architecture instead improves the accuracy of depth estimation. As a fully-connected VGG variant for high-dimensional regression would incorporate a high number of parameters, we only perform tests on the fully convolutional (UpConv) model here. However, a VGG-based model with fully-connected layers was indeed employed in previous work (for their results see Table 2), performing better than our fully convolutional VGG variant, mainly due to their multi-scale architecture including the refinement scales.
Finally, switching to ResNet with a fully-connected layer (ResNet-FC) – without removing the final pooling layer – achieves similar performance to previous work for a low resolution output, using 10 times less data; however, increasing the output resolution results in such a vast number of parameters that convergence becomes harder. This further motivates the replacement of fully-connected layers and the need for more efficient upsampling techniques when dealing with high-dimensional problems. Our fully convolutional variant using simple up-convolutions (ResNet-UpConv) improves accuracy, and finally, the proposed architecture (ResNet-UpProj), enhanced with the up-projection blocks, gives by far the best results. As far as the number of parameters is concerned, we see a drastic decrease when switching from fully-connected layers to fully convolutional networks. Another common up-sampling technique that we investigated is deconvolution with successive kernels, but the up-projections notably outperformed it. Qualitatively, since our method consists of four successive up-sampling steps (2x resolution per block), it can preserve more structure in the output when compared to the FC variant (see Fig. 4).
In all shown experiments the berHu loss outperforms $\mathcal{L}_2$. The difference is higher in the relative error, which can be explained by the larger gradients of $\mathcal{L}_1$ (berHu) over $\mathcal{L}_2$ for small residuals; the influence on the relative error is higher, as pixels at smaller distances are more sensitive to smaller errors. This effect is also well visible as a stronger gain in the challenging $\delta_1$ measure.
Finally, we measure the timing of a single up-convolutional block for a single image (1.5 ms) and compare it to our faster formulation (0.14 ms). This exceeds the theoretical speed-up of 4 and is due to the fact that smaller filter sizes benefit more from the linearization inside cuDNN. Furthermore, one of the advantages of our model is the overall computation time. Predicting the depth map of a single image takes only 55 ms with the proposed up-sampling (78 ms with up-convolutions) on our setup. This enables real-time processing of images, for example from a web-cam. Further speed-up can be achieved when several images are processed in a batch. A batch size of 16 results in 14 ms per image with up-projections and 28 ms with up-convolutions.
In Table 2 we compare the results obtained by the proposed architecture to those reported by related work. Additionally, in Fig. 4 we qualitatively compare the accuracy of the estimated depth maps using the proposed approach (ResNet-UpProj) with that of the different variants (AlexNet, VGG, ResNet-FC-64x48) as well as with the publicly available predictions of Eigen and Fergus . One can clearly see the improvement in quality from AlexNet to ResNet, however the fully-connected variant of ResNet, despite its increased accuracy, is still limited to coarse predictions. The proposed fully convolutional model greatly improves edge quality and structure definition in the predicted depth maps.
Interestingly, our depth predictions exhibit noteworthy visual quality, even though they are derived from a single model, trained end-to-end, without any additional post-processing steps, such as the CRF inference of [16, 37]. On the other hand, Eigen and Fergus refine their predictions through a multi-scale architecture that combines the RGB image and the original prediction to create visually appealing results. However, they sometimes mis-estimate the global scale (second and third row) or introduce noise in case of highly-textured regions in the original image, even though there is no actual depth border in the ground truth (last row). Furthermore, their three-scale model contains several times more parameters than ours. Instead, the CNN architecture proposed here is designed with feasibility in mind; the number of parameters should not increase uncontrollably in high-dimensional problems. This further means a reduction in the number of gradient steps required as well as the data samples needed for training. Our single network generalizes better and successfully tackles the problem of coarseness that has been encountered by previous CNN approaches on depth estimation.
In addition, we evaluated our model on the Make3D dataset of outdoor scenes. It consists of 400 training and 134 testing images, gathered using a custom 3D scanner. As the dataset acquisition dates back several years, the ground truth depth maps are restricted to a much lower resolution than the original RGB images. Following previous work, we resize all images to a common resolution and further reduce the resolution of the RGB inputs to the network by half because of the large architecture and hardware limitations. We train on an augmented dataset of around 15k samples using the best performing model (ResNet-UpProj) with a batch size of 16 images for 30 epochs. The starting learning rate is 0.01 when using the berHu loss, but it needs more careful adjustment, starting at 0.005, when optimizing with $\mathcal{L}_2$. Momentum is 0.9. Please note that due to the limitations that come with the dataset, namely the low resolution ground truth and long-range inaccuracies (sky pixels mapped at 80m), we train against the ground truth depth maps by masking out pixels at distances over 70m.
In order to compare our results to the state of the art, we up-sample the predicted depth maps back to the ground truth resolution using bilinear interpolation. Table 3 reports the errors compared to previous work based on the C1 criterion, computed in regions of depth less than 70m as suggested by prior work and as implied by our training. As an aside, previous work pre-processes the images with a per-pixel sky classification to also exclude sky pixels from training. Our method significantly outperforms all previous works when trained with either the $\mathcal{L}_2$ or the berHu loss function. In this challenging dataset, the advantage of the berHu loss is more evident. Also, similarly to NYU, berHu improves the relative error more than the rms because of the weighting of close depth values. Qualitative results from this dataset are shown in Fig. 5.
To complement the previous results, we demonstrate the usefulness of depth prediction within a SLAM application, with the goal of reconstructing the geometry of a 3D environment. In particular, we deploy a SLAM framework where frame-to-frame tracking is obtained via Gauss-Newton optimization on the pixelwise intensity differences computed on consecutive frame pairs as proposed in , while fusion of depth measurements between the current frame and the global model is carried out via point-based fusion . We wish to point out that, to the best of our knowledge, this is the first demonstration of a SLAM reconstruction based on depth predictions from single images.
A qualitative comparison between the SLAM reconstructions obtained using the depth values estimated with the proposed ResNet-UpProj architecture against that obtained using the ground truth depth values on part of a sequence of the NYU Depth dataset is shown in Fig. 6. The figure also includes a comparison with the depth predictions obtained using AlexNet and VGG architectures. As it can be seen, the improved accuracy of the depth predictions, together with the good edge-preserving qualities of our up-sampling method, is not only noticeable in the qualitative results of Fig. 4, but also yields a much more accurate SLAM reconstruction compared to the other architectures. We wish to point out that, although we do not believe its accuracy could be yet compared to that achieved by methods exploiting temporal consistency for depth estimation such as SfM and monocular SLAM, our method does not explicitly rely on visual features to estimate depths, and thus holds the potential to be applied also on scenes characterized by low-textured surfaces such as walls, floors and other structures typically present in indoor environments. Although clearly outside the scope of this paper, we find these aspects relevant enough to merit future analysis.
In this work we present a novel approach to the problem of depth estimation from a single image. Unlike typical CNN approaches that require a multi-step process in order to refine their originally coarse depth predictions, our method consists of a powerful, single-scale CNN architecture that follows residual learning. The proposed network is fully convolutional, comprising up-projection layers that allow for training much deeper configurations while greatly reducing the number of parameters to be learned and the number of training samples required. Moreover, we illustrate a faster and more efficient approach to up-convolutional layers. A thorough evaluation of the different architectural components has been carried out not only by optimizing with the typical $\mathcal{L}_2$ loss, but also with the berHu loss function, showing that it is better suited for the underlying value distributions of the ground truth depth maps. All in all, the model emerging from our contributions is not only simpler than existing methods and trainable with less data in less time, but also achieves higher quality results, leading our method to the state of the art on two benchmark datasets for depth estimation.
Proc. Conf. Computer Vision and Pattern Recognition (CVPR), pages 886–893, 2005.
Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015.
Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proc. Conf. Computer Vision and Pattern Recognition (CVPR), pages 1119–1127, 2015.
A robust hybrid of lasso and ridge regression. Contemporary Mathematics, 443:59–72, 2007.