Scene depth inference from a single image is currently an important issue in machine learning [1, 2, 3, 4, 5]. The underlying rationale of this problem is that humans can perceive depth from single images. The task is to assign a depth value to every pixel in the image, which can be considered a dense regression problem. Depth information can benefit many challenging computer vision problems, such as semantic segmentation [6, 7], pose estimation, and object detection.
During the past decade, significant effort has been made in the research community to improve the performance of monocular depth learning, and considerable accuracy has been achieved thanks to the rapid development of deep neural networks. However, most available methods overlook one key problem: the ambiguity between the scene depth and the camera’s focal length. Because the 3D-to-2D imaging process must satisfy a strict projective geometric relationship, it is impossible to infer the true depth from a single image without prior knowledge of the camera’s intrinsic parameters.
In this paper, in order to remove the ambiguity caused by the unknown focal length, we propose a novel deep neural network that learns monocular depth by embedding the focal length information. However, the datasets used in most machine learning methods, such as the NYU dataset, the Make3D dataset, and the KITTI dataset, all have a fixed focal length. Learning monocular depth with focal length requires datasets with varying focal lengths, so that the camera’s intrinsic information can be taken into account in both the learning and the inference phases. Considering the labor involved in building a new varying-focal-length dataset, it is desirable to transform the existing fixed-focal-length datasets into varying-focal-length ones. We therefore first introduce a method to generate a varying-focal-length dataset from a fixed-focal-length dataset, such as Make3D and NYU v2, and propose a simple and effective method to fill the holes produced during image generation. The transformed datasets are shown to contribute substantially to depth estimation.
In order to learn fine-grained monocular depth with focal length, we first propose an effective neural network to predict accurate depth, which achieves competitive performance compared with the state-of-the-art methods, and then embed the focal length information into the proposed model. In the literature, almost all works on pixel-wise prediction exploit an Encoder-Decoder network [12, 13] to infer the labels of pixels. To predict accurate labels, two general approaches have been taken. One is to integrate middle-layer features [14, 15, 12, 16, 17]; the other is to exploit the multi-scale information and the decoder side outputs [3, 5, 18, 19]. Inspired by the idea of fusing the middle-level information, we propose a novel end-to-end neural network to learn fine-grained depth from single images with embedded focal length. The proposed network is composed of four parts: the first part is built on the pre-trained VGG model; the second part consists of the global transformation layer and the upsampling architecture, which produce high-resolution depth; the third part integrates the middle-level information to infer structural details by converting the middle-level information to the space of the depth; and the last part embeds the focal length into the global information.
The proposed network is extensively evaluated on the Make3D, NYU v2, and KITTI datasets. We first perform the experiments without the embedded focal length, and achieve better performance than the state-of-the-art techniques in both quantitative and qualitative terms. The network is then evaluated with the embedded focal length on the newly generated varying-focal-length datasets for comparison. The experimental results show that depths inferred from the model with embedded focal length significantly outperform those inferred without the focal length in all error measures. This also demonstrates that the focal length information is very useful for depth extraction from a single image.
In summary, the contributions of this paper are four-fold. First, we prove that there is an ambiguity between the focal length and the depth estimated from a single image, and further demonstrate the result using real images. Second, we propose a method to generate visually plausible varying-focal-length images from fixed-focal-length images. Third, based on the classical Encoder-Decoder network, we propose a novel neural network model that learns fine-grained depth from single images by effectively fusing the middle-level information. Finally, given the newly generated varying-focal-length datasets, we revise the fine-grained network by embedding the focal length information. The experimental evaluation shows that depth inference with known focal length achieves significantly better performance than inference without the focal length information. The source code and the generated datasets will be available on the author’s website.
The rest of this paper is organized as follows: Section II introduces the related works. The ambiguity between the focal length and monocular depth estimation is discussed in Section III. Section IV describes the generating process from fixed-focal-length dataset to varying-focal-length dataset. The proposed fine-grained network embedding focal length information is elaborated in Section V, and the experimental results on the four datasets are reported in Section VI. The paper is concluded in Section VII.
II Related Work
Depth extraction from single images has received a lot of attention in recent years, yet it remains a very hard problem due to the inherent ambiguity. To tackle this problem, classic methods [20, 21, 22, 23, 24, 25, 26, 27] usually make strong geometric assumptions that the scene structure consists of horizontal planes, vertical walls, and superpixels, and employ a Markov random field (MRF) to infer the depth from handcrafted features. One of the first works, proposed by Hoiem et al., creates realistic-looking reconstructions of outdoor images by assuming planar scene composition. In [21, 22], simple geometric assumptions have proven effective in estimating the layout of a room. To improve the accuracy of depth-based methods, Saxena et al. [23, 24] utilized an MRF to infer depth from both local and global features extracted from the image. In addition, superpixels are introduced in the MRF formulation to enforce neighboring constraints. The work has also been extended to 3D reconstruction of scenes.
Non-parametric algorithms [2, 29, 30, 31] are another class of classical methods for learning depth from a single image, relying on the assumption that similarities between regions in the RGB images imply similar depth cues. After clustering the training dataset based on global features (e.g. GIST, HOG), these methods first search the feature space for candidate RGB-D pairs of the input RGB image; the candidate pairs are then warped and fused to obtain the final depth. Karsch et al. proposed a depth transfer method that warps the retrieved RGB-D candidates using SIFT flow, followed by a global optimization framework to smooth the resulting depth. He et al. employed a sparse SIFT flow to speed up the depth inference of that work. Konrad et al. computed a median over the retrieved depth maps, followed by cross-bilateral filtering for smoothing. Instead of warping the retrieved candidates, Liu et al. explored continuous variables to represent the depth of image superpixels and discrete ones to encode relationships between neighboring superpixels, formulating depth estimation as an optimization problem over a discrete-continuous graphical model. For learning indoor depth, Zhuo et al. exploited the structure of the scene at different levels to learn depth from a single image.
Recently, convolutional neural networks have seen remarkable advances in high-level computer vision problems, and have also been applied with great success to depth extraction from single images [36, 3, 37, 38, 39, 40, 4, 5]. There are two major approaches to depth estimation in the related references: multi-scale training, and superpixel pooling with a conditional random field (CRF). To accelerate the convergence of the parameters during training, most works are built upon winning architectures of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), often initializing their networks with AlexNet, VGG, or ResNet. Eigen et al. first addressed this issue by fusing the depths from a global network and a refinement network. Their work was later extended to a deeper multi-scale convolutional network that predicts depth, normals, and semantic labels from a single image. Other methods obtain fine-grained depth by combining the representation of the neural network with the inference of CRFs. Liu et al. presented a deep convolutional neural field model based on fully convolutional networks and a novel superpixel pooling method, combining the strength of deep CNNs and the continuous CRF in a unified CNN framework. Li et al. and Wang et al. leveraged hierarchical CRFs to refine their patch-wise predictions from the superpixel level down to the pixel level. Roy et al. combined random forests and convolutional neural networks to tackle depth estimation. Laina et al. built a neural network on ResNet, followed by specially designed up-sampling blocks to obtain high-resolution depth. However, the middle-level features are not fused into the network to obtain detailed depth information. Based on the multi-scale network [36, 3], Dan et al. exploited the side outputs of deep networks to infer depth by reformulating the continuous CRFs of monocular depth estimation as sequential deep networks.
For all these depth learning methods, the experimental datasets are usually created with a Kinect or a laser scanner, where the RGB camera has a fixed focal length. In other words, the depth datasets currently available in the literature all have a fixed focal length. However, there exists an inherent ambiguity between monocular depth estimation and focal length, as described in our work. Without knowing the camera’s focal length, the depth cannot be truly estimated from a single image. To remove the ambiguity, the camera’s focal length should be considered in both the depth learning and inference phases. In the following section, we discuss the inherent ambiguity in depth recovery in detail.
III Inherent Ambiguity
Scene depth refers to the distance from the camera optical center to the object along the optical axis. In deep learning based methods for monocular depth estimation, the depth of each pixel is inferred by fusing global and contextual information, extracted from the corresponding receptive field in the input image, through affine transformations and non-linear operations, as illustrated by the following equation:
$$
d_p = \sigma\left(\mathcal{R}_p;\, \theta\right) \tag{1}
$$
where $d_p$ is the depth of the pixel $p$, $\mathcal{R}_p$ is the receptive field corresponding to the pixel $p$ in the depth map, $\sigma$ is the activation function, and $\theta$ are the parameters of the model.
In order to extract long-range global information, deep neural networks were introduced for monocular depth estimation. However, deeper networks are very hard to train due to vanishing or exploding gradients. In addition, depth raises another problem concerning the receptive fields. For a specific network architecture, we can infer the theoretical receptive field associated with an output node in every layer. However, the contributions of the various regions within the theoretical receptive field are not the same. To explore the role of each pixel location in the field of view, we propose a novel method to visualize the effective receptive field, as shown in Figure 1. Proceeding from the output layer to the input layer, the number of times each pixel is involved in the convolution operations is accumulated layer by layer.
In current networks for depth estimation from single images, the convolution operation usually shares weights within each channel, and the weights are initialized by sampling a Gaussian with zero mean and 0.01 variance. Once the network is trained, the parameters within each channel are fixed and shared. The number of times each pixel is used for the final prediction describes the complexity of the combination of network weights at each layer, including affine transformations and non-linear operations: the higher the complexity of the combination, the better the ability to characterize the corresponding task. In a statistical sense, this number indicates how frequently the pixel information is used in monocular depth estimation, regardless of the weights, which makes it possible to view the contribution of each pixel. We observe that the deeper the network, the larger the value in the middle of the receptive field and the smaller the value along its edge, which reveals that the actual receptive field is smaller than the theoretical receptive field and follows a Gaussian distribution, as described in Luo et al. To enlarge the field of view in a specific network, a fully connected layer is a better choice when the resolution of the feature maps is very small.
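The counting procedure described above can be illustrated with a minimal sketch. It assumes a plain stack of stride-1 convolutions with square kernels and ignores pooling, learned weights, and non-linearities; the function names `path_counts_1d` and `path_counts_2d` are illustrative and do not come from our implementation. Walking from the output unit back to the input, each layer spreads every count over the positions covered by its kernel, so the final map records how many times each input pixel contributes to the prediction.

```python
import numpy as np

def path_counts_1d(kernel_sizes):
    """Number of times each input position feeds a single output unit.

    Assumes stride-1 convolutions applied in the given order. Going from
    the output back to the input, a layer with kernel size k spreads each
    count over k neighbouring positions.
    """
    counts = np.array([1], dtype=np.int64)  # one output unit
    for k in reversed(kernel_sizes):
        counts = np.convolve(counts, np.ones(k, dtype=np.int64))
    return counts

def path_counts_2d(kernel_sizes):
    """2-D counts for square kernels (separable, hence an outer product)."""
    c = path_counts_1d(kernel_sizes)
    return np.outer(c, c)

counts = path_counts_2d([3, 3, 3])
print(counts.shape)  # (7, 7): the theoretical receptive field
print(counts[3])     # middle row: largest at the centre, smallest at the edge
```

Even in this weight-free setting the counts concentrate in the middle of the theoretical receptive field, which is consistent with the observation that the effective receptive field is smaller than the theoretical one.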
The methods for monocular depth estimation are based on the assumption that the similarities between regions in the RGB images imply similar depth cues as well. There exists an inherent ambiguity between the focal length and the scene depth learned from a single image, as analyzed in the following.
Based on the imaging principle, the image of an object captured by a long-focal-length camera at a far distance could be exactly the same as the one captured by a short-focal-length camera at a short distance. This is called the indistinguishability between the scene depth and the focal length in images. Without loss of generality, we assume that the camera follows the pinhole model and, for simplicity, that the space object is a line segment parallel to the image plane, as shown in Figure 2. Let $L$ denote the length of the object. Its images under focal lengths $f_1$ and $f_2$ at depths $d_1$ and $d_2$ have lengths $l_1 = f_1 L / d_1$ and $l_2 = f_2 L / d_2$, respectively, where $f_1 \neq f_2$. As a result, we are not able to infer the real depth from the projected image without the camera focal length, since $l_1 = l_2$ whenever $f_1 / d_1 = f_2 / d_2$, as shown in Figure 2.
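A small numerical check makes the ambiguity concrete. The sketch below assumes an ideal pinhole camera and a segment parallel to the image plane; the object length and the two depths are hypothetical values chosen so that the focal-length-to-depth ratio is the same for both cameras, and `image_length` is an illustrative helper.

```python
import math

def image_length(focal_length, object_length, depth):
    """Pinhole model: image size of a segment parallel to the image plane."""
    return focal_length * object_length / depth

# Hypothetical setup (all lengths in millimetres): a 500 mm segment seen
# with a 50 mm lens at 2000 mm and with a 105 mm lens at 4200 mm.
near = image_length(50.0, 500.0, 2000.0)
far = image_length(105.0, 500.0, 4200.0)
print(math.isclose(near, far))  # True: the two images have the same size
```

Since both configurations yield the same image, no method that sees only the image can tell them apart.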
To demonstrate the ambiguity between the depth and the focal length, we collected 250 image pairs in a laboratory setting with similar context. The images are captured by the same camera at two different focal lengths: 50 and 105 , where the actual depth difference between the two images in each pair is at least 1 . Then, we employ the methods of Liu et al. and Eigen et al. to infer the depth on this dataset. Some experimental results are shown in Figure 4. The depths of the image pairs at the two focal lengths are measured interactively, as shown in Figure 3. The focal length of the left image is 105 , and that of the right one is 50 . Given the depths inferred by Liu et al., matching regions of a fixed object are selected to compute the average depth. The experiment shows that the average depth difference is 0.07506 , while the actual depth difference between the two images is 2 . Using this measure, we evaluate the methods of Liu et al. and Eigen et al. on the collected dataset, as reported in Table I; the corresponding error rates are and respectively. The experiments demonstrate that there exists an inherent ambiguity between the focal length and the scene depth learned from single images.
IV Dataset Transformation
In order to remove the above ambiguity, the camera’s intrinsic parameters should be taken into account when learning depth from single images; at the very least, the focal length should be used as input in both the training and testing phases. However, all available depth datasets in the literature (such as Make3D and NYU v2) have a fixed focal length. We therefore propose an approach to transform a fixed-focal-length dataset into a varying-focal-length dataset. The pipeline of the proposed approach is shown in Figure 5, and the implementation details of the dataset transformation are described in the following subsections.
IV-A Varying-focal-length image generation
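As shown in Figure 5, given the camera’s intrinsic parameters and the corresponding RGB-D image, the imaging process can be formulated as:
$$
d \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \tag{2}
$$
where $(u_0, v_0)$ is the principal point, $f$ is the focal length, $d$ is the corresponding depth value, and $(X, Y, Z)^\top$ is the 3D space point in the camera system corresponding to the image pixel $(u, v)$.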
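To transform the 3D points from the original camera coordinate system to a new system, a rotation and a translation are performed according to
$$
\begin{bmatrix} X' \\ Y' \\ Z' \end{bmatrix} = R \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + T \tag{3}
$$
where $(X, Y, Z)^\top$ and $(X', Y', Z')^\top$ are the coordinates of a 3D point before and after the transformation, $R$ is the rotation matrix, and $T$ is the translation vector. As shown in Figure 5, the camera coordinate system is thereby transformed to a new system.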
By specifying a new focal length, or new camera’s intrinsic matrix, the transformed 3D scene points can be projected to new image points. During the process of reprojection, multiple 3D points along the ray may be projected to the same image pixel, such as the 3D points and pixel in Figure 5. To solve this issue, we only project the 3D point with the smallest depth value, since other points are occluded by the nearest one. To obtain a real image, the new image points are quantized, and the RGB values are taken from the corresponding original image.
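The generation step above (back-projection, rigid motion, reprojection with a new focal length, and nearest-point selection) can be sketched as follows. This is a simplified NumPy illustration rather than our released code: it assumes a single focal length for both image axes, treats depth 0 as invalid, and uses the hypothetical function name `reproject`. Target pixels that receive no 3D point are left at depth 0, which corresponds to the holes handled in post-processing.

```python
import numpy as np

def reproject(rgb, depth, f_old, f_new, principal, R=None, T=None):
    """Re-render an RGB-D image for a new focal length and camera pose."""
    R = np.eye(3) if R is None else np.asarray(R, dtype=np.float64)
    T = np.zeros(3) if T is None else np.asarray(T, dtype=np.float64)
    h, w = depth.shape
    u0, v0 = principal
    v, u = np.mgrid[0:h, 0:w]
    valid = depth > 0

    # Back-project valid pixels to 3D points in the original camera frame.
    d = depth[valid].astype(np.float64)
    P = np.stack([(u[valid] - u0) * d / f_old,
                  (v[valid] - v0) * d / f_old,
                  d])
    colors = rgb[valid]

    # Rigid motion into the new camera frame; keep points in front of it.
    P = R @ P + T[:, None]
    front = P[2] > 0
    P, colors = P[:, front], colors[front]

    # Project with the new focal length and quantize to integer pixels.
    un = np.rint(f_new * P[0] / P[2] + u0).astype(np.int64)
    vn = np.rint(f_new * P[1] / P[2] + v0).astype(np.int64)
    inside = (un >= 0) & (un < w) & (vn >= 0) & (vn < h)
    lin = vn[inside] * w + un[inside]
    z, colors = P[2, inside], colors[inside]

    # Occlusion: for each target pixel keep the point with the smallest depth.
    order = np.lexsort((z, lin))  # sort by pixel index, then by depth
    lin, z, colors = lin[order], z[order], colors[order]
    _, first = np.unique(lin, return_index=True)

    new_depth = np.zeros(h * w)
    new_rgb = np.zeros((h * w, rgb.shape[2]), dtype=rgb.dtype)
    new_depth[lin[first]] = z[first]
    new_rgb[lin[first]] = colors[first]
    return new_rgb.reshape(h, w, -1), new_depth.reshape(h, w)
```

Sorting by pixel index and then by depth makes the nearest-point rule explicit and avoids relying on the write order of repeated indices.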
IV-B Post-processing of the generated varying-focal-length datasets
After the above operations, some holes are produced in the generated RGB-D image, as shown in Figure 6. By analyzing the shapes and properties of the holes, we propose a simple yet effective method to fill these holes.
First, we locate the positions of the empty holes, and then design binary filters to fill them. The holes are filled using the corresponding binary templates, which fall into three main classes, as shown in Figure 7, where 0 represents a hole pixel and 1 represents a pixel that is not a hole.
For class-a, a 4-neighborhood binary template is employed for mean interpolation. For class-b, we directly use the corresponding templates for mean interpolation. For class-c, where the template elements are all equal to zero, we iteratively perform interpolation using the intermediate interpolation results as follows. Since the iteration proceeds from left to right and from top to bottom, at least one of the two pixels at m and n has been interpolated by a previous iteration; the (RGB-D) value at pixel k is then assigned the value at either m or n, chosen at random.
Through the above proposed filtering process, the projected holes could be filled. Some filling results are shown in Figure 6.
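A simplified version of this filling scheme can be sketched as follows. The sketch only covers the depth channel, replaces the three template classes by a single 4-neighborhood mean over the valid neighbors, and treats depth 0 as a hole; these simplifications and the name `fill_holes` are assumptions made for illustration. As in the class-c case, pixels are visited from left to right and top to bottom, so values interpolated earlier in a pass are reused by later pixels.

```python
import numpy as np

def fill_holes(depth, max_passes=10):
    """Fill zero-valued holes with the mean of their valid 4-neighbours."""
    filled = depth.astype(np.float64).copy()
    h, w = filled.shape
    for _ in range(max_passes):
        holes = np.argwhere(filled == 0)  # row-major, i.e. raster order
        if holes.size == 0:
            break
        for r, c in holes:
            neighbours = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            vals = [filled[rr, cc] for rr, cc in neighbours
                    if 0 <= rr < h and 0 <= cc < w and filled[rr, cc] > 0]
            if vals:  # holes without valid neighbours wait for a later pass
                filled[r, c] = sum(vals) / len(vals)
    return filled
```

The same loop can be applied per color channel to fill the RGB image.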
IV-C Implementation details
Based on extensive experiments, we find that a reasonable range of the rotation angle should be within. Upon completion of the rotation, if the center of the new image coincides with the original one, the translation vector is computed as follows.
If the rotation is around the axis by angle , we set
where is the number of 3D points, and is the assigned new focal length.
If the rotation is around the axis by angle , we set
Using the above approach, we have transformed the NYU dataset and the Make3D dataset into new varying-focal-length (VFL) datasets. According to equations (2) and (3), the depth maps of the transformed images are generated by a strict geometric relationship. Some holes are introduced in the quantization stage. However, the hole portion of the depth map is very small, as shown in Figure 6, benefiting from the completion technique in equations (4) and (5). By making use of contextual information, the holes of the depth map are filled with the proposed filtering method, and the result visually approaches the ground truth ().
Figure 8 shows two examples of the newly generated images from the Make3D dataset and the NYU dataset. For the generated VFL datasets, the focal-length values are 460, 500, 540, 620, 660, and 700 pixels, where the initial focal length is 580 pixels. Visual inspection of the results indicates that the generated datasets are geometrically reasonable.
V Learning Monocular Depth with Deep Neural Network
In this section, based on the varying-focal-length datasets, we propose a new model to learn depth from a single image by embedding focal length information.
V-A Network Model
Current DNN architectures are mostly built on the network for digit recognition, which consists of convolution, pooling, and fully connected layers. The essential power behind their remarkable success is that the framework selects invariant abstract features that are suitable for high-level problems. For pixel-wise depth prediction, techniques such as deconvolution or upsampling [36, 3, 4, 5] have been proposed to remedy the resolution loss caused by strided convolution or pooling operations. Since these operations are usually applied on the last convolutional layer, it is very hard to accurately restore spatial structure information. To obtain pixel-wise fine-grained results, the classical skip connection is exploited, as described in the U-Net and the FCN. For monocular depth learning, since the distribution of depth differs from that of the categories in the pre-trained model, we propose a novel transfer network (T-net), which converts feature maps from category cues to the depth mapping rather than utilizing feature maps directly from previous layers.
Our proposed network, which is symmetric about its middle layer, can be efficiently trained in an end-to-end manner, as illustrated in Figure 9. The first part of the network is based on the VGG network and is initialized with the corresponding pre-trained weights. The second part consists of the global transfer layer and the upsampling architecture, which respectively transform the global information from category cues to the depth mapping and recover high-resolution depth. The third part consists of the T-nets, which convert the middle-level information to match the distribution of the monocular depth. The last part consists of three fully connected layers for embedding the focal length information. Here, the focal length is first replicated into a vector of seven identical values, which is then connected to layers of 64 and 512 nodes in turn; finally, the 512 nodes are concatenated with the global information.
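The focal-length branch can be sketched in a few lines of NumPy. The layer widths (7, 64, 512) follow the description above; the ReLU activation, the normalization of the focal length by 580, the size 4096 of the global feature vector, and the names `dense` and `embed_focal_length` are assumptions made only for this illustration. The weights are drawn from a zero-mean Gaussian with standard deviation 0.1, i.e. variance 0.01.

```python
import numpy as np

def dense(x, w, b):
    """Fully connected layer followed by ReLU (the activation is assumed)."""
    return np.maximum(x @ w + b, 0.0)

def embed_focal_length(focal, global_feat, params):
    """Concatenate a learned focal-length code with the global features.

    focal:       (B,) one focal length per image
    global_feat: (B, G) output of the global information layer
    params:      weights w1 (7, 64), w2 (64, 512) and biases b1, b2
    """
    x = np.repeat(focal[:, None], 7, axis=1)   # seven identical inputs
    x = dense(x, params["w1"], params["b1"])   # 7 -> 64
    x = dense(x, params["w2"], params["b2"])   # 64 -> 512
    return np.concatenate([global_feat, x], axis=1)

rng = np.random.default_rng(0)
params = {
    "w1": rng.normal(0.0, 0.1, size=(7, 64)), "b1": np.zeros(64),
    "w2": rng.normal(0.0, 0.1, size=(64, 512)), "b2": np.zeros(512),
}
focal = np.array([460.0, 700.0]) / 580.0       # hypothetical normalization
features = rng.normal(size=(2, 4096))          # hypothetical global features
out = embed_focal_length(focal, features, params)
print(out.shape)  # (2, 4608)
```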
To effectively fuse the middle-level information, we divide the pre-trained VGG network into 5 blocks according to the resolutions of the feature maps, as shown by the green blocks on the left of Figure 9. The depth of the neural network is important for depth estimation, as described in Laina et al.: a deeper network generally improves the accuracy of depth extraction. However, in a very deep network the actual receptive field may be smaller than the theoretical receptive field, as illustrated in Section III. Inspired by this observation, we propose a fully connected layer to bridge the subsampling module and the upsampling module, which obtains a full-resolution receptive field and simultaneously converts the global information from category to depth. To obtain high-resolution depth, we follow the work described in  by introducing unpooling layers, which map each pixel into the top-left corner of a 2×2 (zero) kernel to double the feature map size, followed by a convolution to fuse the information, as shown in the Upconv X architecture in Figure 9.
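The unpooling step described above can be sketched as follows. This is a minimal NumPy illustration assuming a 2×2 zero block per input value and a height × width × channels layout; the function name `unpool_2x` is hypothetical, and the subsequent convolution that fuses the information is omitted.

```python
import numpy as np

def unpool_2x(x):
    """Place each value in the top-left corner of a 2x2 zero block.

    x has shape (H, W, C); the result has shape (2H, 2W, C).
    """
    h, w, c = x.shape
    out = np.zeros((2 * h, 2 * w, c), dtype=x.dtype)
    out[::2, ::2, :] = x
    return out

x = np.arange(1, 5, dtype=np.float32).reshape(2, 2, 1)
y = unpool_2x(x)
print(y.shape)  # (4, 4, 1)
```

Three quarters of the unpooled map are zeros, which is why a convolution follows to spread the information over the enlarged grid.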
To effectively exploit the middle-layer features, we propose the T-net architecture, inspired by ResNet [44, 49] and Highway networks [50, 51], to facilitate the propagation of detailed structural information during both the forward and the backward stages. Identity mappings with shortcuts facilitate the optimization of deep networks, since they iteratively generate responses of small magnitude while passing the main information layer by layer, in analogy to a Taylor series. While the global information is propagated through the first and second parts of the architecture, the T-nets in the third part transform the detailed information. The first layer of each T-net removes redundant information by reducing the number of channels, and is followed by another layer that converts the feature cues. The feature maps from the T-net are concatenated with the corresponding features generated by the previous layer in the second part, followed by unpooling and convolution operations to remedy the low resolution. As illustrated in Figure 9, the feature maps in pink are generated by the previous layer, and the feature maps in green are the middle-level information transformed by the T-net.
V-B Loss function
The parameters of the proposed network are learned by minimizing a loss function defined on the prediction and the ground truth. In general, the mean squared error (MSE) is taken as the standard loss, which minimizes the squared Euclidean norm between the predictions $y$ and the ground truth $y^*$:
$$
L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left\| y_i - y_i^* \right\|_2^2
$$
where $N$ is the number of valid pixels in the batch of training images.
However, MSE struggles to handle the uncertainty inherent in recovering lost high-frequency details: minimizing MSE encourages pixel-wise averages of plausible solutions, leading to blurred predictions, as described in [52, 53, 54]. The L1 norm yields better detail than the L2 norm in this respect. In our experimental study, we find that the depth error at far distances is larger than that at close distances. Inspired by this observation, we introduce a weighted loss function that penalizes the pixels with large errors more heavily: large gradients are propagated at the locations of large errors during training, which coincides with the gradient behavior of the L2 norm. As described in Zwald and Lambert-Lacroix, the BerHu loss function, which combines the L2 and L1 norms, is appropriate for this situation. Following Laina et al., we take the BerHu loss below as the error function, integrating the advantages of both norms and resulting in accelerated optimization and detailed structure.
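$$
\mathcal{B}(x_i) = \begin{cases} |x_i|, & |x_i| \le c, \\ \dfrac{x_i^2 + c^2}{2c}, & |x_i| > c, \end{cases}
$$
where $x_i = y_i - y_i^*$ is the difference between the prediction and the ground truth at pixel $i$, $c = \frac{1}{5} \max_i |x_i|$, and $i$ indexes the pixels in the current batch.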
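For illustration, the BerHu loss in the form given by Laina et al. can be sketched in NumPy as follows. The residual is penalized linearly up to a threshold equal to one fifth of the largest absolute residual in the batch, and quadratically beyond it. Averaging over the valid pixels, the optional `valid` mask, and the name `berhu_loss` are assumptions of this sketch.

```python
import numpy as np

def berhu_loss(pred, target, valid=None):
    """Reverse Huber (BerHu) loss averaged over the valid pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if valid is None:
        valid = np.ones(pred.shape, dtype=bool)
    x = np.abs(pred - target)[valid]
    c = 0.2 * x.max()          # threshold: 1/5 of the largest residual
    if c == 0.0:               # perfect prediction, nothing to penalize
        return 0.0
    # L1 below the threshold, scaled L2 above it (continuous at |x| = c).
    loss = np.where(x <= c, x, (x ** 2 + c ** 2) / (2.0 * c))
    return float(loss.mean())
```

With residuals 1, 0.1, and 5, for example, the threshold is 1, so the first two residuals are penalized linearly and the third one quadratically.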
VI Experiments
To demonstrate the effectiveness of the proposed deep neural network and the embedded focal length for monocular depth estimation, we carry out comprehensive experiments on four publicly available datasets, NYU v2, Make3D, KITTI, and SUNRGBD, as well as on the varying-focal-length datasets generated in Section IV. In the following subsections, we report the details of our implementation and the evaluation results.
VI-A Experimental setup
Datasets. The NYU Depth v2 dataset consists of 464 scenes captured using a Microsoft Kinect. Following the official split, the training set is composed of 249 scenes with 795 pair-wise images, and the test set includes 215 scenes with 654 pair-wise images. In addition, the raw dataset contains 407,024 new unlabeled frames. For data augmentation, we sample equally spaced frames out of each raw training sequence and align the RGB-D pairs using the provided toolbox, resulting in approximately 4k RGB-D images.
Then, the sampled raw images and the 795 pair-wise images are augmented online following Eigen et al. The input images and the corresponding depths are simultaneously transformed using small scaling, color transformations, and flips with a chance of 0.5; we then randomly crop the augmented images and depths down to the input size of the network. Note that the following datasets are augmented online by the same strategy. As a result, the number of samples after data augmentation on NYU Depth is about 48k, far fewer than the 2M used for the coarse network and the 1.5M used for the fine network in Eigen et al. Due to hardware limitations, we down-sample the original frames from size pixels to as the input to the network.
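The key point of this augmentation is that the image and its depth map must undergo exactly the same geometric transformation. A minimal sketch of the flip and crop steps is given below; scaling and color transformations are omitted, and the function name `augment_pair` and the crop size used in the example are hypothetical.

```python
import numpy as np

def augment_pair(rgb, depth, crop_hw, rng):
    """Apply the same random flip and random crop to an RGB-D pair."""
    if rng.random() < 0.5:                      # horizontal flip, chance 0.5
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    crop_h, crop_w = crop_hw
    h, w = depth.shape
    top = rng.integers(0, h - crop_h + 1)       # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    window = (slice(top, top + crop_h), slice(left, left + crop_w))
    return rgb[window], depth[window]

rng = np.random.default_rng(0)
rgb = rng.random((6, 8, 3))
depth = rng.random((6, 8))
rgb_aug, depth_aug = augment_pair(rgb, depth, (4, 5), rng)
print(rgb_aug.shape, depth_aug.shape)  # (4, 5, 3) (4, 5)
```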
The Make3D dataset  contains 400 training images and 134 testing images of outdoor scenes, generated from a custom 3D laser scanner. While the depth map resolution of the ground truth is only , not matching the corresponding original RGB images with pixels, we resize all RGB-D images to by preserving the aspect ratio of the original images. Due to the neural network architecture and hardware limitations, we subsample the resolution of the RGB-D images to .
The KITTI dataset contains 93k depth maps with corresponding raw LiDAR scans and RGB images. Following the suggestion in Uhrig et al., the training set is composed of 86k pair-wise images, and the test set includes 1k pair-wise images selected from the full validation split. Since the LiDAR returns no measurements in the upper part of the images, we only use the bottom two thirds of the images to produce a fixed crop size of . To reduce the computational load, we randomly crop the images from the resolution to during the training stage.
The varying-focal-length (VFL) datasets comprise two datasets, VFL-NYU and VFL-Make3D, which are generated from NYU Depth v2 and Make3D respectively, as described in Section IV. For VFL-NYU, the training and test sets of every focal length are split in the official manner. We augment the training samples in the same manner as for NYU above, without considering the raw unaligned frames, producing approximately 30k training pairs in total. For the VFL-Make3D dataset, we apply the same operations as for the Make3D dataset above, resulting in about 17k training pairs.
The SUNRGBD dataset contains 10,335 RGB-D images, a scale similar to that of PASCAL VOC, captured by four different sensors: Intel RealSense 3D Camera for tablets, Asus Xtion LIVE PRO for laptops, and Microsoft Kinect versions 1 and 2 for desktop. Although this dataset covers various focal lengths, it differs from the datasets generated by our VFL approach. In our approach, the varying-focal-length datasets are generated from fixed-focal-length datasets, so the images with varying focal lengths show the same scene, while in the SUNRGBD dataset, images of different focal lengths correspond to different scenes. In addition, the SUNRGBD dataset involves more distortion parameters, caused by the four different sensors. Following the official split, the training set is composed of 5285 pair-wise images, and the test set includes 5050 pair-wise images. In the following experiments, we sample frames out of the source dataset, resulting in 2642 pair-wise training images and 1010 pair-wise test images.
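Evaluation Metrics. For quantitative evaluation, we report errors obtained with the following widely adopted error metrics.
Average relative error: $\mathrm{rel} = \frac{1}{N} \sum_{i} \frac{|d_i - d_i^*|}{d_i^*}$
Root mean squared error: $\mathrm{rms} = \sqrt{\frac{1}{N} \sum_{i} (d_i - d_i^*)^2}$
Accuracy with threshold $t$: percentage (%) of $d_i$ subject to $\max\left(\frac{d_i}{d_i^*}, \frac{d_i^*}{d_i}\right) = \delta < t$
where $d_i$ is the estimated depth, $d_i^*$ denotes the corresponding ground truth, and $N$ is the total number of valid pixels in all images of the validation set.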
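These metrics can be computed as in the following sketch. It assumes that pixels with ground-truth depth 0 are invalid and uses the thresholds 1.25, 1.25², and 1.25³ that are customary in the literature; both choices, as well as the name `depth_metrics`, are assumptions of the sketch rather than a description of our evaluation code. The accuracies are returned as fractions in [0, 1].

```python
import numpy as np

def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Return (rel, rms, accuracies) over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    rel = np.mean(np.abs(pred - gt) / gt)           # average relative error
    rms = np.sqrt(np.mean((pred - gt) ** 2))        # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)        # delta per pixel
    acc = [float(np.mean(ratio < t)) for t in thresholds]
    return float(rel), float(rms), acc

rel, rms, acc = depth_metrics([1.0, 2.0, 4.0], [1.0, 1.0, 2.0])
print(rel, rms, acc)
```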
We use the TensorFlow deep learning framework to implement the proposed network, and train the network on a single NVIDIA GeForce GTX TITAN with 12GB memory. The objective function is optimized using the Adam method. During initialization, the weight layers in the first part of the architecture are initialized using the corresponding model (VGG) pre-trained on the ILSVRC dataset for image classification. The weights of the newly added layers are initialized by sampling a Gaussian with zero mean and 0.01 variance, and the learning rate is set to 0.0001. Finally, our model is trained with a batch size of 8 for about 20 epochs.
|Method||Error (lower is better)||Accuracy (higher is better)|
|Karsch et al. ||0.374||1.12||0.134||0.447||0.745||0.897|
|Liu et al. ||0.335||1.06||0.127||-||-||-|
|Li et al. ||0.232||0.821||0.094||-||-||-|
|Liu et al. ||0.230||0.824||0.095||0.614||0.883||0.975|
|Wang et al. ||0.220||0.745||0.094||0.605||0.890||0.970|
|Eigen et al.||0.215||0.907||-||0.611||0.887||0.971|
|R. and T. ||0.187||0.744||0.078||-||-||-|
|E. and F. ||0.158||0.641||-||0.769||0.950||0.988|
|E. and F. * ||0.155||0.576||0.065||0.787||0.948||0.986|
|L. * ||0.204||0.833||0.097||0.617||0.889||0.963|
VI-B Analysis of the different architectures and loss functions
In the first series of experiments we focus on the NYU Depth v2 dataset. The proposed model is evaluated and compared with other classical architectures and training loss functions. Specifically, we conduct the following comparisons: (i) T-net versus skip connection, (ii) BerHu loss versus L1 loss, and (iii) fully convolutional (GIL-convolution) versus fully connected (GIL-connected) global information layers for bridging the downsampling and upsampling parts. The results are reported in Table II. It is evident that the model with the T-net achieves better performance than the one with the standard skip connection.
The table also compares the proposed model trained with the BerHu loss and with the L1 loss. As expected, the model with the BerHu loss yields more accurate depth. Finally, we analyze the impact of the GIL on the accuracy of the monocular depth by comparing GIL-convolution with GIL-connected. The depth accuracy improves as the size of the receptive field increases.
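To make the difference between the two losses concrete, the following NumPy sketch implements a reverse Huber (BerHu) penalty next to the L1 loss. It assumes the commonly used form in which residuals up to a threshold c are penalized linearly and larger residuals quadratically, with c set to 20% of the largest absolute residual in the batch. That threshold rule and the function names are assumptions for illustration and may differ from the exact loss used in the paper.

```python
import numpy as np

def l1_loss(pred, gt):
    """Mean absolute error."""
    return float(np.mean(np.abs(pred - gt)))

def berhu_loss(pred, gt, c_ratio=0.2):
    """Reverse Huber loss: linear for small residuals, quadratic for large ones."""
    err = np.abs(pred - gt)
    c = c_ratio * err.max()  # assumption: threshold is a fraction of the largest residual
    if c == 0:
        return 0.0           # all residuals are zero
    quadratic = (err ** 2 + c ** 2) / (2 * c)
    return float(np.mean(np.where(err <= c, err, quadratic)))

pred = np.array([1.0, 2.0, 3.0, 6.0])
gt = np.array([1.1, 2.0, 3.0, 4.0])
print(l1_loss(pred, gt), berhu_loss(pred, gt))
```

In this toy example the single large residual dominates the BerHu value, which shows how the quadratic branch puts more weight on pixels with large errors than the L1 loss does.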
VI-C Comparisons with the state-of-the-art
We also compare our method with the state-of-the-art approaches on the NYU v2, Make3D and KITTI datasets. As baselines, we reproduce the algorithms of Laina et al. and the multi-scale network of Eigen and Fergus, both built on VGG and denoted as L. * and E. and F. *, respectively, in Table III. For Eigen and Fergus, we modify the network by removing the fully connected layers in scale 1 and applying the upsampling operation directly to the last convolutional layer, and we train the model end-to-end rather than stage-wise. The results of the other algorithms are taken from the original papers. The comparative results of the proposed approach and the baselines are reported in Table III. It is evident that our method performs significantly better than the state-of-the-art approaches. Comparing VGG-Laina et al. with VGG-ours, we find that the effective integration of middle-level information leads to better performance. In addition, the performance of our reproduced algorithms is comparable with that of the corresponding original methods. Figure 10 shows some qualitative comparisons of the depth maps recovered by our method and by Laina et al. on the NYU v2 dataset. The maps estimated by our method capture more detail than those of Laina et al., benefiting from the effective fusion of middle-level information by the T-net.
|Method||Error (lower is better)|
|Karsch et al. ||0.355||9.20||0.127|
|Liu et al. ||0.335||9.49||0.137|
|Liu et al. ||0.314||8.60||0.119|
|Li et al. ||0.278||7.19||0.092|
|Roy and Todorovic ||0.260||12.40||0.119|
|E. and F. *-VGG ||0.228||7.14||0.093|
|L. *-VGG ||0.236||7.54||0.091|
|Method||Error (lower is better)||Accuracy (higher is better)|
|E. and F. * ||0.095||4.131||0.042||0.893||0.976||0.993|
|L. * ||0.108||4.326||0.049||0.874||0.975||0.993|
|E. and F. * ||0.269||0.137||0.194|
|L. * ||0.182||0.098||0.142|
In addition, we evaluate our proposed model on the Make3D dataset, which was captured with a custom 3D laser scanner. Following [3, 4], the error metrics are computed on regions where the ground-truth depth is less than 70 m. We also reproduce the algorithms of VGG-Laina et al. and the multi-scale Eigen and Fergus with VGG, denoted as L. *-VGG and E. and F. *-VGG in Table IV. Our modified E. and F. *-VGG and VGG-ours outperform the other methods by a significant margin, which reveals that middle-level information, as well as multi-scale information, is useful for accurate depth inference. As expected, our proposed method yields more detailed structural information of the depth than Laina et al., as shown in Figure 11.
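The 70 m restriction amounts to masking pixels before the errors are averaged. The sketch below is one way to express this in NumPy. The function name `masked_rel_rms` and the additional rule that non-positive ground truth is invalid are assumptions for illustration, not the evaluation protocol code of the cited works.

```python
import numpy as np

def masked_rel_rms(pred, gt, max_depth=70.0):
    """Average relative error and RMS error over pixels with 0 < gt < max_depth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mask = (gt > 0) & (gt < max_depth)  # keep only ground truth closer than the cap
    d, g = pred[mask], gt[mask]
    rel = float(np.mean(np.abs(d - g) / g))
    rms = float(np.sqrt(np.mean((d - g) ** 2)))
    return rel, rms, int(mask.sum())

# Depths in metres; the last two pixels exceed the cap and are excluded.
gt = np.array([10.0, 50.0, 80.0, 81.0])
pred = np.array([12.0, 45.0, 60.0, 20.0])
rel, rms, n = masked_rel_rms(pred, gt)
print(rel, rms, n)
```

Excluding far-range pixels keeps a few very distant and unreliable measurements from dominating the averaged errors.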
|Method||VFL-NYU test set ()||NYU test set (654)|
|Error (lower is better)||Accuracy (higher is better)||Error (lower is better)||Accuracy (higher is better)|
Furthermore, since Make3D is a very small dataset, we further evaluate the proposed approach on the KITTI dataset to demonstrate the advantage of the proposed model on outdoor images. Because the training and testing images differ in resolution, we replace the fully connected layer of our proposed network with a fully convolutional layer. For a fair comparison with the state-of-the-art methods, we also reproduce the algorithms of L. *-VGG and E. and F. *-VGG as above. The quantitative results of each approach are reported in Table V. The proposed approach yields lower error than both the L. *-VGG and the E. and F. *-VGG approaches, which demonstrates the advantage of the proposed model. As shown in Figure 12, compared with L. *-VGG and E. and F. *-VGG, two of the best methods in the literature, our approach recovers finer-grained depth. Note that our method and the reproduced algorithms use sparse point information to infer dense depth from a single image, which suggests that these methods can also be applied to 3D LiDAR data to address the depth completion problem.
In addition, we compare the execution time of the proposed method with that of the state-of-the-art algorithms. Table VI reports the actual runtime on the NYU v2, Make3D, and KITTI datasets, corresponding to resolution of , and , respectively. L. * is the fastest algorithm since it has fewer convolutional layers and filters. Since the proposed method uses T-nets to fuse middle-level information, it runs slightly slower than the L. * algorithm. However, our approach is still faster than the E. and F. * algorithm, as the latter uses large convolutional kernels to integrate multi-scale information. It is worth noting that our method takes only about 0.1 sec in total to recover the depth map for a single image (), which makes it possible to infer fine-grained monocular depth in real time.
To evaluate the convergence of the proposed method, the training curves on the NYU v2 dataset and the Make3D database are visualized in Figure 13, and the state-of-the-art approaches are also implemented for comparison. Notably, our algorithm exhibits lower training error, especially for the KITTI dataset, which contributes to the performance gains in Table III and Table V. In addition, thanks to the T-net architecture, our proposed method converges faster than L. *-VGG and E. and F. *-VGG at the early stage, which facilitates the optimization. These comparisons verify the effectiveness of the proposed method for learning depth from a single image.
|Method||VFL-Make3D test set ()||Make3D test set (134)|
|Error (lower is better)||Accuracy (higher is better)||Error (lower is better)||Accuracy (higher is better)|
VI-D Evaluations of VFL dataset with focal length information
Given the varying-focal-length datasets generated in Section IV, we use the network of Section V to learn depth from a single image, where the focal length is embedded in the network during both training and testing. For comparison, the experiments are also performed with L. *-VGG and E. and F. *-VGG. For E. and F. *-VGG, the focal length information is embedded in the last convolutional layer of scale 1, similarly to Section V. We apply the same operation to the last layer of the downsampling part in L. *-VGG. In addition, experiments without focal length are also performed with the above models for comparison.
For the VFL-NYU dataset, the experimental results are reported in Table VII, where NFL denotes a model without embedded focal length and FL denotes a model with embedded focal length. The models learned from the VFL-NYU dataset are also evaluated on the NYU test dataset. As shown in the table, for average relative error, the three models with embedded focal length (ours-FL and the two reproduced baselines with FL) improve by about two percentage points on average over the corresponding models without embedded focal length information. Figure 14 shows the improvement as a histogram, which reveals that each model with embedded focal length performs much better than its counterpart without the focal length. L. *-VGG achieves a significant margin, because a network with only one path can effectively propagate the focal length information during the forward and backward passes.
We also evaluate our approach and the state-of-the-art methods [4, 3] on the VFL-Make3D dataset, as reported in Table VIII, where the same trained models are also evaluated on the Make3D test dataset. For average relative error, the three approaches with embedded focal length information again improve by about two percentage points on average over the corresponding methods without the focal length information. As shown in Figure 15, all models with the embedded focal length information outperform the corresponding models without it. However, the gains in root mean squared error on the VFL-Make3D dataset are not as large as those on the VFL-NYU dataset, which is caused by the accuracy range of the ground truth and the size of the training dataset.
From Table VII and Table VIII, it is notable that the models with embedded focal length trained on the VFL-NYU and VFL-Make3D datasets achieve better performance than the corresponding models without the embedded focal length information on the NYU and Make3D test datasets, which again shows that the focal length information contributes to the performance increase in depth estimation from single images. However, comparing Table VII with Table III, the networks trained on the VFL-NYU dataset perform slightly worse than the corresponding ones trained on NYU Depth. This phenomenon is mainly caused by the fact that the VFL-NYU dataset is much smaller than the NYU dataset with raw video frames. For the models trained on NYU Depth, in addition to the 795 image pairs, we also fetch 4,000 samples from the raw dataset using the provided toolbox. In contrast, the VFL-NYU dataset is generated from only 1,449 image pairs, so it has fewer samples than the data used for the models in Table III. In addition, the VFL-Make3D and Make3D datasets have approximately the same number of samples, and the error difference between them is smaller than that between the VFL-NYU and NYU datasets, as reported in Table VIII and Table IV.
To further demonstrate the benefits of embedding the focal length, we also perform experiments on the SUNRGBD dataset. For a fair comparison with the state-of-the-art methods, we again reproduce the algorithms of E. and F. *-VGG and L. *-VGG in the same way. The quantitative results of each approach are reported in Table IX. The experimental results show that depths inferred by the models with embedded focal length significantly outperform those inferred without the focal length information in all error measures, which demonstrates the contribution of the focal length information to depth estimation from a single image.
|Method||Error (lower is better)||Accuracy (higher is better)|
The above experiments demonstrate that we can boost the inference accuracy of the depth when the focal length is embedded in the network in both learning and inference phases.
In this paper, focusing on the monocular depth learning problem, we first studied the inherent ambiguity between the scene depth and the focal length in theory, and verified it using real images. To remove the ambiguity, we proposed an approach to generate varying-focal-length datasets from the public fixed-focal-length datasets. Then, a novel deep neural network was proposed to infer fine-grained monocular depth from both the fixed- and varying-focal-length datasets. We demonstrated that the proposed model, without the embedded focal length information, achieves performance competitive with the state-of-the-art methods on the public datasets. Furthermore, on the newly generated, publicly available varying-focal-length datasets, the proposed approach and the state-of-the-art algorithms with embedded focal length yield a significant performance increase in all error metrics, compared with the corresponding models without encoded focal length. The extensive experiments demonstrate that embedding the focal length improves the accuracy of depth learning from single images.
This work was supported by the National Natural Science Foundation of China under Grant Nos. 61333015, 61421004, 61772444, and 61573351.
-  A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Learning 3d scene structure from a single still image,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 5, pp. 824–840, 2009.
-  K. Karsch, C. Liu, and S. B. Kang, “Depth extraction from video using non-parametric sampling,” in European Conference on Computer Vision. Springer, 2012, pp. 775–788.
-  D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.
-  I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 239–248.
-  D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, “Multi-scale continuous crfs as sequential deep networks for monocular depth estimation,” arXiv preprint arXiv:1704.02157, 2017.
-  C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture,” in Asian Conference on Computer Vision. Springer, 2016, pp. 213–228.
-  Y. Cao, C. Shen, and H. T. Shen, “Exploiting depth from single monocular images for object detection and semantic segmentation,” IEEE Transactions on Image Processing, vol. PP, no. 99, pp. 1–1, 2016.
-  J. Shotton, A. Kipman, A. Blake, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, and P. Kohli, “Efficient human pose estimation from single depth images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, p. 2821, 2013.
-  S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object detection in rgb-d images,” Computer Science, vol. 139, no. 2, pp. 808–816, 2016.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in ECCV, 2012.
-  A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.
-  H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
-  B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 447–456.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
-  G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation,” arXiv preprint arXiv:1611.06612, 2016.
-  Y. Liu, M.-M. Cheng, X. Hu, K. Wang, and X. Bai, “Richer convolutional features for edge detection,” arXiv preprint arXiv:1612.02103, 2016.
-  S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1395–1403.
-  D. Hoiem, A. A. Efros, and M. Hebert, “Automatic photo pop-up,” ACM transactions on graphics (TOG), vol. 24, no. 3, pp. 577–584, 2005.
-  A. G. Schwing and R. Urtasun, “Efficient exact inference for 3d indoor scene understanding,” in European Conference on Computer Vision. Springer, 2012, pp. 299–313.
-  V. Hedau, D. Hoiem, and D. Forsyth, “Thinking inside the box: Using appearance models and context based on room geometry,” in European Conference on Computer Vision. Springer, 2010, pp. 224–237.
-  A. Saxena, S. H. Chung, and A. Y. Ng, “Learning depth from single monocular images,” in NIPS, vol. 18, 2005, pp. 1–8.
-  ——, “3-d depth reconstruction from a single still image,” International journal of computer vision, vol. 76, no. 1, pp. 53–69, 2008.
-  G. Wang, H.-T. Tsui, and Q. J. Wu, “What can we learn about the scene structure from three orthogonal vanishing points in images,” Pattern Recognition Letters, vol. 30, no. 3, pp. 192–202, 2009.
-  G. Wang, Z. Hu, F. Wu, and H.-T. Tsui, “Single view metrology from scene constraints,” Image and Vision Computing, vol. 23, no. 9, pp. 831–840, 2005.
-  G. Wang, H.-T. Tsui, Z. Hu, and F. Wu, “Camera calibration and 3d reconstruction from a single view based on scene constraints,” Image and Vision Computing, vol. 23, no. 3, pp. 311–323, 2005.
-  R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.
-  C. Liu, J. Yuen, and A. Torralba, “Sift flow: Dense correspondence across scenes and its applications,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 5, pp. 978–994, 2011.
-  J. Konrad, M. Wang, and P. Ishwar, “2d-to-3d image conversion by learning depth from examples,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. IEEE, 2012, pp. 16–22.
-  M. Liu, M. Salzmann, and X. He, “Discrete-continuous depth estimation from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 716–723.
-  A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
-  L. He, Q. Dong, and G. Wang, “Fast depth extraction from a single image,” International Journal of Advanced Robotic Systems, vol. 13, no. 6, p. 1729881416663370, 2016.
-  W. Zhuo, M. Salzmann, X. He, and M. Liu, “Indoor scene structure analysis for single image depth estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 614–622.
-  D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366–2374.
-  F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for depth estimation from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5162–5170.
-  B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1119–1127.
-  P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille, “Towards unified depth and semantic prediction from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2800–2809.
-  A. Roy and S. Todorovic, “Monocular depth estimation using neural regression forest,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5506–5514.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  L. He, Q. Dong, and Z. Hu, “The inherent ambiguity in scene depth learning from single images,” Scientia Sinica Informationis, vol. 46, no. 7, pp. 811–818, 2016.
-  W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field in deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4898–4906.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  A. Dosovitskiy, J. Tobias Springenberg, and T. Brox, “Learning to generate chairs with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1538–1546.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
-  ——, “Training very deep networks,” in Advances in neural information processing systems, 2015, pp. 2377–2385.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv preprint arXiv:1609.04802, 2016.
-  M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” arXiv preprint arXiv:1511.05440, 2015.
-  M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Multi-view 3d models from single images with a convolutional network,” in European Conference on Computer Vision. Springer, 2016, pp. 322–337.
-  L. Zwald and S. Lambert-Lacroix, “The berhu penalty and the grouped effect,” arXiv preprint arXiv:1207.6868, 2012.
-  J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in International Conference on 3D Vision (3DV), 2017.
-  S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in Computer Vision and Pattern Recognition, 2015, pp. 567–576.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.