Self-Supervised Monocular Image Depth Learning and Confidence Estimation

03/14/2018 ∙ by Long Chen, et al. ∙ Bournemouth University

Convolutional Neural Networks (CNNs) need large amounts of data with ground truth annotation, a requirement that has limited the development and fast deployment of CNNs for many computer vision tasks. We propose a novel framework for depth estimation from monocular images with corresponding confidence in a self-supervised manner. A fully differentiable patch-based cost function is proposed, using the Zero-Mean Normalized Cross Correlation (ZNCC) with multi-scale patches as a matching strategy. This approach greatly increases the accuracy and robustness of the depth learning. In addition, the proposed patch-based cost function can provide a 0 to 1 confidence, which is then used to supervise the training of a parallel network for confidence map learning and estimation. Evaluation on the KITTI dataset shows that our method outperforms the state-of-the-art results.




1 Introduction

The human vision system is amazingly complex and extremely delicate. It can perceive depth through stereopsis, which relies on the displacement of the same object between the images received by the left and right retinas [1]. With extensive visual experience and through trial and error, humans develop the ability to use contextual depth cues to achieve good and reliable perception of depth and a better understanding of spatial structure. Some of these depth cues do not rely on stereopsis, such as object occlusion, perspective, familiar and relative size, depth from motion, lighting and shading. Therefore, if blind in one eye or when performing a monocular task such as endoscopic surgery, we can still judge distance from these intuitive depth cues. In contrast, it is hard for machine vision to infer the non-stereopsis depth cues. With the recent development of Deep Convolutional Neural Networks (DCNNs), machines can solve many computer vision problems when provided with very large human-annotated datasets such as ImageNet [2], which is known as supervised learning. However, acquiring labelled datasets is one of the biggest challenges for supervised learning, as it is an expensive, time-consuming and labour-intensive task.

Figure 1: Our proposed framework can simultaneously estimate depth and the confidence of estimated depth.

In this paper, we propose a novel self-supervised computational framework that mimics how a human learns various contextual depth cues from stereopsis. We train a DCNN to synthesize depth from one view of the stereo image pair, then reconstruct the other view from the synthesized depth, and finally use the stereo vision epipolar constraint [3] to minimize the error of the depth synthesis.

Our approach does not require ground truth depth for supervised training. Instead, we derive the implicit function for estimating depth from monocular images through the epipolar constraint of the stereo image pair. Therefore, the method can be regarded as self-supervised learning. Compared with previous work [4] [5] [6] addressing the same problem, we incorporate a patch-based image evaluation strategy, inspired by the classic patch matching algorithms that find the best-matched patches between the left and right images. We use the Zero-Mean Normalized Cross Correlation (ZNCC) to measure the normalized similarities between these patches. A fully differentiable patch-based ZNCC cost function is implemented to guide the depth synthesis process towards more accurate results. Visual assessment shows that our approach can produce more accurate and robust depth estimations in both texture-rich and texture-less areas, because the matching field is enlarged from a pixel to a patch (see Figure 5). Empirical evaluations on the KITTI dataset demonstrate the effectiveness of our approach, which achieves state-of-the-art performance in the monocular depth estimation task.

Our second contribution is that we train a parallel DCNN to evaluate the performance of the monocular depth estimation and output a 0 to 1 confidence map. The parallel DCNN is also trained in a self-supervised manner thanks to our ZNCC similarity measurement function. As ZNCC is a normalized measure of similarity, which can be approximated as the confidence of the depth estimation, we take the ZNCC loss to self-supervise the parallel DCNN (ConfidenceNet) during training, so that we can estimate the confidence of the depth estimated by the first DCNN (DepthNet) in testing mode, as shown in Figure 1. A confidence map is extremely useful for a monocular depth estimation task trained in an unsupervised manner, as the learned epipolar constraint only works well when there are clear corresponding pixels between the image pairs; it will fail and produce uncertain depth when occlusion and specularity exist in the images. Our confidence map can give a basic assessment of the reliability of the predicted depth, which can then be integrated into many applications such as monocular dense reconstruction, SLAM-based depth fusion [7], and tasks where accuracy and confidence are crucial, such as monocular endoscopic surgery.

2 Related Work

2.0.1 Stereo Depth Estimation.

The problem of depth estimation from stereo images has been studied for a long time [8] [9]. With the theory of the epipolar constraint, estimating depth from stereo images can be regarded as a well-posed problem when occlusions and depth discontinuities are ignored. Many stereo vision algorithms have achieved results comparable to ground truth depth acquired from depth sensors [10] [11].

2.0.2 Monocular Depth Estimation.

In contrast, estimating depth from monocular images is an ill-posed problem that is inherently ambiguous [12], and many research efforts have been devoted to it. One of the classic methods is Shape from Shading (SFS) [13], which uses the gradual variation of shading as a cue to estimate shape and depth. However, SFS has strict prior assumptions of Lambertian reflectance, uniform color and texture, and a fixed light source direction, which do not hold for most real-world images. Saxena et al [14][15][16][17] used a Markov Random Field (MRF) incorporating multiscale image features to learn monocular cues in a supervised manner. However, the hand-crafted local features used in these approaches limit the expressive power of supervised learning, and lack the global contextual understanding of the scene needed for learning consistent depth.

2.0.3 DCNNs based Monocular Depth Learning.

More recently, DCNNs [12] [18] have been introduced to address the monocular depth estimation problem, and have pushed the state of the art forward in this area. Building on the success of this approach, several improvements have been made by incorporating probabilistic models such as Conditional Random Fields (CRFs) [19] [20] [21] [22], advanced network structures such as ResNet [23], two-streamed networks [24], multi-task joint training [25] [18] [26] [27], and novel loss functions such as sparse supervision [28], relative depth [29][30] and depth as classification [31]. Impressive as these works are, ground-truth depth data are still needed to supervise the training of these DCNNs.

2.0.4 Unsupervised Monocular Depth Learning.

Driven by DCNNs, view synthesis technology [32] has proven to be effective at synthesizing new views by sampling pixels from existing views [33] [34], which enables novel frameworks for unsupervised learning of monocular depth from stereo pairs, e.g., Deep3D [35] and Garg et al [4]. The works by Godard et al [5] and Zhou et al [6] advanced these networks by incorporating left-right consistency and pose estimation. However, a common weakness of these approaches is the use of a pixel-wise photometric loss (L1-norm) to construct the loss functions that guide the back-propagation process. Gradients are derived from the pixel intensity difference [6], which leads to ambiguous gradients in texture-less areas and in regions that mix thin structures with texture-less areas. Although multi-scale and smoothness loss functions are used to mitigate this issue [4] [5] [6], the results are still not desirable and the optimization is still likely to converge to local minima due to the ambiguous pixel-wise loss. As shown in Figure 5, for a common speed limit sign region from the KITTI dataset, the direct pixel-wise photometric loss leads to many local minima, as shown in the right curve chart. In contrast, the left curve chart shows that our proposed patch-based ZNCC loss is smoother and more likely to converge to the global minimum. The experimental result (the last row in Figure 5) shows that our proposed method can effectively generate accurate depth in complex regions.

2.0.5 Novelty Compared to Previous Work.

We propose a novel multi-scale patch-based cost function that adopts the ZNCC as a similarity function to explicitly enlarge the matching field and increase the matching robustness. From another point of view, our proposed patch-based cost function implicitly integrates the classic Patch Matching (PM) algorithm as a minimization problem in our loss function. Although Garg et al [4] have discussed the straightforward idea of using a stereo matching algorithm as a pre-processing step to generate "quasi ground-truth" depth for training, their result is not desirable due to the poor quality of the "quasi ground-truth". Recently, Luo et al [36] also proposed a similar framework that first uses a DCNN to synthesize stereo pairs from single images, and then uses stereo matching to get depth. In contrast to these two works, we treat stereo matching as a minimization problem and implement a fully differentiable PM algorithm as a cost function that is seamlessly integrated into our neural network. As the loss of the PM cost function can be passed through the whole network during backward propagation, our network can produce more robust and consistent depth through large-scale self-supervised training, and is not limited by the performance of off-the-shelf stereo matching algorithms.

Another novelty of our work is the confidence map. As monocular depth estimation itself is an ill-posed problem, although learning-based approaches achieve comparable results to stereo depth estimation, there are still many unavoidable mistakes in the predicted depth map. For the first time, our method is able to provide a pixel-wise confidence of the predicted depth by using a parallel DCNN to capture and learn the confidence during training. The confidence map will greatly improve the usability of deploying monocular depth estimation into many practical tasks.

3 Method

3.1 Framework Overview

Figure 2: Framework for proposed self-supervised monocular depth learning and confidence estimating networks.

Figure 2 illustrates the entire framework for our self-supervised monocular depth learning and confidence estimation networks. Since ground-truth depth is absent for supervised training, we treat monocular depth estimation as a problem of image synthesis error minimization during training. Specifically, during training, we use the left images of the stereo pairs to synthesize per-pixel depth using an encoder-decoder network (DepthNet), and the depth is converted into disparity maps by Equation 2. The disparity map is then used to guide the stereo view reconstruction and the sampling of patches. After that, the loss function is calculated based on the Patch Matching Loss, View Reconstruction Loss, Disparity Smoothness Loss, and Disparity Consistency Loss. As these processes are differentiable, back propagation can be used to update the parameters of our depth learning network to minimize the total loss.

Since our patch-based ZNCC loss map represents, at each pixel, the normalized inverted similarity between the two views of the stereo pair, it can be approximated as the inverted confidence of the depth estimation result. We use this loss map to self-supervise the training of a second encoder-decoder network, ConfidenceNet, to generate the confidence of the per-pixel depth estimation of our DepthNet.

3.2 Depth Synthesis Network

Figure 3: Depth synthesis network structure. "k" is the kernel size, "s" the stride, and "c" the channel number. For simplicity, we do not draw the conv layers after each conv and deconv layer, which have the same kernel and channel size as the previous layers but with stride 1.

The core part of our framework is the depth synthesis and generation. Our goal is to learn an implicit function that estimates per-pixel depth from a single input image. Inspired by the architectures of FlowNet [37], DispNet [38] and the networks of Godard et al [5] and Zhou et al [6], we employ a VGG-like fully convolutional neural network architecture [39] to generate per-pixel depth from a single image. Our encoder-decoder model is illustrated in Figure 3. The input image is encoded by 7 conv layers with stride 2, each followed by a conv layer with stride 1, which efficiently compress the input image into a feature tensor at a fraction of the original size with 512 channels. Then, the feature tensor is up-sampled by 7 deconv layers with stride 2, each followed by a conv layer with stride 1, which decode the feature tensor into a depth map at the full original size. Following the method in [37], 6 skip connections are implemented to preserve high-level information and ensure high-quality per-pixel prediction after up-sampling. Multi-scale depth images are output and used in further steps to constrain the network to a coarse-to-fine up-sampling.
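As a rough check of how small the encoded feature tensor becomes, the sketch below halves the spatial size once per stride-2 layer. It assumes 'same' padding, so that each stride-2 layer maps a side length n to ceil(n / 2), and it uses the 512x256 input size given in Section 4.1. The helper `encoder_output_size` is hypothetical and is not taken from the authors' code.

```python
def encoder_output_size(width, height, num_stride2_layers=7):
    """Spatial size after repeated stride-2 convolutions.

    Assumes 'same' padding, so each stride-2 layer maps n -> ceil(n / 2).
    The stride-1 conv layers in between do not change the spatial size.
    """
    for _ in range(num_stride2_layers):
        width = (width + 1) // 2    # integer form of ceil(width / 2)
        height = (height + 1) // 2  # integer form of ceil(height / 2)
    return width, height


if __name__ == "__main__":
    # 512x256 input (Section 4.1), 7 stride-2 layers
    print(encoder_output_size(512, 256))  # -> (4, 2)
```

Under these assumptions each side shrinks by a factor of 2^7 = 128, which is why the decoder needs the same number of stride-2 deconv layers to return to full resolution.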

3.3 Warping-based Stereo View Reconstruction

Figure 4: The difference between (a) forward mapping and (b) backward mapping.

View warping is an enabling technology for self-supervised learning frameworks [4] [5] [6]. Given the per-pixel disparity map estimated from a single image in the previous step, the target view of the stereo pair can be reconstructed through the epipolar relationship in stereo vision. According to the epipolar constraint, the projection of a pixel on the right camera plane must lie on the corresponding epipolar line. For the calibrated stereo pairs discussed in this paper, corresponding pixels in the left and right images must be in the same row $y$, and the disparity $d$ describes the horizontal displacement between them. Through stereo triangulation, we get

$$d(x, y) = \frac{b f}{D(x, y)}$$

where $D(x, y)$ is the depth estimated at the pixel $(x, y)$, and $b$ and $f$ are the camera baseline and focal length. By this relationship, the target view in a stereo pair can be reconstructed given the source view and the corresponding depth (estimated through our depth synthesis network).

However, direct mapping from the known view to the other view (forward mapping) leaves holes in the target image, which are not differentiable. Therefore, we use inverse (backward) mapping: for each pixel in the target view, we pick the corresponding point from the source view, guided by the estimated disparity. Thus, a complete and differentiable target view can be generated. Bilinear sampling [40] is then used to get the interpolated pixel value from the source view.
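The backward mapping described above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' TensorFlow implementation: it assumes grayscale images, rectified pairs (so sampling is purely horizontal and bilinear interpolation reduces to linear interpolation along each row), border clamping, and the sign convention "target pixel x samples the source at x + disparity". The function name `backward_warp` is hypothetical.

```python
import numpy as np


def backward_warp(source, disparity):
    """Reconstruct a target view by sampling `source` for every target pixel.

    source:    (H, W) grayscale image
    disparity: (H, W) horizontal offsets in pixels, defined on the target grid
    Each target pixel (x, y) reads the source at (x + disparity, y), using
    linear interpolation between the two nearest source columns.
    """
    h, w = source.shape
    xs = np.arange(w, dtype=np.float64)[None, :] + disparity  # sampling positions
    xs = np.clip(xs, 0.0, w - 1.0)                            # clamp to the image border
    x0 = np.floor(xs).astype(int)                             # left neighbour column
    x1 = np.minimum(x0 + 1, w - 1)                            # right neighbour column
    frac = xs - x0                                            # weight of the right neighbour
    rows = np.arange(h)[:, None]                              # row index for every pixel
    return (1.0 - frac) * source[rows, x0] + frac * source[rows, x1]


if __name__ == "__main__":
    ramp = np.tile(np.arange(8.0), (2, 1))        # intensity equals the column index
    warped = backward_warp(ramp, np.full((2, 8), 1.5))
    print(warped[0])
```

Because every target pixel receives a value, no holes appear, and the output varies smoothly with the disparity between integer positions, which is what allows gradients to flow back to the disparity estimate.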

3.4 Disparity-guided Patch Sampling

Inspired by the stereo view reconstruction described above, we propose a novel patch sampling process guided by the disparity estimated by our DepthNet. A patch is defined by its window size and the coordinate of its center pixel. We sample a patch around each pixel in the left image, and the corresponding patch in the right image shifted by the disparity value of that pixel. According to Equation 2, if the estimated disparity is correct, the two patches should match, and this relationship is used to construct the patch matching loss. The sampled patches are computed and stored in vectorized form so that the computation can be run in parallel on the GPU.

The patch sampling size is very important and can affect the final performance of the similarity measurement. However, there is no single optimal patch size, and the performance varies greatly across different images and local details. When a small patch size is used, little information is captured and the robustness of the similarity comparison decreases. When a large patch size is used, the computational complexity increases greatly and accurate depth cannot be recovered at stereo occlusions and depth discontinuities. Therefore, we use a multi-scale patch sampling scheme and sample a combination of 4 different patch sizes in an image to fully exploit the effects of different patch sizes. We discuss the choice of patch sizes in Section 4.1.

3.5 Loss Function Construction

We define a loss function with multiple terms to effectively train our networks for accurate, smooth and realistic depth:

$$L = \lambda_{pm} L_{pm} + \lambda_{vr} L_{vr} + \lambda_{ds} L_{ds} + \lambda_{dc} L_{dc}$$

where the terms from left to right are the Patch Matching Loss, View Reconstruction Loss, Disparity Smoothness Loss and Disparity Consistency Loss, and the $\lambda$ factors are the corresponding weights that balance the effects of the back-propagated gradients. Each loss function is explained in detail below:

3.5.1 Patch Matching Loss.

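Inspired by patch matching algorithms, which find the best-matched patches in the left and right images to obtain correct disparities, we propose a patch matching loss that maximizes the similarity (minimizes the difference) between patches in the left image and the disparity-shifted patches in the right image. Here, the ZNCC is used to compute a normalized similarity between a left patch $P^l$ and its shifted right counterpart $P^r$:

$$\mathrm{ZNCC}(P^l, P^r) = \frac{\sum_{i,j} \left(P^l_{ij} - \bar{P}^l\right)\left(P^r_{ij} - \bar{P}^r\right)}{\sqrt{\sum_{i,j} \left(P^l_{ij} - \bar{P}^l\right)^2 \, \sum_{i,j} \left(P^r_{ij} - \bar{P}^r\right)^2}}$$

where $\bar{P}$ is the mean intensity of the patch centered at the coordinate $(x, y)$, and the sums run over the pixels of the patch.

The ZNCC returns a similarity ranging from $-1$ to $1$. We first normalize it into $[0, 1]$ and then invert it to get the patch matching loss:

$$L_{pm} = 1 - \frac{\mathrm{ZNCC}(P^l, P^r) + 1}{2}$$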
Our patch matching loss is computed at all 4 patch sizes to cover both small structures and large areas. There are several advantages of using our patch-based ZNCC loss to regularize the depth synthesis:

(1) Our patch matching loss measures similarity over patches, which cover larger regions than the direct pixel-wise photometric loss used in previous work; this is more robust and can achieve sub-pixel accuracy. Figure 5 demonstrates the effect of our patch-based ZNCC loss. We charted the values of our patch-based ZNCC loss and of the photometric loss against the disparity value of a pixel located at the center of the image patch "6". With our proposed patch-based ZNCC loss, the loss curve is smoother and more likely to converge to the global minimum, whereas the direct pixel-wise photometric loss leads to many local minima, as shown in the right curve chart.

Figure 5: Comparison of our proposed patch-based ZNCC loss with the photometric loss used in previous works.

(2) Compared to other similarity measures such as absolute intensity difference (AD), Census, and Normalized Cross Correlation (NCC), ZNCC is especially robust against Gaussian noise and variation between the compared patches, which can help to recover more accurate depth in our self-supervised framework.

(3) As a zero-mean normalized similarity measurement function, our patch-based ZNCC loss provides a similarity value ranging from -1 to 1. After being normalized to the range 0 to 1 as shown in Equation 5, it can be regarded as the confidence of the generated depth at each pixel, which can be further used to self-supervise the training of our confidence network.
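The patch matching loss of this subsection can be illustrated for a single pair of patches with the NumPy sketch below. It is a simplified stand-in rather than the authors' vectorized GPU implementation: the function names and the small `eps` guard against division by zero in texture-less patches are assumptions added for illustration.

```python
import numpy as np


def zncc(patch_a, patch_b, eps=1e-8):
    """Zero-Mean Normalized Cross Correlation of two equally sized patches.

    Returns a value in [-1, 1]. `eps` (an assumption, not from the paper)
    avoids division by zero when a patch has no intensity variation.
    """
    a = patch_a - patch_a.mean()   # remove the mean intensity of each patch
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + eps
    return float((a * b).sum() / denom)


def patch_matching_loss(patch_a, patch_b):
    """Normalize ZNCC from [-1, 1] to [0, 1], then invert it into a loss."""
    return 1.0 - (zncc(patch_a, patch_b) + 1.0) / 2.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = rng.random((7, 7))
    # A gain and offset change leaves ZNCC (almost exactly) at 1, so the loss is ~0.
    print(patch_matching_loss(left, 2.0 * left + 3.0))
    # An inverted patch gives ZNCC ~ -1, so the loss is ~1.
    print(patch_matching_loss(left, -left))
```

The first call shows why the zero-mean normalization matters: a brightness or contrast difference between the two views does not change the similarity, while one minus the loss can be read directly as a 0 to 1 confidence.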

3.5.2 View Reconstruction Loss.

We use the view reconstruction loss as a second supervision on the depth synthesis. Guided by the synthesized depth, the right views can be reconstructed by collecting pixels from the left images. The view reconstruction loss is defined as the L1 loss between the reconstructed view $\tilde{I}$ and the original view $I$:

$$L_{vr} = \left\lVert \tilde{I} - I \right\rVert_1$$
Compared to the patch matching loss, the view reconstruction L1 loss is more sensitive to small structures and depth discontinuities and can provide more detailed depth information.

3.5.3 Disparity Smoothness Loss.

We use a disparity smoothness term to regularize our network to produce smoother depth. Similar to [4] [5] [6], we use the sum of the L1 norm of the disparity gradients along the x and y directions as a smoothness factor. Edge-aware terms are used to reduce the penalty on edges, where depth discontinuities usually occur, which prevents over-smoothing.
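One common edge-aware form of such a smoothness term, as used by Godard et al [5], is written below. It is a sketch under assumed notation ($d$ for the disparity map, $I$ for the corresponding image, $N$ for the number of pixels) and with an assumed exponential edge weighting; it is not necessarily the exact expression used in this work.

```latex
L_{ds} = \frac{1}{N} \sum_{x,y}
    \left| \partial_x d(x,y) \right| e^{-\left| \partial_x I(x,y) \right|}
  + \left| \partial_y d(x,y) \right| e^{-\left| \partial_y I(x,y) \right|}
```

Where the image gradient is large, the exponential factor approaches zero, so a disparity jump at an image edge is penalized only weakly.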


3.5.4 Disparity Consistency Loss.

The left-right disparity consistency loss proposed in [5] has achieved a great improvement for monocular depth generation. Here, we adopt this loss function in our framework. Both the left and right image disparities are generated, and the difference between the left disparity map and the left disparity map reconstructed from the right disparity is computed and minimized. This loss ensures that the left and right disparities are coherent.
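In the formulation of Godard et al [5], this consistency term takes the following form. The notation ($d^l$ and $d^r$ for the left and right disparity maps, $N$ for the number of pixels) and the sign of the horizontal shift are assumptions that depend on the disparity convention, so this should be read as a sketch rather than the exact expression used here.

```latex
L_{dc} = \frac{1}{N} \sum_{x,y}
    \left| d^{l}(x,y) - d^{r}\!\left(x + d^{l}(x,y),\, y\right) \right|
```

The second term is the right disparity map sampled at the location that the left disparity points to, which is the same backward mapping used for view reconstruction.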


3.6 Confidence Estimation Network

One of the advantages of our proposed patch matching loss is that a normalized similarity measurement can be generated for each pixel at training time. With the well-known epipolar constraint, the per-pixel confidence of the estimated depth can be approximated as the normalized similarity measurement between the left patches and the corresponding patches in the right image.


Here, we propose to use another encoder-decoder network to learn the confidence map generated by our depth estimation network during training, so that the confidence map can be preserved and generated at test time. We tried to train confidence and depth in one network as in [25] [18] [26] [27], but the multi-task training reduced the depth estimation performance. Therefore, we use a parallel encoder-decoder network to learn the confidence, supervised by the per-pixel ZNCC loss of our depth estimation network. The loss of our ConfidenceNet is shown below:

$$L_{conf} = \left\lVert C - \left(1 - \mathrm{static}(L_{pm})\right) \right\rVert_1$$

where $C$ is the generated confidence map and $L_{pm}$ is the patch matching loss from our depth estimation network described in the sections above. The static copy $\mathrm{static}(\cdot)$ is used here to prevent the gradients from propagating back to the depth estimation network. The operation $1 - (\cdot)$ inverts the loss into a confidence, and the L1 loss is used to assess the confidence estimation error.
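The ConfidenceNet objective described above can be sketched numerically as follows. NumPy has no automatic differentiation, so the static copy is represented by an explicit array copy; in an autodiff framework it would correspond to a stop-gradient operation. The function name and the mean reduction over pixels are assumptions for illustration.

```python
import numpy as np


def confidence_loss(pred_confidence, patch_loss_map):
    """L1 error between the predicted confidence and the inverted patch loss.

    pred_confidence: (H, W) ConfidenceNet output in [0, 1]
    patch_loss_map:  (H, W) per-pixel patch matching loss in [0, 1]
    """
    # Static copy: the target is a constant, so in an autodiff framework
    # no gradient would flow from this loss back into the depth network.
    target = 1.0 - np.array(patch_loss_map, copy=True)
    # Mean over pixels is an assumed reduction.
    return float(np.abs(pred_confidence - target).mean())


if __name__ == "__main__":
    loss_map = np.array([[0.1, 0.4]])
    print(confidence_loss(np.array([[0.9, 0.6]]), loss_map))  # perfect prediction
    print(confidence_loss(np.array([[1.0, 1.0]]), loss_map))  # over-confident prediction
```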

Instead of using the same encoder-decoder network structure as our DepthNet, we employ a simpler structure that uses only the first 5 conv layers and the last 5 deconv layers of the network described in Figure 3, without skip connections, for two reasons:

(1) To reduce memory usage and training time, as training two neural networks at the same time is very computationally expensive. The second network can be replaced by a deeper and more complex encoder-decoder network to produce sharper and more accurate confidence, but the main purpose of our work is to prove that our self-supervised monocular depth learning and confidence estimation framework is feasible and helpful for depth prediction, hence we choose to use a simple network structure as the proof of concept.

(2) We intend to use a simpler network with fewer weights to prevent over-fitting to noise and to learn a more generic confidence: high confidence in texture-rich areas and low confidence in texture-less, blurry and occluded areas, which is what this confidence network is designed for.

4 Experiments

In this section, we evaluate our framework and compare the results with prior approaches both quantitatively and qualitatively on the KITTI dataset. We use the rectified stereo image pairs for training our networks. At test time, we use the left image to generate depth, and the corresponding sparse LIDAR data serves as the ground truth for benchmarking.

4.1 Implementation Details.

Our networks are implemented in TensorFlow and trained on a workstation with a single Nvidia Titan X GPU (12 GB memory). Our models take around 60 hours to train for 50 epochs. In testing mode, our networks can output depth and confidence maps at around 20 frames per second.

Hyper Parameters. All input images are scaled to 512x256, and a batch size of 4 is used. The Adam optimizer is used with an initial learning rate that decays after half of the training process. The total loss function for the depth estimation network is a weighted combination of the four loss terms.

Data Augmentation. The same data augmentation approach in [5] is used to randomly flip the image and change the gamma, brightness, and color shifts to increase the network robustness and prevent over-fitting.

Multi-scale Implementation. We employ a multi-scale strategy to ensure a coarse-to-fine up-sampling. As can be seen from Figure 3, depth is output at 4 scales, the largest being full resolution. All of our loss functions are computed for each of these 4 scales, and for each of the left and right images/disparities. We take the mean of these loss functions as the final loss.

Patch Size. By applying different patch sizes on different image scales, we can get very large equivalent patch sizes with less computation. The patch sizes for our patch-based ZNCC loss on the 4 scales are chosen based on our empirical tests, and each corresponds to a larger equivalent window on the full-resolution images.

4.2 KITTI dataset.

To be able to compare with the state-of-the-art monocular depth learning approaches, we trained and evaluated our networks using two different train/test splits: Godard and Eigen.

Godard Split. We use the same train/test sets that Godard et al [5] proposed in their work. 200 high-quality disparity images in 28 scenes provided by the official KITTI training set serve as the ground truth for benchmarking. Of the remaining 33 scenes, with a total of 30,159 images, 29,000 images are picked for training and the remaining 1,159 images for testing.

Eigen Split. For a fair comparison with more previous works, we also use the test split proposed by Eigen et al [12], which has been widely evaluated by the works of Garg et al [4], Liu et al [21], Zhou et al [6] and Godard et al [5]. This test split contains 697 images of 29 scenes. The remaining 32 scenes contain 23,488 images, of which 22,600 are used for training and the remainder for testing, similar to [4] and [5].

4.3 Results

4.3.1 Quantitative Evaluation.

The evaluation results on the KITTI dataset are reported in Table 1. We use different combinations of train/test splits (E for Eigen, G for Godard) and cap distances (80m and 50m) to compare with different works. For Eigen et al [12], Liu et al [21], Zhou et al [6] and Godard et al [5], the Eigen split with an 80m cap distance is used. For Garg et al [4], Zhou et al [6] and Godard et al [5], the Eigen split with a 50m cap distance is used. We also report our result on the Godard split with an 80m cap. The results show that our method outperforms all compared methods and produces state-of-the-art results for the monocular depth estimation problem on the KITTI dataset.

Method            Supervision  Split  Cap (m)  AbsRel  SqRel  RMSE   RMSElog  D1-all  δ<1.25  δ<1.25²  δ<1.25³
Eigen et al [12]  Yes          E      80       0.203   1.548  6.307  0.282    -       0.702   0.890    0.958
Liu et al [21]    Yes          E      80       0.201   1.584  6.471  0.273    -       0.680   0.898    0.967
Zhou et al [6]    No           E      80       0.208   1.768  6.856  0.283    -       0.678   0.885    0.957
Godard et al [5]  No           E      80       0.148   1.344  5.927  0.247    -       0.803   0.922    0.964
Ours              No           E      80       0.145   1.267  5.786  0.244    -       0.811   0.925    0.965
Garg et al [4]    No           E      50       0.169   1.080  5.104  0.273    -       0.740   0.904    0.962
Zhou et al [6]    No           E      50       0.201   1.391  5.181  0.264    -       0.696   0.900    0.966
Godard et al [5]  No           E      50       0.140   0.976  4.471  0.232    -       0.818   0.931    0.969
Ours              No           E      50       0.138   0.937  4.399  0.231    -       0.825   0.933    0.969
Godard et al [5]  No           G      80       0.124   1.388  6.125  0.217    30.272  0.841   0.936    0.975
Ours              No           G      80       0.117   1.202  5.953  0.210    29.612  0.845   0.938    0.976

Error metrics (lower is better): AbsRel, SqRel, RMSE, RMSElog, D1-all. Accuracy metrics (higher is better): δ<1.25, δ<1.25², δ<1.25³.
Table 1: Comparison with state-of-the-art methods on the KITTI dataset.
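For readers unfamiliar with the column names in Table 1, the sketch below computes the standard error and accuracy metrics introduced by Eigen et al [12] on a sparse ground-truth map. The masking and clipping rules (zero marks a missing LIDAR value, predictions clipped to the cap distance) are assumptions that vary slightly between published evaluation scripts; D1-all is omitted, and `depth_metrics` is a hypothetical name.

```python
import numpy as np


def depth_metrics(pred, gt, cap=80.0, min_depth=1e-3):
    """Standard monocular depth metrics on valid ground-truth pixels.

    pred, gt: arrays of the same shape, depth in metres.
    Pixels with gt == 0 (no LIDAR return, assumed) or gt > cap are ignored.
    """
    valid = (gt > 0) & (gt <= cap)
    pred = np.clip(pred[valid], min_depth, cap)  # keep predictions in range
    gt = gt[valid]
    ratio = np.maximum(pred / gt, gt / pred)     # symmetric relative error
    return {
        "AbsRel": float(np.mean(np.abs(pred - gt) / gt)),
        "SqRel": float(np.mean((pred - gt) ** 2 / gt)),
        "RMSE": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "RMSElog": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "delta<1.25": float(np.mean(ratio < 1.25)),
        "delta<1.25^2": float(np.mean(ratio < 1.25 ** 2)),
        "delta<1.25^3": float(np.mean(ratio < 1.25 ** 3)),
    }


if __name__ == "__main__":
    gt = np.array([10.0, 20.0, 0.0, 100.0])   # 0 = missing, 100 m = beyond the cap
    pred = np.array([11.0, 20.0, 5.0, 90.0])
    for name, value in depth_metrics(pred, gt).items():
        print(f"{name}: {value:.4f}")
```

In this toy example only the first two pixels are evaluated, which mirrors how the 80m and 50m caps in Table 1 restrict the evaluation to a distance range.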

4.3.2 Qualitative Evaluation.

The qualitative comparison with some of the related methods on the KITTI dataset is shown in Figure 6. As our network structure is similar to that of Godard et al [5], both generate clearer and more accurate depth than the other works. We also provide a detailed comparison with the results of Godard et al [5] in the lower part of Figure 6. Our network can generate more accurate depth in complex regions with thin structures and texture-less areas, such as pillars and traffic signs. This supports the analysis illustrated in Figure 5: our patch-based loss function is more robust and converges more easily to the global minimum in complex regions.

Figure 6: Upper part: comparison of monocular depth estimation on the KITTI dataset between Garg et al [4], Zhou et al [6], Godard et al [5], and ours (columns: Input, Ground-truth, Garg et al [4], Zhou et al [6], Godard et al [5], Ours). Lower part: comparison of details with Godard et al [5]. All of the results are generated using the authors' provided pre-trained models. The ground-truth depth map is interpolated from the sparse point map for visualization only.

4.3.3 Confidence Map Evaluation.

We show the confidence estimation results in Figure 7. A colorbar from red to yellow is used to represent 0 to 1. We can see that the estimated confidence closely follows the inverted ZNCC loss but is less noisy, due to the small network we use to prevent over-fitting. The confidence overlaid on the input image shows that our ConfidenceNet has learned to generate confidence from contextual information. For example, in texture-less areas (sky, buildings), dark areas (trees in shadow), occluded areas (around thin structures) and reflective areas (car windows), the estimated confidence is usually very low, while texture-rich areas and edges usually have high confidence.

Figure 7: Confidence estimation results (columns: Input, Est. Depth, ZNCC Loss, Est. Confidence, Overlay). A colorbar from red to yellow is used to represent 0 to 1.

5 Discussion

In this paper, we have presented a novel self-supervised framework for monocular depth learning and confidence estimation. We incorporate the patch matching theory into a fully differentiable DCNN and achieve self-supervised training of both depth and the confidence of depth. Our proposed loss function exploits the epipolar constraint of stereo vision and also provides a normalized similarity that is further used to supervise the confidence estimation. Our method not only outperforms the state-of-the-art results on the KITTI benchmark evaluation, but is also, for the first time, able to simultaneously generate depth from monocular images and estimate the confidence of the generated depth. This is a step change for monocular depth estimation, as it significantly increases the feasibility of using monocular depth estimation in many practical applications such as autonomous driving and monocular endoscopic surgery, where the accuracy of the estimated depth is crucial.

Why Does Our ConfidenceNet Work? Since our ConfidenceNet is supervised by the per-pixel ZNCC loss of our depth estimation network, it explicitly learns the regions where our depth estimation network performs well and where it performs badly. On a deeper level, our ConfidenceNet implicitly learns the inherent defect of the patch matching algorithm: it fails on texture-less regions and performs badly near stereo view occlusions, reflections and blurred areas. Therefore, after sufficient training steps, our ConfidenceNet can estimate the confidence of our DepthNet, although they are two different networks.

Future Work. We will continue optimizing our model and explore the possibility of using an adaptive window size for patch sampling to decrease the training time and increase accuracy on small structures.


  • [1] Dunkin, B.J., Flowers, C.: 3d in the minimally invasive surgery (mis) operating room: Cameras and displays in the evolution of mis. In: Imaging and Visualization in The Modern Operating Room. Springer (2015) 145–155
  • [2] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., eds.: Advances in Neural Information Processing Systems 25. Curran Associates, Inc. (2012) 1097–1105
  • [3] Zhang, Z.: Determining the epipolar geometry and its uncertainty: A review. International Journal of Computer Vision 27(2) (Mar 1998) 161–195
  • [4] Garg, R., B.G., V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In Leibe, B., Matas, J., Sebe, N., Welling, M., eds.: Computer Vision – ECCV 2016, Cham, Springer International Publishing (2016) 740–756
  • [5] Godard, C., Aodha, O.M., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency.

    In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017) 6602–6611

  • [6] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017) 6612–6619
  • [7] Tateno, K., Tombari, F., Laina, I., Navab, N.: Cnn-slam: Real-time dense monocular slam with learned depth prediction. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017) 6565–6574
  • [8] Barnard, S.T., Fischler, M.A.: Computational stereo. ACM Comput. Surv. 14(4) (December 1982) 553–572
  • [9] Scharstein, D., Szeliski, R., Zabih, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In: Proceedings IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001). (2001) 131–140
  • [10] Hirschmuller, H.: Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(2) (February 2008) 328–341
  • [11] Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P.: End-to-end learning of geometry and context for deep stereo regression. In: 2017 IEEE International Conference on Computer Vision (ICCV). (October 2017) 66–75
  • [12] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS’14, Cambridge, MA, USA, MIT Press (2014) 2366–2374
  • [13] Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape-from-shading: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(8) (Aug 1999) 690–706
  • [14] Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In Weiss, Y., Schölkopf, B., Platt, J.C., eds.: Advances in Neural Information Processing Systems 18. MIT Press (2006) 1161–1168
  • [15] Saxena, A., Schulte, J., Ng, A.Y.: Depth estimation using monocular and stereo cues. In: Proceedings of the 20th International Joint Conference on Artifical Intelligence. IJCAI’07, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (2007) 2197–2203
  • [16] Saxena, A., Chung, S.H., Ng, A.Y.: 3-d depth reconstruction from a single still image. International Journal of Computer Vision 76(1) (Jan 2008) 53–69
  • [17] Saxena, A., Sun, M., Ng, A.Y.: Make3d: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5) (May 2009) 824–840
  • [18] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: 2015 IEEE International Conference on Computer Vision (ICCV). (December 2015) 2650–2658
  • [19] Li, B., Shen, C., Dai, Y., van den Hengel, A., He, M.:

    Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs.

    In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2015) 1119–1127
  • [20] Liu, M., Salzmann, M., He, X.: Discrete-continuous depth estimation from a single image. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2014) 716–723
  • [21] Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10) (October 2016) 2024–2039
  • [22] Xu, D., Ricci, E., Ouyang, W., Wang, X., Sebe, N.: Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017) 161–169
  • [23] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 3D Vision (3DV), 2016 Fourth International Conference on. (October 2016) 239–248
  • [24] Li, J., Klein, R., Yao, A.: A two-streamed network for estimating fine-scaled depth maps from single rgb images. In: 2017 IEEE International Conference on Computer Vision (ICCV). (October 2017) 3392–3400
  • [25] Ladický, L., Shi, J., Pollefeys, M.: Pulling things out of perspective. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition(CVPR). CVPR ’14, Washington, DC, USA, IEEE Computer Society (2014) 89–96
  • [26] Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A.: Towards unified depth and semantic prediction from a single image. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2015) 2800–2809
  • [27] Mousavian, A., Pirsiavash, H., Košecká, J.: Joint semantic segmentation and depth estimation with deep convolutional networks. In: 3D Vision (3DV), 2016 Fourth International Conference on, IEEE (2016) 611–619
  • [28] Kuznietsov, Y., Stückler, J., Leibe, B.:

    Semi-supervised deep learning for monocular depth map prediction.

    In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017) 2215–2223
  • [29] Zoran, D., Isola, P., Krishnan, D., Freeman, W.T.: Learning ordinal relationships for mid-level vision. In: 2015 IEEE International Conference on Computer Vision (ICCV). (Dec 2015) 388–396
  • [30] Chen, W., Fu, Z., Yang, D., Deng, J.: Single-image depth perception in the wild. In Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., eds.: Advances in Neural Information Processing Systems 29. Curran Associates, Inc. (2016) 730–738
  • [31] Cao, Y., Wu, Z., Shen, C.: Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology PP(99) (2017)  1
  • [32] Fitzgibbon, A., Wexler, Y., Zisserman, A.: Image-based rendering using image-based priors. In: 2003 IEEE International Conference on Computer Vision (ICCV). (Oct 2003) 1176–1183 vol.2
  • [33] Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A.: View synthesis by appearance flow. In: European Conference on Computer Vision. (2016)
  • [34] Flynn, J., Neulander, I., Philbin, J., Snavely, N.: Deep stereo: Learning to predict new views from the world’s imagery. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2016) 5515–5524
  • [35] Xie, J., Girshick, R., Farhadi, A.: Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In Leibe, B., Matas, J., Sebe, N., Welling, M., eds.: Computer Vision – ECCV 2016, Cham, Springer International Publishing (2016) 842–857
  • [36] Luo, Y., Ren, J., Lin, M., Pang, J., Sun, W., Li, H., Lin, L.: Single view stereo matching. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018)
  • [37] Dosovitskiy, A., Springenberg, J.T., Tatarchenko, M., Brox, T.: Learning to generate chairs, tables and cars with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4) (April 2017) 692–705
  • [38] Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2016) 4040–4048
  • [39] Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4) (April 2017) 640–651
  • [40] Jaderberg, M., Simonyan, K., Zisserman, A., kavukcuoglu, k.: Spatial transformer networks. In Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., eds.: Advances in Neural Information Processing Systems 28. Curran Associates, Inc. (2015) 2017–2025