Convolutional Neural Networks (CNNs) need large amounts of data with ground-truth annotation, a requirement that has limited the development and fast deployment of CNNs for many computer vision tasks. We propose a novel framework for self-supervised depth estimation from monocular images, together with a corresponding confidence estimate. A fully differentiable patch-based cost function is proposed that uses the Zero-Mean Normalized Cross Correlation (ZNCC) and takes multi-scale patches as a matching strategy. This approach greatly increases the accuracy and robustness of the depth learning. In addition, the proposed patch-based cost function can provide a 0 to 1 confidence, which is then used to supervise the training of a parallel network for confidence map learning and estimation. Evaluation on the KITTI dataset shows that our method outperforms the state-of-the-art results.
The human vision system is amazingly complex and extremely delicate. It can perceive depth through stereopsis, which relies on the displacement of the same object between the images received by the left and right retinas. With extensive visual experience and through trial and error, humans develop the ability to use contextual depth cues to achieve good and reliable perception of depth and a better understanding of spatial structure. Some of these depth cues do not rely on stereopsis, such as object occlusion, perspective, familiar and relative size, depth from motion, lighting and shading. Therefore, if blind in one eye or if performing a monocular task such as endoscopic surgery, we can still judge distance from these many different intuitive depth cues. In contrast, it is hard for machine vision to infer the non-stereopsis depth cues. With the recent development of Deep Convolutional Neural Networks (DCNNs), machines can solve many computer vision problems when provided with very large human-annotated datasets such as ImageNet, which is known as supervised learning. However, acquiring labelled datasets is one of the biggest challenges for supervised learning, as it is an expensive, time-consuming and labour-intensive task.
In this paper, we propose a novel self-supervised computational framework that mimics the process of how a human learns various contextual depth cues from stereopsis. We train a DCNN to synthesize depth from one view of the stereo image pair, then reconstruct the other view from the synthesized depth, and finally use the stereo vision epipolar constraint to minimize the error of the depth synthesis.
Our approach does not require ground-truth depth for supervised training. Instead, we derive the implicit function for estimating depth from monocular images through the epipolar constraint of the stereo image pair. Therefore, the method can be regarded as self-supervised learning. Compared with previous work addressing the same problem, we incorporate a patch-based image evaluation strategy, inspired by the classic patch matching algorithms for finding the best-matched patches between the left and right images. We use the Zero-Mean Normalized Cross Correlation (ZNCC) to measure the normalized similarities between these patches. A fully differentiable patch-based ZNCC cost function is implemented to guide the depth synthesis process toward more accurate results. Visual assessment shows that our approach can produce more accurate and robust depth estimations in both texture-rich and texture-less areas, because the matching field is enlarged from a pixel to a patch (see Figure 5). Empirical evaluations on the KITTI dataset demonstrate the effectiveness of our approach, which achieves state-of-the-art performance in the monocular depth estimation task.
Our second contribution is that we train a parallel DCNN to evaluate the performance of the monocular depth estimation and output a 0 to 1 confidence map. The parallel DCNN is also trained in a self-supervised manner thanks to our ZNCC similarity measurement function. As ZNCC is a normalized measure of similarity, which can be approximated as the confidence of the depth estimation, we use the ZNCC loss to self-supervise the parallel DCNN (ConfidenceNet) during training, so that we can estimate the confidence of the depth produced by the first DCNN (DepthNet) at test time, as shown in Figure 1. A confidence map is extremely useful for a monocular depth estimation task trained in an unsupervised manner, as the learned epipolar constraint only works well when there are clear corresponding pixels between the image pairs; it fails and produces uncertain depth when occlusion and specularity exist in the images. Our confidence map gives a basic assessment of the reliability of the predicted depth, which can then be integrated into many applications such as monocular dense reconstruction, SLAM-based depth fusion, and tasks in which accuracy and confidence are crucial, such as monocular endoscopic surgery.
The problem of depth estimation from stereo images has been well studied for a long time. With the theory of the epipolar constraint, estimating depth from stereo images can be regarded as a well-posed problem when occlusions and depth discontinuities are ignored. Many stereo vision algorithms achieve results comparable to ground-truth depth acquired from depth sensors.
In contrast, estimating depth from monocular images is an ill-posed problem that is inherently ambiguous, and many research efforts have been devoted to it. One of the classic methods is Shape from Shading (SFS), which uses the gradual variation of shading as a cue to estimate shape and depth. However, SFS relies on strict prior assumptions of Lambertian reflectance, uniform color and texture, and a fixed light source direction, which do not hold for most real-world images. Saxena et al used a Markov Random Field (MRF) combined with multi-scale image features to learn monocular cues in a supervised manner. However, the hand-crafted local features used in these approaches limit the expressive power of supervised learning and lack the global contextual understanding of the scene needed to learn consistent depth.
More recently, DCNNs have been introduced to solve the monocular depth estimation problem and have pushed the state of the art forward in this area. Building on the success of this approach, several improvements have been made by incorporating probabilistic models such as Conditional Random Fields (CRFs), advanced network structures such as ResNet, two-streamed networks, multi-task joint training, and novel loss functions such as sparse supervision, relative depth and depth as classification. Impressive as these works are, ground-truth depth data are still needed to supervise the training of these DCNNs.
More recent work has enabled novel frameworks for unsupervised learning of monocular depth from stereo pairs, e.g., Deep3D and Garg et al. The works by Godard et al and Zhou et al advanced these networks by incorporating left-right consistency and pose estimation. However, a common weakness of these approaches is the use of a pixel-wise photometric loss (L1 norm) to construct the loss functions that guide the back-propagation process. Gradients are derived from the pixel intensity difference, which leads to ambiguous gradients in texture-less areas and in regions that mix thin structures with texture-less areas. Although multi-scale and smoothness loss functions are used to mitigate this issue, the results are still not desirable, and the optimization is still likely to converge to local minima because of the ambiguous pixel-wise loss. As shown in Figure 5, in a common speed limit sign area from the KITTI dataset, the direct pixel-wise photometric loss leads to many local minima, as shown in the right curve chart. In contrast, the left curve chart shows the result of using our proposed patch-based ZNCC loss: the loss is smoother and more likely to converge to the global minimum. The experimental result (the last row in Figure 5) shows that our proposed method can effectively generate accurate depth in complex regions.
We propose a novel multi-scale patch-based cost function that adopts ZNCC as a similarity function to explicitly enlarge the matching field and increase the matching robustness. From another point of view, our proposed patch-based cost function implicitly integrates the classic Patch Matching (PM) algorithm as a minimization problem in our loss function. Although Garg et al have discussed the straightforward idea of using a stereo matching algorithm as a pre-processing step to generate "quasi ground-truth" depth for training, their result is not desirable due to the poor quality of the "quasi ground-truth". Recently, Luo et al also proposed a similar framework that first uses a DCNN to synthesize stereo pairs from single images and then uses stereo matching to get depth. In contrast to these two works, we treat stereo matching as a minimization problem and implement a fully differentiable PM algorithm as a cost function that is seamlessly integrated into our neural network. As the loss of the PM cost function can be passed through the whole network during backward propagation, our network can produce more robust and consistent depth through large-scale self-supervised training, and is not limited by the performance of off-the-shelf stereo matching algorithms.
Another novelty of our work is the confidence map. As monocular depth estimation itself is an ill-posed problem, although learning-based approaches achieve comparable results to stereo depth estimation, there are still many unavoidable mistakes in the predicted depth map. For the first time, our method is able to provide a pixel-wise confidence of the predicted depth by using a parallel DCNN to capture and learn the confidence during training. The confidence map will greatly improve the usability of deploying monocular depth estimation into many practical tasks.
Figure 2 illustrates the entire framework for our self-supervised monocular depth learning and confidence estimation networks. Since ground-truth depth is absent for supervised training, we treat monocular depth estimation as a problem of image synthesis error minimization during training. Specifically, during training, we use the left images of the stereo pairs to synthesize per-pixel depth using an encoder-decoder network (DepthNet), and the depth is converted into disparity maps by Equation 2. The disparity map is then used to guide the stereo view reconstruction and the sampling of patches. After that, the loss function is calculated based on the Patch Matching Loss, View Reconstruction Loss, Disparity Smoothness Loss, and Disparity Consistency Loss. As these processes are differentiable, back-propagation can be used to update the parameters of our depth learning network to minimize the total loss.
Since our patch-based ZNCC loss map represents the normalized inverted similarity between corresponding patches of the left and right images at each pixel, it can be approximated as the inverted confidence of the depth estimation result. We use the patch matching loss to self-supervise the training of a second encoder-decoder network, ConfidenceNet, which generates the confidence of the per-pixel depth estimation of our DepthNet.
The core part of our framework is the depth synthesis and generation. Our goal is to learn an implicit function that estimates a per-pixel depth from a single input image. Inspired by the architectures of FlowNet, DispNet and the networks of Godard et al and Zhou et al, we employ a VGG-like fully convolutional neural network architecture to generate per-pixel depth from a single image. Our encoder-decoder model is illustrated in Figure 3. The input image is encoded by 7 conv layers with stride 2, each followed by a conv layer with stride 1, which efficiently compress the input image into a feature tensor with 512 channels at a small fraction of the original size. Then the feature tensor is up-sampled by 7 deconv layers with stride 2, each followed by a conv layer with stride 1, which decode the feature tensor into a depth map at the full original size. Following prior work, 6 skip connections are implemented to preserve high-level information and ensure high-quality per-pixel prediction after up-sampling. Multi-scale depth images are output and used in further steps to constrain the network to a coarse-to-fine up-sampling.
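As a rough illustration of how strongly the encoder compresses its input, the short sketch below tracks the spatial size through the stride-2 layers. It assumes "same"-style padding, so that each stride-2 layer maps a size n to ceil(n / 2); the function name is hypothetical, and the 512x256 input size is the one quoted in the implementation details.

```python
import math

def encoder_feature_sizes(width, height, num_stride2_layers=7):
    """Spatial size after each stride-2 layer, assuming 'same' padding (n -> ceil(n / 2))."""
    sizes = [(width, height)]
    for _ in range(num_stride2_layers):
        width, height = math.ceil(width / 2), math.ceil(height / 2)
        sizes.append((width, height))
    return sizes

# 512x256 input (width x height) passed through 7 stride-2 layers
sizes = encoder_feature_sizes(512, 256)
print(sizes[-1])  # (4, 2)
```

Under this assumption, each spatial dimension shrinks by a factor of 2^7 = 128 before the decoder up-samples it again.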
View warping is an enabling technology for self-supervised learning frameworks. Given the per-pixel disparity map estimated from a single image in the previous step, the target view of the stereo pair can be reconstructed through the epipolar relationship in stereo vision. According to the epipolar constraint, the projection $p^r$ of a point on the right camera plane must lie on the epipolar line of its left projection $p^l$. For the calibrated stereo pairs discussed in this paper, $p^l$ and $p^r$ must be in the same row, and the disparity $d$ describes the horizontal displacement between the corresponding pixels $p^l$ and $p^r$. Through stereo triangulation, we get

$$d = \frac{b f}{D} \tag{2}$$

where $D$ is the depth estimated at the pixel $(x, y)$, and $b$ and $f$ are the camera baseline and focal length. Through this relationship, the target view in a stereo pair can be reconstructed given the source view and the corresponding depth (estimated by our depth synthesis network).
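The triangulation relation between depth and disparity for a rectified stereo pair, d = b f / D, can be sketched in a few lines of NumPy. The function names, the `eps` guard against division by zero and the numeric values are illustrative assumptions, not part of the original implementation; the baseline is taken in metres and the focal length in pixels, so the disparity comes out in pixels.

```python
import numpy as np

def depth_to_disparity(depth, baseline, focal_length, eps=1e-6):
    """Convert per-pixel depth D to disparity d = b * f / D."""
    depth = np.asarray(depth, dtype=np.float64)
    return baseline * focal_length / np.maximum(depth, eps)

def disparity_to_depth(disparity, baseline, focal_length, eps=1e-6):
    """Inverse relation: D = b * f / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    return baseline * focal_length / np.maximum(disparity, eps)

# Illustrative values only: 0.5 m baseline, 700 px focal length
depth = np.array([[10.0, 20.0], [35.0, 70.0]])
disparity = depth_to_disparity(depth, baseline=0.5, focal_length=700.0)
print(disparity)  # nearer points get larger disparities
```

Because the relation is its own inverse up to the constant b f, converting to disparity and back recovers the original depth.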
However, directly mapping one known view to the other (forward mapping) results in holes in the target image and is not differentiable. Therefore, we use inverse mapping: for each pixel in the target view, we pick points from the source view, guided by the disparity, to reconstruct the target view. Thus, a complete and differentiable target view can be generated. Bilinear sampling is then used to get the interpolated pixel value from the source view.
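A minimal sketch of inverse mapping with interpolated sampling is given below. It assumes a single-channel image and a rectified pair, so that sampling positions move only along x and bilinear sampling reduces to linear interpolation between two horizontal neighbours. The function name and the sign convention (the caller passes a signed per-pixel shift) are assumptions for illustration.

```python
import numpy as np

def warp_horizontal(source, shift):
    """Build the target view by sampling `source` at x + shift for every target pixel."""
    h, w = shift.shape
    # Sampling positions in the source image, clamped to the image border
    xs = np.clip(np.arange(w, dtype=np.float64)[None, :] + shift, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(int)      # left neighbour
    x1 = np.minimum(x0 + 1, w - 1)     # right neighbour
    frac = xs - x0                     # interpolation weight, linear in the shift
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * source[rows, x0] + frac * source[rows, x1]

source = np.tile(np.arange(8.0), (2, 1))   # horizontal intensity ramp
target = warp_horizontal(source, np.full((2, 8), 1.5))
print(target[0])
```

Every target pixel receives a value, so no holes appear, and the interpolation weights vary piecewise linearly with the shift, which is what allows gradients to reach the disparity.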
Inspired by the stereo view reconstruction described above, we propose a novel patch sampling process guided by the disparity estimated by our DepthNet. $P_w(x, y)$ is defined as a patch with window size $w$, centered at the coordinate $(x, y)$. We sample patches at each pixel in the left image, $P^l_w(x, y)$, and the corresponding patches in the right image shifted by the disparity value of each pixel, $P^r_w(x - d, y)$. According to Equation 2, if the estimated depth $D$ (and hence the disparity $d$) is correct, then we have $P^l_w(x, y) \approx P^r_w(x - d, y)$. This relationship is used to construct the patch matching loss. The sampled patches are computed and stored in vectorized form so that they can be processed in parallel on the GPU for accelerated computation.
The patch size is very important and affects the final performance of the similarity measurement. However, there is no single optimal patch size, and the performance varies greatly across different images and local details. When a small patch size is used, little information is captured and the robustness of the similarity comparison decreases. A large patch size greatly increases the computational complexity and also cannot recover accurate depth at stereo occlusions and depth discontinuities. Therefore, we use a multi-scale patch sampling scheme and sample a combination of 4 different patch sizes in an image to fully exploit the effects of different patch sizes. We discuss the choice of patch sizes in Section 4.1.
We define a loss function that combines multiple terms to effectively train our networks for accurate, smooth and realistic depth:

$$L = \lambda_{pm} L_{pm} + \lambda_{vr} L_{vr} + \lambda_{ds} L_{ds} + \lambda_{dc} L_{dc}$$

where, from left to right, the terms are the Patch Matching Loss, View Reconstruction Loss, Disparity Smoothness Loss and Disparity Consistency Loss, and the $\lambda$ are the corresponding weights that balance their contributions to the back-propagated gradients. Each loss function is explained in detail below.
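Classic patch matching algorithms obtain correct disparities by finding the best-matched patches in the left and right images. Inspired by this, we propose a patch matching loss that maximizes the similarity (minimizes the difference) between patches in the left image and the shifted patches in the right image. Here, ZNCC is used to compute a normalized similarity between the patches $P^l$ and $P^r$:

$$\mathrm{ZNCC}(P^l, P^r) = \frac{\sum_{i}\left(P^l_i - \bar{P}^l\right)\left(P^r_i - \bar{P}^r\right)}{\sqrt{\sum_{i}\left(P^l_i - \bar{P}^l\right)^2 \sum_{i}\left(P^r_i - \bar{P}^r\right)^2}}$$

where the sums run over the pixels $i$ of a patch and $\bar{P}$ is the mean intensity of the patch centered at the coordinate $(x, y)$.
The ZNCC returns a similarity ranging from $-1$ to $1$. We first normalize it into $[0, 1]$, then invert it to get the patch matching loss:

$$L_{pm} = 1 - \frac{\mathrm{ZNCC}(P^l, P^r) + 1}{2} \tag{5}$$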
Our patch matching loss is computed at all 4 patch sizes to cover both small structures and large areas. There are several advantages of using our patch-based ZNCC loss to regularize the depth synthesis:
(1) Our patch matching loss measures similarity over patches, which cover larger regions than the direct pixel-wise photometric loss used in previous work; this is more robust and can achieve sub-pixel accuracy. Figure 5 demonstrates the effect of our patch-based ZNCC loss. We charted the values of our patch-based ZNCC loss and of the photometric loss against the disparity value of a pixel located at the center of the image patch "6". With our proposed patch-based ZNCC loss, the loss is smoother and more likely to converge to the global minimum, whereas the direct pixel-wise photometric loss leads to many local minima, as shown in the right curve chart.
(2) Compared to other similarity measures such as absolute intensity difference (AD), Census, and Normalized Cross Correlation (NCC), ZNCC is especially robust against Gaussian noise and variation between the compared patches, which can help to recover more accurate depth in our self-supervised framework.
(3) As a zero-mean normalized similarity measure, ZNCC provides a similarity value ranging from $-1$ to $1$. After normalization to $[0, 1]$ as shown in Equation 5, it can be regarded as the confidence of the generated depth at each pixel, which can be further used to self-supervise the training of our confidence network.
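The ZNCC-based patch matching loss can be sketched with NumPy as follows. This is a simplified illustration rather than the original implementation: it compares co-located patches of two single-channel images that are assumed to be already aligned (for example, a left image and a right image resampled toward it), it uses one window size, and it only evaluates pixels whose window fits inside the image. The function names and the `eps` stabilizer are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def zncc_map(a, b, window=3, eps=1e-8):
    """Per-pixel ZNCC between co-located window x window patches of a and b."""
    pa = sliding_window_view(a, (window, window)).astype(np.float64)
    pb = sliding_window_view(b, (window, window)).astype(np.float64)
    # Zero-mean: subtract each patch's own mean intensity
    pa = pa - pa.mean(axis=(-2, -1), keepdims=True)
    pb = pb - pb.mean(axis=(-2, -1), keepdims=True)
    num = (pa * pb).sum(axis=(-2, -1))
    den = np.sqrt((pa ** 2).sum(axis=(-2, -1)) * (pb ** 2).sum(axis=(-2, -1)))
    return num / (den + eps)

def patch_matching_loss(a, b, window=3):
    """Map ZNCC from [-1, 1] to [0, 1], then invert so that 0 means a perfect match."""
    return 1.0 - (zncc_map(a, b, window) + 1.0) / 2.0

rng = np.random.default_rng(0)
left = rng.random((6, 7))
# A gain and offset change leaves ZNCC (and therefore the loss) essentially unchanged
print(patch_matching_loss(left, 2.0 * left + 5.0).max())
```

Because each patch is mean-centred and normalized, the loss is insensitive to brightness and contrast changes between the two views. A perfectly flat patch has zero variance, so its ZNCC is undefined; with the `eps` guard used here it evaluates to 0, giving a loss of 0.5.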
We use the view reconstruction loss as a second source of supervision for the depth synthesis. Guided by the synthesized depth, the right views can be reconstructed by collecting pixels from the left images. The view reconstruction loss is defined as the L1 loss between the reconstructed view $\tilde{I}^r$ and the original view $I^r$:

$$L_{vr} = \lVert \tilde{I}^r - I^r \rVert_1$$
Compared to the patch matching loss, the view reconstruction L1 loss is more sensitive to small structures and depth discontinuities and can provide more detailed depth information.
We use a disparity smoothness term to regularize our network to produce smoother depth. Similar to prior work, we use the sum of the L1 norms of the disparity gradients along the $x$ and $y$ directions as a smoothness factor. Edge-aware terms are used to reduce the penalty on edges, where depth discontinuities usually occur, which prevents over-smoothing.
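An edge-aware smoothness term of this kind can be sketched as below. The exact edge-aware weighting is not recoverable from the text, so the exponential weight exp(-|image gradient|), a form commonly used in related work, is an assumption here, as are the function name and the use of single-channel inputs.

```python
import numpy as np

def smoothness_loss(disparity, image):
    """L1 norm of disparity gradients along x and y, down-weighted at image edges."""
    dx_d = np.abs(np.diff(disparity, axis=1))
    dy_d = np.abs(np.diff(disparity, axis=0))
    dx_i = np.abs(np.diff(image, axis=1))
    dy_i = np.abs(np.diff(image, axis=0))
    # Assumed edge-aware weight: strong image gradients reduce the penalty
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()

disparity = np.tile(np.array([0.0, 0.0, 1.0, 1.0]), (3, 1))   # one disparity step
flat_image = np.zeros((3, 4))
edge_image = np.tile(np.array([0.0, 0.0, 10.0, 10.0]), (3, 1))
print(smoothness_loss(disparity, flat_image), smoothness_loss(disparity, edge_image))
```

The same disparity step is penalized heavily where the image is flat and only slightly where it coincides with an image edge, which is the behaviour the edge-aware terms are meant to provide.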
The left-right disparity consistency loss proposed in prior work has brought a great improvement to monocular depth generation. Here, we adopt this loss function in our framework. Both the left and right image disparities are generated, and the difference between the left disparity map and the left disparity map reconstructed from the right disparity is computed and minimized. This loss enforces coherence between the left and right disparities.
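A left-right consistency check can be sketched as follows. The sketch assumes positive disparities and the convention that a left pixel at column x corresponds to the right pixel at column x - d; the sign convention, the function name and the border clamping performed by `np.interp` are assumptions for illustration.

```python
import numpy as np

def lr_consistency_loss(disp_left, disp_right):
    """Mean |d_l(x, y) - d_r(x - d_l(x, y), y)| over all pixels."""
    h, w = disp_left.shape
    x = np.arange(w, dtype=np.float64)
    total = 0.0
    for row in range(h):
        # Sample the right disparity at the position each left pixel maps to
        projected = np.interp(x - disp_left[row], x, disp_right[row])
        total += np.abs(disp_left[row] - projected).sum()
    return total / (h * w)

consistent = lr_consistency_loss(np.full((2, 5), 2.0), np.full((2, 5), 2.0))
inconsistent = lr_consistency_loss(np.full((2, 5), 2.0), np.full((2, 5), 3.0))
print(consistent, inconsistent)  # 0.0 1.0
```

When the two disparity maps agree after projection, the loss is zero; any disagreement is penalized in proportion to its magnitude.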
One of the advantages of our proposed patch matching loss is that a normalized similarity measurement can be generated for each pixel at training time. With the well-known epipolar constraint, the per-pixel confidence of the estimated depth can be approximated by the normalized similarity between the left patches and the corresponding patches in the right image.
Here, we propose to use another encoder-decoder network to learn the confidence map generated by our depth estimation network during training, so that the confidence map can be preserved and generated at test time. We tried to train the confidence and depth in one network, as in prior multi-task approaches, but the multi-task training reduced the depth estimation performance. Therefore, we use a parallel encoder-decoder network to learn the confidence, supervised by the per-pixel ZNCC loss of our depth estimation network. The loss of our ConfidenceNet is shown below:

$$L_c = \lVert C - \left(1 - \mathrm{static}(L_{pm})\right) \rVert_1$$

where $C$ is the generated confidence map and $L_{pm}$ is the patch matching loss from our depth estimation network described in the sections above. The static copy $\mathrm{static}(\cdot)$ is used here to prevent gradients from propagating back to the depth estimation network. The operation $1 - (\cdot)$ inverts the loss to a confidence, and the L1 loss is used to assess the confidence estimation error.
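The confidence supervision described above can be sketched in NumPy as follows. NumPy has no automatic differentiation, so the "static copy" is represented by an explicit array copy; in an autodiff framework a stop-gradient operation (for example `tf.stop_gradient` in TensorFlow) would play this role, though whether the original code used that call is not stated. The function name and the mean reduction are assumptions.

```python
import numpy as np

def confidence_loss(pred_confidence, pm_loss):
    """L1 error between predicted confidence and the inverted patch matching loss."""
    # Static copy: the target is treated as a constant, not as a function of DepthNet
    target = 1.0 - np.array(pm_loss, dtype=np.float64, copy=True)
    return np.abs(np.asarray(pred_confidence, dtype=np.float64) - target).mean()

pm_loss = np.array([[0.0, 1.0], [0.25, 0.5]])      # per-pixel loss in [0, 1]
perfect = np.array([[1.0, 0.0], [0.75, 0.5]])      # confidence = 1 - loss
print(confidence_loss(perfect, pm_loss))            # 0.0
print(confidence_loss(np.zeros((2, 2)), pm_loss))   # 0.5625
```

A low patch matching loss yields a high confidence target, so the confidence network is rewarded for predicting where the depth network matched patches well.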
Instead of using the same encoder-decoder network structure as our DepthNet, we employ a simpler structure that uses only the first 5 conv layers and the last 5 deconv layers, without skip connections, as described in Figure 3, for two reasons:
(1) To reduce memory usage and training time, as training two neural networks at the same time is very computationally expensive. The second network can be replaced by a deeper and more complex encoder-decoder network to produce sharper and more accurate confidence, but the main purpose of our work is to prove that our self-supervised monocular depth learning and confidence estimation framework is feasible and helpful for depth prediction, hence we choose to use a simple network structure as the proof of concept.
(2) We intend to use a simpler network with fewer weights to prevent over-fitting to noise and to learn a more generic confidence: high confidence in texture-rich areas and low confidence in texture-less, blurry and occluded areas, which is what this confidence network is designed for.
In this section, we evaluate our framework and compare the results with prior approaches both quantitatively and qualitatively on the KITTI dataset. We use the rectified stereo image pairs to train our networks. At test time, we use the left image to generate depth, and the corresponding sparse LIDAR data serves as the ground truth for benchmarking.
Our networks are implemented in TensorFlow and trained on a workstation with a single Nvidia Titan X GPU (12 GB memory). Our models take around 60 hours to train for 50 epochs. In testing mode, our networks can output depth and confidence maps at around 20 frames per second.
Hyper Parameters. All input images are scaled to 512x256, and a batch size of 4 is used. The Adam optimizer is used with an initial learning rate that decays after half of the training process. A separate weight is assigned to each of the four terms of the total loss function for the depth estimation network.
Data Augmentation. Following the data augmentation approach of prior work, we randomly flip the images and apply gamma, brightness and color shifts to increase the robustness of the network and prevent over-fitting.
Multi-scale Implementation. We employ a multi-scale strategy to ensure a coarse-to-fine up-sampling. As can be seen from Figure 3, 4 depth scales are output, from reduced resolutions up to full resolution. All of our loss functions are computed for each of these 4 scales, and for each of the left and right images/disparities. We take the means of these loss functions as the final loss.
Patch Size. By applying different patch sizes on different image scales, we can get very large equivalent patch sizes with less computation. The patch sizes for our patch-based ZNCC loss on the 4 scales were chosen based on empirical tests; each corresponds to a larger equivalent window on the full-resolution image.
To be able to compare with the state-of-the-art monocular depth learning approaches, we trained and evaluated our networks using two different train/test splits: Godard and Eigen.
Godard Split. We use the same train/test sets that Godard et al proposed in their work. 200 high-quality disparity images in 28 scenes provided by the official KITTI training set serve as the ground truth for benchmarking. Of the remaining 33 scenes with a total of 30,159 images, 29,000 images are picked for training and the remaining 1,159 images for testing.
Eigen Split. For a fair comparison with more previous works, we also use the test split proposed by Eigen et al, which has been widely used in the works of Garg et al, Liu et al, Zhou et al and Godard et al. This test split contains 697 images of 29 scenes. The remaining 32 scenes contain 23,488 images, of which 22,600 are used for training and the rest for testing, following these earlier works.
The evaluation results on the KITTI dataset are reported in Table 1. We use different combinations of train/test splits (E for Eigen, G for Godard) and cap distances (80m and 50m) to compare with different works. For Eigen et al, Liu et al, Zhou et al and Godard et al, the Eigen split with an 80m cap distance is used. For Garg et al, Zhou et al and Godard et al, the Eigen split with a 50m cap distance is used. We also report our result on the Godard split with an 80m cap. The results show that our method outperforms all compared methods and produces state-of-the-art results for the monocular depth estimation problem on the KITTI dataset.
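Table 1: Evaluation results on the KITTI dataset. Abs Rel, Sq Rel, RMSE, RMSE log and D1-all are error measures (lower is better); the δ columns are accuracy measures (higher is better).

| Method | Supervision | Split | Cap (m) | Abs Rel | Sq Rel | RMSE | RMSE log | D1-all | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Eigen et al | Yes | E | 80 | 0.203 | 1.548 | 6.307 | 0.282 | - | 0.702 | 0.890 | 0.958 |
| Liu et al | Yes | E | 80 | 0.201 | 1.584 | 6.471 | 0.273 | - | 0.680 | 0.898 | 0.967 |
| Zhou et al | No | E | 80 | 0.208 | 1.768 | 6.856 | 0.283 | - | 0.678 | 0.885 | 0.957 |
| Godard et al | No | E | 80 | 0.148 | 1.344 | 5.927 | 0.247 | - | 0.803 | 0.922 | 0.964 |
| Garg et al | No | E | 50 | 0.169 | 1.080 | 5.104 | 0.273 | - | 0.740 | 0.904 | 0.962 |
| Zhou et al | No | E | 50 | 0.201 | 1.391 | 5.181 | 0.264 | - | 0.696 | 0.900 | 0.966 |
| Godard et al | No | E | 50 | 0.140 | 0.976 | 4.471 | 0.232 | - | 0.818 | 0.931 | 0.969 |
| Godard et al | No | G | 80 | 0.124 | 1.388 | 6.125 | 0.217 | 30.272 | 0.841 | 0.936 | 0.975 |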
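Assuming the error and accuracy columns follow the measures commonly reported for monocular depth on KITTI (absolute relative error, squared relative error, RMSE, RMSE in log space, and the fraction of pixels whose ratio to the ground truth is below 1.25, 1.25² and 1.25³), they can be computed as sketched below. The function name, the masking of invalid ground-truth pixels and the clipping of predictions to the cap distance are assumptions based on common evaluation practice, not details confirmed by the text.

```python
import numpy as np

def depth_metrics(gt, pred, cap=80.0, min_depth=1e-3):
    """Standard monocular depth error and accuracy measures on valid pixels."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    valid = (gt > min_depth) & (gt < cap)          # sparse LIDAR: skip empty pixels
    gt = gt[valid]
    pred = np.clip(pred[valid], min_depth, cap)    # cap predictions at the same distance
    diff = gt - pred
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(diff) / gt),
        "sq_rel": np.mean(diff ** 2 / gt),
        "rmse": np.sqrt(np.mean(diff ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "a1": np.mean(ratio < 1.25),
        "a2": np.mean(ratio < 1.25 ** 2),
        "a3": np.mean(ratio < 1.25 ** 3),
    }

# Toy example: two valid pixels, one missing (0) and one beyond the 80 m cap
metrics = depth_metrics(gt=[10.0, 20.0, 0.0, 100.0], pred=[10.0, 25.0, 5.0, 90.0])
print(metrics["abs_rel"], metrics["a1"])  # 0.125 0.5
```

The cap argument corresponds to the 80m and 50m cap distances used in the comparison.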
The qualitative comparison with some of the related methods on the KITTI dataset is shown in Figure 6. Our network structure is similar to that of Godard et al, and both generate clearer and more accurate depth than the other works. We also provide a detailed comparison with the results of Godard et al in the lower part of Figure 6. Our network can generate more accurate depth in complex regions with thin structures and texture-less areas, such as pillars and traffic signs. This supports the explanation given with Figure 5 that our patch-based loss function is more robust and converges more easily to the global minimum in complex regions.
Figure 6: Qualitative comparison on the KITTI dataset. Columns: Input, Ground-truth, Garg et al, Zhou et al, Godard et al, Ours.
We show the confidence estimation results in Figure 7. A colorbar from red to yellow is used to represent 0 to 1. The estimated confidence closely follows the inverted ZNCC loss but is less noisy, owing to the small network we use to prevent over-fitting. The confidence overlaid on the input image shows that our ConfidenceNet has learned to generate confidence from contextual information. For example, in texture-less areas (sky, buildings), dark areas (trees in shadow), occluded areas (around thin structures) and reflective areas (car windows), the estimated confidence is usually very low, while texture-rich areas and edges usually have high confidence.
Figure 7: Confidence estimation results. Columns: Input, Est. Depth, ZNCC Loss, Est. Confidence, Overlay.
In this paper, we have presented a novel self-supervised framework for monocular depth learning and confidence estimation. We incorporate patch matching theory into a fully differentiable DCNN and achieve self-supervised training of both depth and the confidence of depth. Our proposed loss function exploits the epipolar constraint of stereo vision and also provides a normalized similarity that is further used to supervise the confidence estimation. Our method not only outperforms the state-of-the-art results on the KITTI benchmark evaluation but also, for the first time, simultaneously generates depth from monocular images and estimates the confidence of the generated depth. This is a step change for monocular depth estimation, as it significantly increases the feasibility of using monocular depth estimation in many practical applications such as autonomous driving and monocular endoscopic surgery, where the accuracy of estimated depth is crucial.
Why Does Our ConfidenceNet Work? Since our ConfidenceNet is supervised by the per-pixel ZNCC loss of our depth estimation network, it explicitly learns the regions where our depth estimation network performs well and where it performs badly. On a deeper level, our ConfidenceNet implicitly learns the inherent defects of the patch matching algorithm: it fails on texture-less regions and performs badly near stereo view occlusions, reflections and blurred areas. Therefore, after sufficient training steps, our ConfidenceNet can estimate the confidence of our DepthNet, although they are two different networks.
Future Work. We will continue optimizing our model and explore the possibility of using an adaptive window size for patch sampling to decrease the training time and increase accuracy on small structures.