Code Repo for "Single View Stereo Matching" (CVPR'18 Spotlight)
Previous monocular depth estimation methods take a single view and directly regress the expected results. Though recent advances have been made by applying geometrically inspired loss functions during training, the inference procedure does not explicitly impose any geometrical constraint. Therefore these models rely purely on the quality of data and the effectiveness of learning to generalize. This leads either to suboptimal results or to a demand for a huge amount of expensive ground-truth labelled data to generate reasonable results. In this paper, we show for the first time that the monocular depth estimation problem can be reformulated as two sub-problems, a view synthesis procedure followed by stereo matching, with two intriguing properties, namely i) geometrical constraints can be explicitly imposed during inference; ii) the demand for labelled depth data can be greatly alleviated. We show that the whole pipeline can still be trained in an end-to-end fashion and that this new formulation plays a critical role in advancing the performance. The resulting model outperforms all previous monocular depth estimation methods as well as the stereo block matching method on the challenging KITTI dataset while using only a small amount of real training data. The model also generalizes well to other monocular depth estimation benchmarks. We also discuss the implications and the advantages of solving monocular depth estimation using stereo methods.
Depth estimation is one of the fundamental problems in computer vision. It finds important applications in a large number of areas such as robotics, augmented reality, 3D reconstruction and self-driving cars. This problem is heavily studied in the literature and is mainly tackled with two types of technical methodologies, namely active stereo vision such as structured light and time-of-flight, and passive stereo vision including stereo matching [17, 25], structure from motion, photometric stereo and depth cue fusion. Among passive stereo vision methods, stereo matching is arguably the most widely applicable technique because it is accurate and makes few assumptions about the sensors and the imaging procedure. Recent advances in this field show that the quality of stereo matching can be significantly improved by deep models trained with synthetic data and finetuned with a limited amount of real data [26, 28].
On the other hand, the applicability of monocular depth estimation is greatly limited by its accuracy, even though the single-camera setting is much preferred in practice because it avoids the calibration errors and synchronization problems that occur in the stereo camera setting. Estimating depth from a single view is difficult because it is an ill-posed and geometrically ambiguous problem. Advances in monocular depth estimation have recently been made by deep learning methods [4, 19, 20, 23]. However, compared to the aforementioned passive stereo vision methods, which are grounded in geometric correctness, the formulation of the current state-of-the-art monocular methods is problematic. The reasons are twofold. First, current deep learning approaches to this problem rely almost completely on high-level semantic information and directly relate it to the absolute depth value. Because the operations in the network are general and do not have any prior knowledge of the function they need to approximate, learning such semantic information is difficult even if some special constraints are imposed in the loss function. Second, even if effective learning can be achieved, the relationship between scene understanding and depth needs to be established by a huge amount of real data with ground truth depth. Such data is not only very expensive to obtain at scale; collecting high-quality dense labels is also very difficult and time consuming, if not entirely impossible. This significantly limits the potential of the current formulation.
In this paper, we take a novel perspective and show for the first time that the monocular depth estimation problem can be formulated as a stereo matching problem in which the right view is automatically generated by a high-quality view synthesis network. The whole pipeline is shown in figure 1. The key insights here are that i) both view synthesis and stereo matching respect the underlying geometric principles; ii) both of them can be trained without using expensive real depth data and thus generalize well; iii) the whole pipeline can be collectively trained in an end-to-end fashion that optimizes the geometrically correct objectives. Our method shares a similar idea with the Spatial Transformer Network: although deep models can learn the necessary transformations by themselves, it might be more beneficial to model such transformations explicitly. We find that the resulting model is able to outperform all previous methods on the challenging KITTI dataset while using only a small amount of real training data. The model also generalizes well to other monocular depth estimation datasets.
Our contributions can be summarized as follows.
First, we discover that the monocular depth estimation problem can be effectively decoupled into two sub-problems with geometrical soundness. It forms a new foundation in advancing the performance in this field.
Second, we show that the whole pipeline can be trained end-to-end and that it outperforms all previous monocular methods by a large margin using a fraction of the training data. Notably, this is the first monocular method to outperform the stereo block matching algorithm in terms of overall accuracy.
There exists a large body of literature on depth estimation from images, using either a single view, stereo views, several overlapping images from different viewpoints, or a temporal sequence. For monocular depth estimation, Saxena et al. propose one of the first supervised learning-based approaches to single image depth map prediction. They model depth prediction in a Markov random field and use hand-crafted multi-scale texture features. Recently, deep learning has proven its ability in many computer vision tasks, including single image depth estimation. Eigen et al. propose the first CNN framework that predicts the depth in a coarse-to-fine manner. Laina et al. employ a deeper ResNet structure with an efficient up-sampling design and achieve a boosted performance. Liu et al. also propose a deep structured learning approach that allows for training CNN features of unary and pairwise potentials in an end-to-end way. Chen et al. provide a novel insight by incorporating pair-wise depth relations into CNN training. Compared with depth, these pixel-level rankings are much easier to obtain. Further lines of research in supervised training of depth map prediction use the idea of depth transfer from example images [14, 15, 24], or combine depth prediction with semantic segmentation [3, 18, 21, 22, 36]. However, a large amount of high-quality labels is needed to establish the transformation from image space to depth space. Such data are not easy to collect at scale in real life.
Recently, a small number of deep network based methods have attempted to estimate depth in an unsupervised way. Garg et al. first introduce an unsupervised method that is supervised only by an image alignment loss. However, their loss is not fully differentiable, so they apply a first-order Taylor expansion to linearize it for back-propagation. Godard et al. also propose an unsupervised deep learning framework, and they employ a novel loss function to enforce consistency between the predicted depth maps from each camera view. Kuznietsov et al. adopt a semi-supervised deep method to predict depth from single images. Sparse depth from LiDAR sensors is used for supervised learning, while a direct image alignment loss is integrated to produce photoconsistent dense depth maps in a stereo setup. Zhou et al. jointly estimate depth and camera pose in an unsupervised manner.
Although those unsupervised methods reduce the demand for expensive depth ground truth, their mechanisms are still inherently problematic since they attempt to regress a depth/disparity directly from a single image. The network architecture itself does not assume any geometric constraints and acts like a black box. In our work, we propose a novel strategy to decompose this task into two separate procedures, namely synthesizing a corresponding right view followed by a stereo matching procedure. This idea is similar to the Spatial Transformer Network, which learns a transformation within the network before conducting visual tasks like recognition.
To synthesize a novel view, DeepStereo first proposes to render an unseen view by taking pixels from other views, and another approach predicts the appearance flow to reconstruct the target view. The Deep3D network of Xie et al. addresses the problem of generating the corresponding right view from an input left image. Their method produces a distribution over all the possible disparities for each pixel, which is used to generate the right image.
Conducting stereo matching on the original left input and the synthetic right view is now a 1D matching problem. The vast majority of works on stereo matching focus on learning a matching function that searches for corresponding pixels in the two images [17, 25]. Mayer et al. introduce their fully convolutional DispNet to directly regress the disparity from the stereo pair. Later, Pang et al. adopt a multi-scale residual network developed from DispNet and obtain refined results. These methods still rely on a large amount of labelled disparity as ground truth. Instead of using data from the real world, training on synthetic data has become a more feasible solution for these approaches.
In this section, we demonstrate how we decompose the task of monocular depth estimation into two separate tasks, and we illustrate our model design for view synthesis and stereo matching separately.
In our pipeline, we decompose the task of monocular depth estimation into two tasks, namely view synthesis and stereo matching. The whole pipeline is shown in figure 2. By tackling this problem in two separate steps, we find that both procedures obey primary geometric principles and that they can be trained without an expensive data supply. After that, these networks can be collectively trained in an end-to-end manner. We further hypothesize that, when the whole pipeline is trained end-to-end, neither component will lose its capacity to constrain geometric correctness, and the performance of the whole pipeline will improve thanks to joint training. Therefore, for both stages we choose methods that explicitly model the geometric transformation in the network design.
The first stage is view synthesis. For a stereo pair, the binocular views are captured by well synchronized and calibrated cameras, resulting in a strong correspondence between pixels in the horizontal direction. Unlike previous warp-based methods that generally require an accurate estimation of the underlying geometry, Deep3D proposes a new probabilistic scheme to transfer pixels from the original image. By this means, it directly formulates the transformation from left image to right image using a differentiable selection layer. We adopt its design and develop our view synthesis network based on it. Other reconstruction schemes [8, 10, 16] are also viable alternatives, but the choice of the specific view synthesis method is independent of the main insight of the paper.
After a high-quality novel view has been generated, our stereo matching network transforms the high-level scene understanding problem into a 1D matching problem, which has lower computational complexity. In order to better utilize the geometric relation between the two views, we take the idea of 1D correlation employed in DispNetC. We further adopt the DispFullNet structure to achieve full-resolution prediction.
Our view synthesis network is shown in the upper part of figure 2. We develop this network based on the Deep3D model. Here we briefly introduce its structure. At the very beginning, an input left image is processed by a baseline network. We then upsample the features from different intermediate levels to the same resolution, in order to incorporate low-level features into the final prediction. Those features are then summed up to produce a probabilistic disparity map. Through a selection operation, pixels of the original left image are then selectively blended to form each new pixel of the right image.
The selection operation is the core component of this network. This module is also illustrated in figure 2. Denote $I_l$ as the input left image. Previous Depth Image-Based Rendering (DIBR) techniques directly warp the left image into a corresponding right image based on an estimated disparity. Suppose $D$ is the predicted disparity aligned with the left image; the procedure can be formulated as
$$\tilde{I}_r(i, j - D_{ij}) = I_l(i, j), \quad (i, j) \in \Omega_l,$$
where $\Omega_l$ is the image space of $I_l$, and $i$, $j$ refer to the row and column on $I_l$ respectively. Though this function captures the geometric correspondence between images in a stereo setup, it requires an accurate disparity map to reconstruct the right view. At the same time, the function is not fully differentiable with respect to $D$, which limits the opportunity of training with a deep neural network. The selection module, instead, formulates the reconstruction as a process of probabilistic summation. Denote $\tilde{D} \in \mathbb{R}^{W \times H \times N}$ as the probabilistic disparity result, where $W$ and $H$ are the width and height of the left image and $N$ indicates the number of possible disparity shifts. The reconstruction can then be formulated as
$$\tilde{I}_r = \sum_{n=1}^{N} I_l^{n} \tilde{D}_n,$$
where $I_l^{n}$ is the shifted left image whose stride is predetermined by the possible disparity values. This operation sums up the stacked shifted inputs with learned weights and ensures the differentiability of the whole system.
To supervise the reconstruction quality, we do not propose any special loss function. We find that a simple L1 loss on the reconstructed appearance is sufficient for the task of view synthesis:
$$L_{recon} = \left| I_r - \tilde{I}_r \right|,$$
where $I_r$ is the ground-truth right view.
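The selection operation can be illustrated with a minimal NumPy sketch. It is not the authors' Caffe implementation: the function names, the single-channel image layout, the border replication, and the convention that channel n corresponds to a shift of n pixels are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shift_left_image(left, d):
    """Shift a (H, W) image by d pixels so that out[i, j - d] = left[i, j].

    Columns without a source pixel replicate the last column
    (an assumption; zero padding would also be possible).
    """
    if d == 0:
        return left.copy()
    out = np.empty_like(left)
    out[:, :-d] = left[:, d:]
    out[:, -d:] = left[:, -1:]
    return out

def select_right_view(left, logits):
    """Blend shifted copies of `left` with per-pixel probabilities.

    logits has shape (N, H, W); channel n scores a shift of n pixels
    (hypothetical convention for this sketch).
    """
    weights = softmax(logits, axis=0)  # probabilistic disparity map
    shifted = np.stack([shift_left_image(left, d)
                        for d in range(logits.shape[0])])
    return (weights * shifted).sum(axis=0)

def l1_loss(pred, target):
    """Mean absolute difference between two images."""
    return np.abs(pred - target).mean()
```

Because the output is a weighted sum rather than a hard warp, gradients flow to every disparity channel, which is the property the text relies on for end-to-end training.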
There exists a large body of literature tackling the problem of stereo matching. Recent advances have been achieved by deep learning models. Not only do deep networks help to find similar pixel pairs effectively, research also shows that these networks can be trained on a large amount of synthetic data and still generalize well to real images. In our pipeline, we select the state-of-the-art DispNetC structure as the network for the stereo matching task. We further adopt the DispFullNet structure for full-resolution output. The structure of this method can be seen in the lower part of figure 2. We briefly illustrate the method here; the detailed settings can be found in the original papers.
After both images have been processed by several convolutional layers, a 1D correlation is calculated on the resulting features. This correlation layer is very useful for the stereo matching problem since it explicitly encodes the geometric relationship into the model design, and horizontal correlation is indeed an effective cue for finding the most similar pairs. The correlation features are then concatenated with higher-level features from the left image. An encoder-decoder network further processes the concatenated features and produces disparity at different scales. These intermediate and final results are supervised by ground truth disparity using an L1 loss.
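The following NumPy sketch shows one way a 1D (horizontal) correlation between two feature maps could be computed. The function name, the channel-mean normalisation and the zero fill for positions without a match are assumptions for illustration; the exact layer used in DispNetC may differ in its details.

```python
import numpy as np

def correlation_1d(feat_left, feat_right, max_disp):
    """Horizontal correlation between two (C, H, W) feature maps.

    Returns a volume of shape (max_disp + 1, H, W). Channel d holds the
    channel-averaged dot product of feat_left[:, i, j] and
    feat_right[:, i, j - d]. Positions with j < d have no match and stay zero.
    """
    _, h, w = feat_left.shape
    volume = np.zeros((max_disp + 1, h, w), dtype=np.float64)
    for d in range(max_disp + 1):
        if d == 0:
            volume[d] = (feat_left * feat_right).mean(axis=0)
        else:
            prod = feat_left[:, :, d:] * feat_right[:, :, :-d]
            volume[d, :, d:] = prod.mean(axis=0)
    return volume
```

Restricting the search to the same row is what turns matching into a 1D problem: each left pixel is compared only with right pixels at horizontal offsets 0 to `max_disp`.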
Once the two networks have been trained separately and have acquired the ability of geometric reasoning for view synthesis and stereo matching respectively, they can be combined for joint training. End-to-end training of the whole pipeline can then be performed to enforce the collaboration of these two sub-networks.
In this section, we present our experiments and results. Our method achieves state-of-the-art monocular depth estimation results on the widely used KITTI dataset. We show the key insights of this method and validate our methodology. We also make the first attempt to run a single view approach on the challenging KITTI Stereo 2015 benchmark.
We evaluate our approach on the publicly available KITTI benchmark. In order to compare fairly with other methods for monocular depth estimation, we use the raw sequences of KITTI and employ the split scheme proposed by Eigen et al. This split results in a test set of 697 images. The remaining data is used for training and validation. Overall we have 22600 stereo pairs for training our view synthesis network. Besides stereo image pairs, the dataset also contains sparse 3D laser measurements taken by a Velodyne laser sensor. They can be projected onto the image space and serve as depth labels. The parameters of the stereo setup and the camera intrinsics are also provided, so we can convert depth into disparity as ground truth during end-to-end training and recover depth from disparity during inference.
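For a rectified stereo pair, depth and disparity are related by depth = focal length × baseline / disparity. The sketch below illustrates this conversion; the function names are hypothetical, and the focal length and baseline must be read from the dataset calibration files (the values in the usage note are placeholders, not KITTI's).

```python
import numpy as np

def depth_to_disparity(depth, focal_px, baseline_m):
    """Convert metric depth to disparity in pixels; empty (<= 0) entries stay 0."""
    depth = np.asarray(depth, dtype=np.float64)
    disparity = np.zeros_like(depth)
    valid = depth > 0
    disparity[valid] = focal_px * baseline_m / depth[valid]
    return disparity

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert disparity in pixels to metric depth; empty (<= 0) entries stay 0."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.zeros_like(disparity)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```

With a placeholder focal length of 720 px and a baseline of 0.5 m, a point 10 m away maps to a disparity of 36 px, and converting back recovers 10 m.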
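Evaluation metrics are as follows; they indicate the error and performance of the predicted monocular depth. Let $p_i$ and $g_i$ denote the predicted and ground-truth depth at pixel $i$.
ARD (absolute relative difference): $\frac{1}{N} \sum_{i=1}^{N} |p_i - g_i| / g_i$
SRD (squared relative difference): $\frac{1}{N} \sum_{i=1}^{N} |p_i - g_i|^2 / g_i$
RMSE: $\sqrt{\frac{1}{N} \sum_{i=1}^{N} (p_i - g_i)^2}$
RMSE (log): $\sqrt{\frac{1}{N} \sum_{i=1}^{N} (\log p_i - \log g_i)^2}$
Accuracy = % of $p_i$ such that $\max(p_i / g_i, g_i / p_i) = \delta < thr$
Here $N$ is the number of pixels that are not empty in the depth ground truth.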
To compare with other works in a consistent manner, we only evaluate on the cropped region proposed by Eigen et al. Also, since previous methods restrict the depth to different ranges for evaluation, we provide our results using both the cap of 0-80m (following Eigen et al.) and that of 1-50m (following Garg et al.). This requires discarding the pixels whose depth is outside the given range.
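A minimal sketch of how such an evaluation could be computed is given below. It assumes the error metrics commonly used with the Eigen et al. protocol (absolute and squared relative difference, RMSE, log RMSE, and threshold accuracy at 1.25, 1.25² and 1.25³). The function name, the dictionary keys, and the clipping of predictions into the evaluated range are assumptions, and the evaluation crop is omitted.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=0.0, max_depth=80.0):
    """Depth error metrics over pixels with a valid label inside the cap."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Keep only non-empty labels whose depth lies in [min_depth, max_depth].
    mask = (gt > 0) & (gt >= min_depth) & (gt <= max_depth)
    g = gt[mask]
    # Clipping predictions is a common practice and an assumption here.
    p = np.clip(pred[mask], 1e-3, max_depth)
    ratio = np.maximum(p / g, g / p)
    return {
        "ARD": np.mean(np.abs(p - g) / g),
        "SRD": np.mean((p - g) ** 2 / g),
        "RMSE": np.sqrt(np.mean((p - g) ** 2)),
        "RMSE_log": np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2)),
        "acc_1.25": np.mean(ratio < 1.25),
        "acc_1.25^2": np.mean(ratio < 1.25 ** 2),
        "acc_1.25^3": np.mean(ratio < 1.25 ** 3),
    }
```

Calling `depth_metrics(pred, gt, min_depth=1.0, max_depth=50.0)` would correspond to the 1-50m cap, while the defaults correspond to the 0-80m cap.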
The training of the model is divided into two stages. In the first stage, we train the two networks separately for their respective purposes. In the second stage, we combine the two parts and further finetune the whole pipeline in an end-to-end fashion. The training is conducted using the Caffe framework.
In the first stage, the networks are trained separately. For the training of the view synthesis network, 22600 stereo pairs from KITTI are used. We select VGG16 as the baseline network and initialize its weights from a model pre-trained on ImageNet. All other weights are initialized following the same scheme in . Compared with the original Deep3D model, we make some modifications to make it suitable for the view synthesis task on the KITTI dataset. First, the input size is larger, and it retains the aspect ratio of the original KITTI images. Second, one more convolution layer is employed before the deconvolution in each branch. Third, since the disparity range differs between KITTI and the 3D movie dataset, we change the possible disparity range. A 65-channel probabilistic map representing possible disparities from 0 to 64 now forms the final features. Last, to accommodate the larger inputs and the deeper network structure, we decrease the batch size to 2 and remove the original BatchNorm layers of the Deep3D model. The model is trained for 200K iterations with an initial learning rate of 0.0002. For the training of the DispFullNet used for stereo matching, we follow the training scheme of Pang et al. The model is trained mainly on the synthetic FlyingThings3D dataset and optionally finetuned on the KITTI stereo training set. This KITTI stereo training set contains 200 stereo pairs with relatively high-quality disparity labels, and it has no overlap with the test data of the KITTI Eigen test set. The detailed settings can be found in the paper of Pang et al.
In the second stage, the two networks with pre-trained weights are trained end-to-end. A small amount of data from the KITTI Eigen training set with ground truth disparity labels is used to finetune the whole pipeline. Since the input to the stereo matching network has a larger dimension, upsampling is performed inside the network to enlarge the synthetic right view produced by the first stage.
Data augmentation is optionally applied in both stages. The input is randomly resized to a dimension slightly greater than the desired input size, then cropped to the desired size and fed into the network. The color intensity is also multiplied by a random factor between 0.8 and 1.2.
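The augmentation described above can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the function name, the maximum resize factor of 1.1, the nearest-neighbour interpolation and the 8-bit intensity range are assumptions. For a stereo pair, the same random parameters would have to be applied to both views.

```python
import numpy as np

def augment(image, out_h, out_w, rng, max_scale=1.1):
    """Random resize (slightly larger than the target), random crop, colour gain.

    image: (H, W, C) array with intensities in [0, 255] (assumed range).
    rng:   a numpy.random.Generator.
    """
    scale = rng.uniform(1.0, max_scale)
    new_h = max(out_h, int(round(out_h * scale)))
    new_w = max(out_w, int(round(out_w * scale)))
    # Nearest-neighbour resize by index lookup (interpolation is an assumption).
    rows = np.arange(new_h) * image.shape[0] // new_h
    cols = np.arange(new_w) * image.shape[1] // new_w
    resized = image[rows][:, cols]
    # Random crop back to the desired size.
    top = rng.integers(0, new_h - out_h + 1)
    left = rng.integers(0, new_w - out_w + 1)
    crop = resized[top:top + out_h, left:left + out_w]
    # Multiply the colour intensity by a factor between 0.8 and 1.2.
    gain = rng.uniform(0.8, 1.2)
    return np.clip(crop.astype(np.float64) * gain, 0.0, 255.0)
```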
First, we evaluate depth estimation by the stereo matching network when it is given perfect right images. The result is shown in Table 1, denoted as “Stereo_gt_right”. The stereo matching network clearly outperforms state-of-the-art methods for single image depth estimation, even though it is trained mainly on a rendered dataset.
The intuition here is that predicting depth from stereo images has a much higher accuracy than predicting depth by any of the previous monocular depth methods. This means we are able to achieve much higher performance if we can provide a sophisticated view synthesis module.
Next, results on the KITTI Eigen split dataset are reported when the right images are predicted by our view synthesis network. Results are compared to six recent baseline methods as shown in Table 1: [4, 24] are supervised methods, one is a semi-supervised method, and [10, 38, 8] are unsupervised methods. Our proposed method is also a semi-supervised method.
Result without end-to-end finetuning: After the training of both networks has converged, we directly feed the right image synthesized by the view synthesis network to the stereo matching network to predict the depth for the given left images. The result is reported in Table 1.
As one can see, even without finetuning the whole network on the KITTI dataset, our method performs better than the unsupervised method and achieves performance comparable to the state-of-the-art semi-supervised method. This demonstrates that decoupling monocular depth estimation into two separate sub-problems is simple yet effective, because it explicitly enforces geometric constraints, which are critical for estimating depth from images.
Result with end-to-end finetuning: We further finetune the whole system with a small amount of training data from the KITTI Eigen split training set, i.e. 700 training samples. The left images, right images and depth images are used as training samples for our proposed method.
The results are reported in Table 1. As one can see, our method outperforms all compared methods, with the ARD metric reduced by 17.5% compared with Godard et al. and by 16.8% compared with Kuznietsov et al. at the cap of 80 m. Our proposed method performs best on almost all metrics. This shows that end-to-end training further optimizes the collaboration of the two sub-networks and leads to the state-of-the-art result. Qualitative comparisons are shown in Figure 3. Our proposed method also produces visually much more accurate estimations than the compared methods.
Qualitative results on the KITTI Eigen test set. Sparse ground-truth labels have been interpolated for visualization. Note that the prediction of our method better separates background from foreground and different entities that are close to each other. Also, our results are crisper and neater. In addition, we do better on objects such as trees, poles, traffic signs and pedestrians, whose depths are generally hard to infer accurately.
In this section, we analyze the function of the two sub-networks after end-to-end training. If end-to-end training broke the original functionality of the two sub-networks while the overall performance increased, the whole network would be overfitted to the KITTI dataset, which would make it hard to generalize to other datasets or scenes. To examine the function of the two sub-networks, we conduct the following two groups of experiments.
Analyzing the function of the view synthesis sub-network: We replace the stereo matching sub-network in the finetuned network with the one before finetuning. The pre-trained stereo matching sub-network has only been trained to perform stereo matching on real left-right pairs. Therefore, if the whole network still performs well on single image depth estimation after the replacement, the view synthesis network has retained its original functionality through the finetuning process.
The results are reported in the top three rows of Table 2, denoted as “Finetuned_synthesis_K”, where K represents the number of training samples. As one can see from Table 2, the results of “Finetuned_synthesis_K” outperform the method without finetuning. From another perspective, the average PSNR between synthesized views and ground truth views in the test set increases from 21.29dB to 21.32dB after finetuning. The functionality may be preserved because, during the finetuning process, the stereo matching sub-network acts as another loss that better constrains the view synthesis sub-network to generate geometrically reasonable right images.
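For reference, PSNR between a synthesized view and the ground-truth view can be computed as in the short sketch below. It uses the standard definition 10·log10(peak² / MSE); the function name and the peak value of 255 for 8-bit images are assumptions, and the paper does not state how the per-image values are averaged.

```python
import numpy as np

def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Since the scale is logarithmic, the reported change from 21.29dB to 21.32dB corresponds to only a small reduction in mean squared error, which is consistent with the claim that the synthesis quality is preserved rather than substantially altered.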
Analyzing the function of the stereo matching sub-network: In order to validate the function of the stereo matching sub-network after end-to-end training, we test the finetuned stereo matching sub-network by providing the true left and right images as inputs to predict the depth.
The results are provided in the middle three rows of Table 2, denoted as “Finetuned_stereo_gt_right_K”. As shown in Table 2, “Finetuned_stereo_gt_right_200” performs slightly worse than “Finetuned_stereo_gt_right_0”. This may be because the finetuning process has forced the stereo matching sub-network to fit the imperfect synthesized right images better. However, “Finetuned_stereo_gt_right_700” outperforms the pre-trained stereo matching sub-network. The high stereo matching performance clearly demonstrates that the stereo matching network maintains its functionality after end-to-end finetuning.
Combining the above two groups of experiments, we conclude that after end-to-end training the two sub-modules collaborate more effectively while preserving their individual functionalities. This may imply that our proposed method can generalize well to other datasets. Some qualitative results on the Cityscapes dataset and the Make3D dataset, estimated by our method finetuned on the KITTI dataset, are shown in Figure 5. The results demonstrate the generalization ability of our proposed method on unseen scenes.
Our view synthesis network produces a primitive disparity in order to do the rendering. The middle part of Table 3 shows the estimation accuracy calculated from this probabilistic disparity map. The result is much inferior to the final result of our proposed method, which shows that our approach indeed makes a great improvement over the primitive disparity.
To study the effectiveness of our proposed method, we also evaluate it when finetuned with different numbers of samples, i.e., 0, 200, 500 and 700, named “Finetune-K”. Note that when K equals 0, no finetuning is performed on the whole network.
The results are reported in the bottom three rows of Table 2. As one can see from the results, using more end-to-end finetuning samples achieves higher performance, and our proposed method outperforms previous state-of-the-art methods by a clear margin using only 700 samples to finetune the whole network.
As described before, we use 200 high-quality KITTI labels to optionally finetune the stereo matching network. In the lower part of Table 3, we present the results obtained without these labels, before and after finetuning (denoted _BF and _AF). We can see that without seeing any real disparity from KITTI, our method already achieves promising results. After finetuning without those high-quality labels, our method still beats the current state-of-the-art method. These high-quality labels do increase the capacity of the model to a certain extent, but even without them our method still makes an improvement under the same conditions.
| Godard et al. | 27.00 | 28.24 | 27.21 |
In this section, we compare the proposed approach for depth estimation from single images with stereo matching methods that operate on stereo images. The results are summarized in Table 4. As one can see, our method is the first single image depth estimation approach that surpasses a traditional stereo matching method, i.e. the block matching method denoted as “OCV-BM” in the table. Exemplar visual results are shown in Fig. 4. Because the block matching method directly uses low-level image features to search for matched pixels in the left and right images, the disparity maps it predicts are usually noisy, which greatly degrades its performance, although the results are still geometrically correct. In our network, the geometric reasoning capacity is built in and high-level image features are processed by the deep network; these two properties enable our method to outperform the stereo matching method. Due to the lack of explicit geometric constraints, the method of Godard et al. obtains sub-optimal results. The better performance of our method can be seen in the boxed regions of the figure.
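To make the notion of block matching concrete, the sketch below implements a naive winner-takes-all search with a sum-of-absolute-differences cost on rectified grayscale images. It is only an illustration of the general technique, not the OpenCV implementation behind “OCV-BM”; the function name, the window size and the edge padding are assumptions.

```python
import numpy as np

def block_match(left, right, max_disp, half_window=2):
    """Naive SAD block matching; returns an integer disparity per left pixel."""
    h, w = left.shape
    k = half_window
    lp = np.pad(left.astype(np.float64), k, mode="edge")
    rp = np.pad(right.astype(np.float64), k, mode="edge")
    disparity = np.zeros((h, w), dtype=np.int64)
    for i in range(h):
        for j in range(w):
            patch = lp[i:i + 2 * k + 1, j:j + 2 * k + 1]
            best_cost, best_d = np.inf, 0
            # The match of left column j lies at right column j - d.
            for d in range(min(max_disp, j) + 1):
                cand = rp[i:i + 2 * k + 1, j - d:j - d + 2 * k + 1]
                cost = np.abs(patch - cand).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[i, j] = best_d
    return disparity
```

Each pixel is decided independently from raw intensities in a small window, which is why such methods tend to be noisy in textureless or repetitive regions while still respecting the epipolar geometry.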
In this work, we propose a novel perspective for tackling the problem of monocular depth estimation. We show for the first time that this problem can be decomposed into two problems, namely a view synthesis problem and a stereo matching problem. We explicitly encode the geometric transformation within both networks to better tackle the problems individually. Collectively training the whole pipeline results in an overall boost, and we show that both networks preserve their original functionality after end-to-end training. Without using a large amount of expensive ground truth labels, we outperform all previous methods on a monocular depth estimation benchmark. Remarkably, we are the first to outperform the stereo block matching algorithm on a stereo matching benchmark using a monocular method.
Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In CVPR, 2015.
Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In ECCV, 2016.