Semi-dense Stereo Matching using Dual CNNs

by   Wendong Mao, et al.

A robust solution for semi-dense stereo matching is presented. It utilizes two CNN models for computing stereo matching cost and performing confidence-based filtering, respectively. Compared to existing CNN-based matching cost generation approaches, our method feeds additional global information into the network so that the learned model can better handle challenging cases, such as lighting changes and lack of textures. By utilizing non-parametric transforms, our method is also more self-reliant than most existing semi-dense stereo approaches, which rely heavily on parameter adjustment. The experimental results on the Middlebury Stereo dataset demonstrate that the proposed approach outperforms the state-of-the-art semi-dense stereo approaches.




1 Introduction

Humans rely on binocular vision to perceive 3D environments. Even though it is a passive system, our brains can still estimate 3D information more rapidly and robustly than many active or passive sensors that have been developed. One of the reasons is that brains can utilize prior knowledge to understand the scene and to infer the most reasonable depth hypothesis even when the visual cues are lacking. Recent advances in machine learning have shown that the brain’s discrimination power can be mimicked using deep convolutional neural networks (CNNs). Hence, one has to wonder how CNNs can be used to enhance traditional stereo matching algorithms.

Approaches have been proposed for generating matching cost volumes (a.k.a. disparity space images) using CNNs [21, 39, 42]. While inspiring results are generated, these existing approaches are not robust enough for handling challenging and ambiguous cases, such as lighting changes and lack of textures. Heuristically defined post-processing steps are often applied to correct mismatches. Our hypothesis is that the performance of CNNs can be noticeably improved if more information is fed into the network. Hence, instead of trying to correct mismatches as post-processing, we introduce, in the pre-processing step, image transforms that are robust against lighting changes and can add distinguishable patterns to textureless areas. The outputs of these transforms are used as additional information channels, together with grayscale images, for training a matching CNN model.

The experimental results show that the model learned can effectively separate correct stereo matches from mismatches so that accurate disparity maps can be generated using the simplest Winner-Take-All (WTA) optimization.

Learning-based approaches were also proposed to compute confidence measures for the generated disparity values so that mismatches can be filtered out [4, 32, 39]. Following this idea, a second CNN model is designed to evaluate the disparity map generated through WTA. Trained with only one input image and the disparity map, this evaluation CNN model can effectively filter out mismatches and produce accurate semi-dense (a.k.a. sparse) disparity maps.

Figure 1 shows the pipeline of the whole process. Since both matching cost generation and disparity confidence evaluation are performed using learning-based approaches, the algorithm contains very few handcrafted parameters. The experimental results on the Middlebury 2014 stereo dataset [29] demonstrate that the present dual-CNN algorithm outperforms most existing sparse stereo techniques.

Figure 1: Semi-dense stereo matching pipeline. Given a pair of rectified images, how well a pair of image patches match is evaluated using a matching-CNN model. The results form a matching cost volume, from which a disparity map is generated using simple WTA optimization. Finally, an evaluation-CNN model is applied to filter out mismatches.

2 Related Work

Stereo matching algorithms can be briefly categorized into two classes: dense and sparse stereo matching [18, 30]. Dense stereo matching algorithms assign disparity values to all pixels, whereas sparse matching approaches only output disparities for pixels with sufficient visual cues. A typical pipeline implemented in most dense and sparse stereo matching algorithms consists of matching cost computation, cost aggregation, disparity computation and optimization, and refinement steps [30]. Here, we focus our discussion on sparse stereo matching.

An early work on sparse stereo matching calculated disparities for distinctive points first and gradually generated disparity values for the remaining pixels [19]. Graph cuts were later introduced in [37] to detect textured areas as an alternative to unambiguous points and generate corresponding semi-dense results. By design, their approach can filter out mismatches caused by lack of textures, but not by occlusions, etc. This limitation is addressed in Semi-Global Matching (SGM) [11], in which multiple 1D constraints were used to generate accurate semi-dense results based on peak removal and consistency checks. Gong and Yang [7] proposed a reliability measure to detect potential mismatches from disparity maps generated using Dynamic Programming (DP). This work was later extended and implemented on graphics hardware for real-time performance [8]. Psota et al. [28] utilized Hidden Markov Trees (HMT) to create minimum spanning trees based on color information, which allows aggregated costs to be passed along the tree branches, and the isolated mismatches were later removed by median filtering.

Generally, the above algorithms utilize additional constraints in the cost aggregation and/or disparity computation step to improve the accuracy of sparse stereo matching. Instead of designing new constraints or assumptions, we train CNNs for both generating aggregated cost volumes and detecting potential mismatches. The disparity computation step, on the other hand, is performed using the simplest WTA approach.

2.1 Stereo Matching Cost

Traditionally, sum of absolute differences (SAD), sum of squared differences (SSD), and normalized cross-correlation (NCC) have been commonly used for calculating and aggregating matching costs [30]. These window-based matching techniques, which rely on the local intensity values, may not behave well near discontinuities in disparity. Zabih and Woodfill [40] therefore proposed two non-parametric local transforms, referred to as rank and census transforms, to address the correspondences at the boundaries of objects. A recent attempt tried to combine different window-based matching techniques for stereo matching [1].
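To make the three window-based costs concrete, the following is a minimal sketch on two small patches. The function names are illustrative and not taken from any cited implementation, and the NCC shown is the mean-subtracted variant, which is an assumption about which form is meant.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences: lower means more similar."""
    return float(np.abs(a - b).sum())

def ssd(a, b):
    """Sum of squared differences: lower means more similar."""
    return float(((a - b) ** 2).sum())

def ncc(a, b, eps=1e-8):
    """Mean-subtracted normalized cross-correlation: higher means more similar."""
    a0 = a - a.mean()
    b0 = b - b.mean()
    denom = np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum()) + eps
    return float((a0 * b0).sum() / denom)

# The same pattern seen under a gain and offset change (hypothetical values).
left_patch = np.array([[10.0, 20.0], [30.0, 40.0]])
right_patch = left_patch * 1.5 + 5.0

print(sad(left_patch, right_patch), ssd(left_patch, right_patch))
print(ncc(left_patch, right_patch))
```

SAD and SSD report a large cost for the brightened patch even though the pattern is identical, while the normalized correlation stays close to 1. This illustrates why intensity-difference costs are sensitive to lighting changes.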

In the past few years, various works were proposed to generate matching cost volumes that can better differentiate correct matches from mismatches. The most encouraging direction is using ground-truth data [10, 34, 36, 41] to train various neural networks to learn local image features. More recent works mostly opted for CNNs trained by ground-truth data to predict the likeness for each potential match based on fixed windows as in [42]. The matching costs were set using the output of CNNs directly.

Due to the successful practice of using CNNs, stereo matching algorithms have been progressively improved over the past three years. Zhang et al. [43] used CNNs and SGM to generate initial disparity maps and further combined Left Right Difference (LRD) [12] with disparity distance regarding local planes to perform a confidence check. In addition, they adopted segmentation and surface normals within the post-processing to enhance the reliability of disparity estimation. To fully utilize the ability of CNNs in terms of feature extraction, Park and Lee [21] proposed a revised CNN model based on a large pooling window between convolutional layers for wider receptive fields to compute the matching cost, and they performed a post-processing pipeline similar to the one introduced in [42]. Another model revision, similar to Park and Lee’s work [21], was introduced by Ye et al. [39], which used a multi-size and multi-layer pooling scheme to take wider neighboring information into consideration. Moreover, a disparity refinement CNN model was later demonstrated in their post-processing to blend the optimal and suboptimal disparity values. Both of the above revisions presented solid results in image areas with little or no texture, disparity discontinuities, and occlusions.

Attempts were also made to train end-to-end deep learning architectures for predicting disparity maps from input images directly, without the need to explicitly compute the matching cost volume [3, 14, 20]. As a result, these end-to-end models are efficient but require a larger amount of GPU memory than the previous patch-based approaches. More importantly, these models were often trained on stereo datasets with specific image resolutions and disparity ranges and hence cannot be applied to other input data. They also restrict the feasibility of training CNNs to concurrently preserve geometric and semantic similarity proposed in [5, 31, 38].

2.2 Confidence Measure

Once dense disparity results are generated, confidence measures can be applied to filter out inaccurate disparity values in the disparity refinement step. Quantitative evaluations of traditional confidence measures were presented by Hu and Mordohai [12], and the most recent review was given by Poggi et al. [27]. Here, we mainly present a few significant works.

Haeusler et al. [9] proposed a random decision forest framework, which combines multivariate confidence measures to improve error detection. Park and Yoon [22] suggested a regression forest framework for selecting effective confidence measures. Based on SGM [11], Poggi and Mattoccia [24] used O(1) features and machine learning to implement an improved scanline aggregation strategy, which performs streaking detection on each path in [11] to measure confidence. They further proposed using a CNN model to enforce local consistency on confidence maps [26]. Recently, CNNs have been applied to confidence measurement. Our approach is similar to these recent works [4, 32, 39], which compute confidences through training 2D CNNs on 2D image and/or disparity patches. A key difference, however, is that only the left image and its raw disparity map generated by WTA are used to train our confidence CNN model, whereas existing approaches require generating both left and right disparity maps.

3 Methodology

We aim at developing a robust and learning-based stereo matching approach. We observed that, for many applications, it is more important to ensure the accuracy of output than to generate disparity values for all pixels. Hence, we here focus on assigning disparity values only for pixels with sufficient visual cues.

As shown in Figure 1, two CNN models, referred to as matching-Net and evaluation-Net, are utilized in our approach: matching-Net is constructed as the substitution of the matching cost computation and aggregation steps, and outputs matching similarity for each pixel pair; evaluation-Net performs confidence measurement on the raw disparity maps generated by WTA based on the similarity scores.

Figure 2: Comparison between the baseline architecture “MC-CNN-acrt” in [42] and the proposed model matching-Net. The left and right image patches for the latter are selected from an image collection, which includes not only grayscale images, but also channels generated by non-parametric transforms. In addition, the concatenation operation is replaced by 3D convolution, which can separately group different transforms by adjusting stride size in the third dimension; see Section 4 for model configuration.

3.1 Matching-Net

Our matching-CNN model serves the same purpose as the “MC-CNN-acrt” in [42], but there are several key differences; see Figure 2. First of all, we choose to feed the neural network with additional global information (i.e., results of non-parametric transformations) that is difficult to generate through convolutions. Secondly, 3D convolution networks are employed, which we found can improve the performance. It is worth noting that our approach is also different from other attempts to improve “MC-CNN-acrt”, which use very large image patches and multiple pooling sizes [21, 39]. These approaches require an extensive amount of GPU memory, which limits their usage. In order to feed global information into the network trained on small patches, our strategy is to perform non-parametric transforms.

3.1.1 Lighting Difference

For robust stereo matching, lighting difference as an external factor cannot be neglected. To address this factor, “MC-CNN-acrt” manually adjusted the brightness and contrast of image pairs to generate extra data for training. However, lighting differences may vary from one dataset to another, making it hard to train a model that is robust against all cases.

Aiming for an approach with less human intervention, here we propose using rank transform to tolerate lighting variations between image pairs. As a non-parametric local transform, rank transform was first introduced by Zabih and Woodfill [40] to achieve better visual correspondence near discontinuities in disparity. This endows stereo algorithms based on rank transform with the capability to perform similarity estimation for image pairs with different lighting conditions.

Figure 3: Results of “MotorE” dataset (left image on top row and right image on bottom) using rank transform under different neighborhood sizes; panels (a) and (e) show the grayscale images. Larger windows generally lead to smoother results, but at the expense of losing subtle information.

The rank transform for a pixel $p$ in image $I$ is computed as:


where $S$ is a set containing pixels within a square window centered at $p$, and $|S|$ is the size of set $S$. Figure 3 shows the results of rank transform under different window sizes.
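A minimal sketch of a rank transform is given below, assuming the standard Zabih and Woodfill definition (count the window pixels darker than the center) and assuming the count is normalized by the window size. The function name, the `radius` parameter, and the edge padding are illustrative choices, not details confirmed by the text.

```python
import numpy as np

def rank_transform(gray, radius=1):
    """Fraction of pixels in a square window that are darker than the center.

    Assumptions: square window of side 2*radius+1, edge-replicated borders,
    and normalization by the window size.
    """
    h, w = gray.shape
    padded = np.pad(gray, radius, mode="edge")
    count = np.zeros((h, w), dtype=np.float64)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            neighbor = padded[dy:dy + h, dx:dx + w]
            count += neighbor < gray  # the center never counts against itself
    return count / (2 * radius + 1) ** 2

image = np.arange(25, dtype=np.float64).reshape(5, 5)
brighter = image * 2.0 + 10.0  # monotonic lighting change (hypothetical)
print(np.allclose(rank_transform(image), rank_transform(brighter)))
```

Because the transform depends only on the ordering of intensities within the window, any monotonically increasing change in brightness or contrast leaves the result unchanged.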

Figure 4: Results of companion transform under different neighborhood sizes for the “Jadepl” dataset (left image on top row and right image on bottom); panels (a) and (e) show the grayscale images. The transformation results are brightened for better viewing. The results show that companion transform successfully adds distinguishable features to texture-less areas; see regions highlighted.

3.1.2 Low Texture

Besides lighting variations, low or absent texture poses another challenge for stereo matching. For a given pixel within texture-less regions, the best way to estimate its disparity is based on its neighbors that have similar depth but are in texture-rich areas (i.e., have sufficient visual cues for accurate disparity estimation). Traditional stereo algorithms [30] mostly utilize cost aggregation, segmentation-based matching, or global optimization for disparity computation to handle ambiguous regions. As mentioned above, our intention is to feed the neural networks with global information. Hence, a novel companion transform is designed and applied in the pre-processing step.

The idea of companion transform is inspired by SGM [11], which suggests performing smoothness control by minimizing an energy function on 16 directions. In our case, we want to design a transformation that can add distinguishable features to texture-less areas. Hence, for a given pixel $p$, we choose to count the number of pixels that: 1) have the same intensity as $p$ and 2) lie on one of the rays starting from $p$. We refer to these pixels as $p$’s companions and to the transform as the companion transform. In practice, we found 8 ray directions (left, right, up, down, and 4 diagonal directions) work well, though other settings (4 or 16 directions) can also be used.


where $S$ is a set containing pixels on the rays starting from $p$.

Figure 5: Comparison among information carried in different channels (grayscale, rank transform, and companion transform). The curves are plotted based on the values of different pixels on the same row marked in blue in Figure 4. Left side shows the left view, with the position of the target pixel marked by red vertical lines. Right side shows the right view, where the red line shows the position of the correct corresponding pixel of the target pixel. Due to the lack of textures, neither the grayscale nor the rank transform channels provide distinguishable patterns for matching. The companion transform can add information that is useful for the matching-CNN.

Figure 4 shows the results of companion transform under different window sizes. Figure 5 further illustrates how the companion transform result adds distinguishable patterns to a pixel in a texture-less area.
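The counting rule described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the `ray_length` parameter (standing in for the neighborhood size), the unnormalized count, and exact intensity equality are all assumptions.

```python
import numpy as np

# Eight ray directions: up, down, left, right, and the four diagonals.
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]

def companion_transform(gray, ray_length=3):
    """Count, per pixel, the same-intensity pixels lying on rays from it.

    Assumptions: rays are limited to ray_length steps, pixels outside the
    image are ignored, and intensities must match exactly.
    """
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.int32)
    for y in range(h):
        for x in range(w):
            for dy, dx in DIRECTIONS:
                for step in range(1, ray_length + 1):
                    yy, xx = y + dy * step, x + dx * step
                    if 0 <= yy < h and 0 <= xx < w and gray[yy, xx] == gray[y, x]:
                        out[y, x] += 1
    return out

flat = np.zeros((5, 5))  # a texture-less patch (hypothetical input)
print(companion_transform(flat, ray_length=2))
```

On a perfectly flat patch, the count is largest in the interior and falls off toward the borders of the uniform region, so pixels that look identical in grayscale receive different values depending on where they sit inside the region.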

3.1.3 Training Data

To train our CNN model, the 15 image pairs from the Middlebury 2014 stereo training dataset [29], which contains examples of lighting variations and texture-less areas, are utilized. Each input image is first converted to grayscale before applying rank and companion transforms. The outputs of the two transforms, together with the grayscale images, form multi-channel images. Each training sample contains a pair of image patches, one centered at a pixel in the left image and the other at a candidate corresponding pixel in the right image. The input sample is assembled into a 3D matrix whose size is determined by the patch size and the number of channels in the multi-channel image. The ground-truth disparity values provided by the dataset are used to select matched samples, and random disparity values other than the ground truth are used to generate mismatched samples. Similar to [42], we sample matching hypotheses so that the same number of matches and mismatches are used for training. The proposed matching-Net is then trained to output value “0” for correct matches and “1” for mismatches.
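The sampling scheme above can be sketched as below: for one reference pixel, one matched and one mismatched patch pair are produced, labeled 0 and 1 respectively. The helper names, the convention that a left pixel at column x corresponds to column x - d in the right image, and the bounds handling are assumptions made for illustration.

```python
import numpy as np

def extract_patch(channels, x, y, half):
    """Cut a (2*half+1, 2*half+1, n_channels) patch centered at (x, y)."""
    return channels[y - half:y + half + 1, x - half:x + half + 1, :]

def make_pair_samples(left, right, gt_disp, x, y, half, max_disp, rng):
    """Return one matched (label 0) and one mismatched (label 1) sample.

    Assumes the caller picks (x, y) far enough from the image borders.
    """
    d_true = int(gt_disp[y, x])
    # Candidate wrong disparities: anything but the ground truth that
    # keeps the right patch inside the image.
    wrong = [d for d in range(max_disp + 1)
             if d != d_true and x - d - half >= 0]
    d_false = int(rng.choice(wrong))
    ref = extract_patch(left, x, y, half)
    match = (ref, extract_patch(right, x - d_true, y, half), 0)
    mismatch = (ref, extract_patch(right, x - d_false, y, half), 1)
    return match, mismatch

rng = np.random.default_rng(0)
left = rng.random((9, 20, 3))         # three channels, e.g. gray/rank/companion
right = np.roll(left, -4, axis=1)     # synthetic pair with constant disparity 4
gt_disp = np.full((9, 20), 4)
match, mismatch = make_pair_samples(left, right, gt_disp, 12, 4, 2, 6, rng)
print(match[0].shape, match[2], mismatch[2])
```

Drawing exactly one mismatch for every match keeps the two classes balanced, as described in the text.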

3.2 Disparity Computation

For each novel stereo image pair, the matching-Net trained above is used to generate a 3D cost volume $C$, where the value at location $(x, y, d)$ stores the cost of matching pixel $(x, y)$ in the left image with pixel $(x - d, y)$ in the right image. The higher the value, the more likely the corresponding pair of pixels is a mismatch, since the network is trained to output “1” for mismatches. Unlike many existing approaches that resort to complex and heuristically designed cost aggregation and disparity optimization approaches [30], here we rely on the learning network to distinguish correct matches from mismatches. Expecting the correct matches to have the smallest values in the cost volume $C$, the simplest WTA optimization is applied to compute the raw disparity map $D$:

$$D(x, y) = \arg\min_{d} C(x, y, d)$$
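Winner-Take-All selection over a cost volume reduces to a per-pixel argmin. The sketch below assumes the volume is stored as a NumPy array indexed `[y, x, d]`; that layout is a choice made here for illustration.

```python
import numpy as np

def winner_take_all(cost_volume):
    """Pick, for every pixel, the disparity with the lowest matching cost.

    cost_volume has shape (height, width, n_disparities).
    """
    return np.argmin(cost_volume, axis=2)

# A 1x2 image with three disparity hypotheses per pixel (hypothetical costs).
cost_volume = np.array([[[0.9, 0.1, 0.8],
                         [0.2, 0.7, 0.6]]])
print(winner_take_all(cost_volume))
```

No smoothness term or neighborhood aggregation is involved, so the quality of the result depends entirely on how well the learned costs separate matches from mismatches.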


3.3 Evaluation-Net

The matching-Net is trained to measure how well two image patches, one from each stereo image, match. It makes decisions locally and does not check the consistency among best matches found for neighboring pixels. When the raw disparity maps are computed by local WTA, they inevitably contain mismatches, especially in occluded and low-textured areas. To filter out these mismatches, we construct another CNN model, evaluation-Net, to implement consistency checks and perform confidence measurement.

Learning-based confidence measures have been successfully applied to detecting mismatches and further improving the accuracy of stereo matching [27]. Similar to the 2D CNN model for error detection proposed in [39], only left images and their disparity maps are selected to train our model. A key difference, however, is that no handcrafted operation is involved in our approach to fuse left and right disparity maps. In addition, the network contains both 2D and 3D convolutional layers to effectively identify mismatches from disparity maps; see Figure 6. 3D convolution is adopted here to allow the network to learn from the correlation between pixels’ intensities and disparity values.

Figure 6: Architecture used for the evaluation-Net. The image patches here are generally bigger than the ones used in the matching-Net. Therefore, multiple pooling layers are added for efficiency. Detailed model configuration can be found in Section 4.

The evaluation-Net is trained using both matches and mismatches in the estimated disparity maps $D$ for all training images. Mismatches are identified by comparing $D$ with the ground-truth disparity maps $G$. Here, a pixel $(x, y)$ is considered mismatched iff

$$|D(x, y) - G(x, y)| > \tau$$

where $\tau$ is a commonly used threshold value given in pixels; see Figure 7(b-c).

In the estimated disparity map, the majority of pixels have correct disparity values, resulting in many more positive (accurately matched) samples than negative (mismatched) samples. Hence, we collect and use all negative samples and randomly select the same number of positive samples. For each selected sample, we extract grayscale and estimated disparity values from patches centered at the sample to form a matrix. The evaluation-Net is then trained to output value “0” for negative samples and “1” for positive samples. The output of the evaluation-Net can then be used to filter out potential mismatches, i.e., pixels that achieve scores lower than a confidence threshold.
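The labeling, balancing, and filtering steps described above can be sketched as three small functions. The threshold values, the `-1.0` marker for invalid pixels, and all function names are assumptions chosen for illustration; the confidence map here is a hand-written stand-in for the evaluation-Net output.

```python
import numpy as np

def label_mismatches(est_disp, gt_disp, tau):
    """True where the estimate differs from ground truth by more than tau."""
    return np.abs(est_disp - gt_disp) > tau

def balanced_sample_coords(mismatch_mask, rng):
    """Keep all negative (mismatch) pixels and as many random positive ones."""
    neg = np.argwhere(mismatch_mask)
    pos = np.argwhere(~mismatch_mask)
    n = min(len(neg), len(pos))
    pos = pos[rng.choice(len(pos), size=n, replace=False)]
    return pos, neg[:n]

def filter_disparity(est_disp, confidence, threshold, invalid=-1.0):
    """Mark pixels whose confidence is below the threshold as invalid."""
    out = est_disp.astype(np.float64).copy()
    out[confidence < threshold] = invalid
    return out

est = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
gt = np.array([[1.0, 2.0, 9.0], [4.0, 5.0, 6.0]])
mask = label_mismatches(est, gt, tau=1.0)
pos, neg = balanced_sample_coords(mask, np.random.default_rng(0))
conf = np.array([[0.9, 0.8, 0.1], [0.9, 0.9, 0.9]])  # stand-in network output
semi_dense = filter_disparity(est, conf, threshold=0.5)
print(mask, len(pos), len(neg), semi_dense, sep="\n")
```

Raising the confidence threshold removes more pixels, which trades output density for accuracy.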

Figure 7: Training samples. Pixels in a given raw disparity map (a) are classified into mismatches (b) and accurate matches (c) using ground-truth disparity.

4 Experimental Results

In this section, we present the “hyperparameters” [42] for both of the proposed CNN models, followed by a set of performance evaluations. The goal of the evaluations is to find out: 1) whether the non-parametric transforms can help improve the accuracy of the disparity maps generated using the matching-Net; and 2) how well the overall dual-CNN approach performs compared to the state-of-the-art sparse stereo matching techniques.

matching-Net                                         evaluation-Net
Attributes   Kernel size, quantity   Stride size     Attributes   Kernel size, quantity   Stride size
Input        , 1                                     Input        , 1
Conv1(2D)    , 32                                    Conv1(2D)    , 16
Conv2(3D)    , 128                                   Mp1          , 16
Conv3(3D)    , 64                                    Conv2(2D)    , 32
FC1          1600                                    Mp2          , 32
FC2          128                                     Conv3(2D)    , 64
Output       2                                       Mp3          , 64
-            -                                       Conv4(3D)    , 128
-            -                                       Mp4          , 128
-            -                                       FC1          128
-            -                                       Output       2

Table 1: Hyperparameters of the matching-Net and evaluation-Net. Here, “Conv”, “Mp” and “FC” denote convolutional layers, max pooling layers, and fully connected layers, respectively.

Hyperparameters and implementations: The input of the matching-Net is a 3D matrix of stacked layers. Both the left and right images contribute the same set of layers: the grayscale image, a rank transform, and a companion transform. Different layers from the left and right images are stored in the matrix in alternating order. For the evaluation-Net, the input contains only two layers of data: one is the grayscale image and the other the raw disparity map, both from the left image. Table 1 shows the hyperparameters of our experimental models.
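The alternating layer order can be illustrated with a short sketch that interleaves the channels of a left and a right patch. The function name and the `(height, width, channels)` layout are assumptions; only the alternating order itself comes from the text.

```python
import numpy as np

def interleave_channels(left_patch, right_patch):
    """Stack two (h, w, c) patches into (h, w, 2c) ordered L0, R0, L1, R1, ..."""
    h, w, c = left_patch.shape
    stacked = np.stack([left_patch, right_patch], axis=3)  # shape (h, w, c, 2)
    return stacked.reshape(h, w, 2 * c)

# Each channel is filled with its own index so the ordering is easy to read.
left_patch = np.broadcast_to(np.arange(3.0), (2, 2, 3))
right_patch = left_patch + 10.0
combined = interleave_channels(left_patch, right_patch)
print(combined[0, 0])
```

With this ordering, the left and right versions of the same transform sit next to each other along the third dimension. A 3D convolution that steps two layers at a time along that dimension (a stride value assumed here for illustration) would therefore process each transform pair as its own group.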

The implementation of our CNN models is based on TensorFlow, using a classification cross-entropy loss computed on the output value. Here, the target is set to “1” for mismatches and “0” for matches to train the matching-Net as in “MC-CNN-acrt”, but to “1” for positive samples and “0” for negative samples to perform confidence measurement through the evaluation-Net. Both models utilize a learning rate that gradually decreases from 0.002 to 0.0001, and arrive at a stable state after running 20 epochs on the full training data.
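As a rough illustration of the training setup, the sketch below shows a two-class cross-entropy term and one possible learning-rate schedule. The text only states that the rate decreases gradually from 0.002 to 0.0001 over 20 epochs; the geometric decay used here, and both function names, are assumptions.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy for one sample given class probabilities."""
    return -math.log(probs[target_index])

def learning_rate(epoch, total_epochs=20, start=0.002, end=0.0001):
    """Geometric decay from start (epoch 0) to end (last epoch). Assumed shape."""
    frac = epoch / (total_epochs - 1)
    return start * (end / start) ** frac

# A confident correct prediction costs little; a confident wrong one costs a lot.
print(cross_entropy([0.9, 0.1], 0), cross_entropy([0.9, 0.1], 1))
print([round(learning_rate(e), 6) for e in (0, 10, 19)])
```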

Figure 8: Comparison on dense disparity maps generated by “MC-CNN-acrt” (b) and matching-Net (c). Top stereo image pair (“ArtL”) contains lighting condition changes, whereas the bottom one (“Recyc”) contains areas with low texture. Thanks to the rank and companion transforms, the disparity maps generated by our approach are much smoother and have fewer mismatches.

Effectiveness of non-parametric transforms: The overall structures of “MC-CNN-acrt” and matching-Net are quite similar. The key difference is that the input patches of “MC-CNN-acrt” are grayscale images only, whereas our matching-Net uses additional non-parametric transforms. Hence, to evaluate the effectiveness of non-parametric transforms, we here compare the raw disparity maps generated by the two approaches. Based on the same training dataset from Middlebury [29], Figure 8 visually compares the raw disparity maps generated by “MC-CNN-acrt” and matching-Net. It suggests that the additional transforms allow the network to better handle challenging cases. In terms of the mean percentage error (MPE) of non-occlusion areas at half resolution, our raw disparity maps also compare favorably with those of “MC-CNN-acrt”.

Figure 9: Comparison with the top ten approaches on the Middlebury Stereo Evaluation site [29]: SED [23], R-NCC (unpublished work), r200high [15], ICSG [33], SGM [11], DF (unpublished work), MotionStereo (unpublished work), IDR [17], TMAP [28] and SNCC [6]. Performances of different approaches on both training (a) and testing (b) datasets are plotted on a non-occlusion error rates vs. invalid pixel rates plot. The relative positions of these approaches on the two datasets are similar. On training datasets, where the ground truth disparity maps are available, we show the performance of our approach under different confidence threshold settings as a curve.

Figure 10: Comparison of sparse disparity maps regarding “Austr” and “ClassE” (with lighting variation) of the testing dataset on the Middlebury Stereo Evaluation site [29]. First column shows the ground truth, and columns 2 to 8 are the disparity maps generated by DCNN, TMAP [28], IDR [17], SGM [11], R-NCC (unpublished work), r200high [15] and ICSG [33] respectively.

Comparison with sparse stereo matching approaches: Almost all state-of-the-art sparse stereo matching approaches have submitted their results to the Middlebury evaluation site [29]. Our approach (referred to as “DCNN”) is ranked on “test sparse” under the “bad 2.0” category. We would like to emphasize that simply comparing error rates of sparse disparity maps does not offer the whole picture of algorithm performance, as it favors approaches that output fewer disparity values (i.e., more invalid pixels). For a fair comparison, a plot of non-occlusion error rates vs. invalid pixel rates is used to show the performance of different approaches on both the training and testing datasets; see Figure 9. The comparison suggests that our approach, under the selected confidence threshold setting, provides a very good balance between output disparity density and disparity accuracy. In addition, the plot on the training dataset also shows that, under the same output disparity density, our approach provides lower non-occlusion error rates than existing approaches. Figure 10 further visually compares the disparity maps generated by different approaches.
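The trade-off behind the error-rate versus invalid-pixel-rate plot can be reproduced on a toy example. The function below is a sketch under stated assumptions: both rates are computed over non-occluded pixels only, and a pixel counts as bad when its error exceeds 2.0 pixels (taken from the “bad 2.0” name). The exact definitions used by the evaluation site may differ.

```python
import numpy as np

def sparse_metrics(est_disp, gt_disp, valid_mask, nonocc_mask, bad_thresh=2.0):
    """Return (invalid pixel rate, error rate among kept non-occluded pixels)."""
    kept = valid_mask & nonocc_mask
    invalid_rate = 1.0 - float(kept.sum()) / float(nonocc_mask.sum())
    if kept.sum() == 0:
        return invalid_rate, 0.0
    bad = np.abs(est_disp - gt_disp)[kept] > bad_thresh
    return invalid_rate, float(bad.mean())

est = np.array([[0.0, 0.0, 0.0, 5.0]])   # last pixel is a gross mismatch
gt = np.zeros((1, 4))
nonocc = np.ones((1, 4), dtype=bool)
dense = np.ones((1, 4), dtype=bool)
sparse = np.array([[True, True, True, False]])  # mismatch filtered out

print(sparse_metrics(est, gt, dense, nonocc))
print(sparse_metrics(est, gt, sparse, nonocc))
```

Dropping the single bad pixel brings the error rate to zero at the cost of a higher invalid pixel rate, which is why the two quantities need to be read together.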

The root-mean-square (RMS) metric [29] is also used here for evaluation. Since squared errors are used, the RMS metric penalizes large disparity errors more strongly than the average absolute error (“avgerr”) metric. Our approach on the testing dataset currently ranks at the top under the “rms” category; see Table 2.
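A small worked example shows how the two metrics respond to one large error. The error lists are made-up values chosen so that both have the same average absolute error.

```python
import math

def avg_abs_error(errors):
    """Average absolute disparity error."""
    return sum(abs(e) for e in errors) / len(errors)

def rms_error(errors):
    """Root-mean-square disparity error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

uniform = [2.0, 2.0, 2.0, 2.0]   # four moderate errors
outlier = [0.0, 0.0, 0.0, 8.0]   # one large error, same total

print(avg_abs_error(uniform), avg_abs_error(outlier))
print(rms_error(uniform), rms_error(outlier))
```

Both lists have an average absolute error of 2.0, but the RMS of the list with the single large error is twice as high, so a method that leaves a few gross mismatches is ranked lower under RMS.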

Name RMS Name RMS
R-NCC(unpublished) INTS [13]
IDR [17] SGM [11]
Table 2: Comparisons of the state-of-the-art approaches under the RMS metric.

AUC evaluation: The Area Under the Curve (AUC) metric introduced by Hu and Mordohai [12] has been used as a metric for evaluating various confidence measures over the past few years. It measures how effectively the confidence measures can filter out mismatches under different parameter settings, rather than only checking the performance under one set of parameters. Since a large set of sparse disparity maps need to be evaluated, this measure can only be computed on datasets with published ground truth. Following the practice in [35], we train our dual-CNN approach only on the 13 additional image pairs with ground truth from Middlebury [29] and then test it on the 15 training image pairs. Our approach achieves a competitive mean AUC value compared to 0.0728, 0.0680 and 0.0637 attained respectively by the state-of-the-art approaches APKR [16], O1 [24] and CCNN [25] reported in [35], which compares various confidence measures on the raw disparity maps from [42].
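One way to compute such an area is sketched below: pixels are sorted by decreasing confidence, the error rate is tracked as more pixels are kept, and the resulting curve is integrated over density. This follows the general idea of the metric. Evaluating the curve at every pixel and integrating with the trapezoidal rule are choices made here, not details taken from [12].

```python
import numpy as np

def confidence_auc(confidence, is_error):
    """Area under the error-rate vs. density curve (lower is better)."""
    order = np.argsort(-confidence, kind="stable")
    errors_sorted = is_error[order].astype(np.float64)
    n = len(errors_sorted)
    kept = np.arange(1, n + 1)
    error_rate = np.cumsum(errors_sorted) / kept  # error rate at each density
    density = kept / n
    # Trapezoidal integration of error rate over density.
    return float(np.sum((error_rate[1:] + error_rate[:-1]) / 2.0 * np.diff(density)))

is_error = np.array([False, False, False, True])
good_conf = np.array([0.9, 0.8, 0.7, 0.1])  # the mismatch gets the lowest score
bad_conf = np.array([0.1, 0.2, 0.3, 0.9])   # the mismatch gets the highest score

print(confidence_auc(good_conf, is_error), confidence_auc(bad_conf, is_error))
```

A confidence measure that pushes mismatches to the end of the ranking keeps the error rate low for most densities and therefore yields a smaller area.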

5 Conclusion

A novel learning-based semi-dense stereo matching algorithm is presented in this paper. The algorithm employs two CNN models. The first CNN model evaluates how well two image patches match. It serves the same purpose as “MC-CNN-acrt”, but takes additional rank and companion transforms as input. These two transforms introduce global information and distinguishable patterns into the network; hence, areas with lighting changes and/or lack of textures can be more accurately matched. As a result, the optimal disparity values can be computed using the simplest WTA optimization. No complicated global disparity optimization algorithms or additional post-processing steps are required. The second CNN model is used for evaluating the generated disparity values and filtering out mismatches. Taking only one of the stereo images and the disparity map as input, the evaluation-Net can effectively label mismatches, without the need for heuristically designed processes such as left-right consistency checks and median filtering.

Our work suggests that, once sufficient information is fed to the network, CNN-based models can effectively predict the correct matches and detect mismatches. For future work, we plan to investigate how to reduce the training and labeling costs so that the algorithm can be applied to real-time applications. We also plan to apply the algorithm to multi-view stereo matching for 3D reconstruction applications.


  • [1] K. Batsos, C. Cai, and P. Mordohai. CBMV: A coalesced bidirectional matching volume for disparity estimation. arXiv preprint arXiv:1804.01967, 2018.
  • [2] J.-C. Bricola, M. Bilodeau, and S. Beucher. Morphological processing of stereoscopic image superimpositions for disparity map estimation. working paper or preprint, Mar. 2016.
  • [3] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
  • [4] F. Cheng, X. He, and H. Zhang. Learning to refine depth for robust stereo estimation. Pattern Recognition, 74:122–133, 2018.
  • [5] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker. Universal correspondence network. In Advances in Neural Information Processing Systems, pages 2414–2422, 2016.
  • [6] N. Einecke and J. Eggert. A two-stage correlation method for stereoscopic depth estimation. In Digital Image Computing: Techniques and Applications (DICTA), 2010 International Conference on, pages 227–234. IEEE, 2010.
  • [7] M. Gong and Y.-H. Yang. Fast stereo matching using reliability-based dynamic programming and consistency constraints. In ICCV, pages 610–617, 2003.
  • [8] M. Gong and Y.-H. Yang. Near real-time reliable stereo matching using programmable graphics hardware. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 924–931. IEEE, 2005.
  • [9] R. Haeusler, R. Nair, and D. Kondermann. Ensemble learning for confidence measures in stereo vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 305–312, 2013.
  • [10] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3279–3286. IEEE, 2015.
  • [11] H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence, 30(2):328–341, 2008.
  • [12] X. Hu and P. Mordohai. A quantitative evaluation of confidence measures for stereo vision. IEEE transactions on pattern analysis and machine intelligence, 34(11):2121–2133, 2012.
  • [13] X. Huang, Y. Zhang, and Z. Yue. Image-guided non-local dense matching with three-steps optimization. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, 3(3), 2016.
  • [14] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
  • [15] L. Keselman, J. I. Woodfill, A. Grunnet-Jepsen, and A. Bhowmik. Intel realsense stereoscopic depth cameras. arXiv preprint arXiv:1705.05548, 2017.
  • [16] S. Kim, D.-g. Yoo, and Y. H. Kim. Stereo confidence metrics using the costs of surrounding pixels. In Digital Signal Processing (DSP), 2014 19th International Conference on, pages 98–103. IEEE, 2014.
  • [17] J. Kowalczuk, E. T. Psota, and L. C. Perez. Real-time stereo matching on cuda using an iterative refinement method for adaptive support-weight correspondences. IEEE Transactions on Circuits and Systems for Video Technology, 23(1):94–104, Jan 2013.
  • [18] N. Lazaros, G. C. Sirakoulis, and A. Gasteratos. Review of stereo vision algorithms: from software to hardware. International Journal of Optomechatronics, 2(4):435–462, 2008.
  • [19] R. Manduchi and C. Tomasi. Distinctiveness maps for image matching. In Image Analysis and Processing, 1999. Proceedings. International Conference on, pages 26–31. IEEE, 1999.
  • [20] J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In ICCV Workshops, volume 7, 2017.
  • [21] H. Park and K. M. Lee. Look wider to match image patches with convolutional neural networks. IEEE Signal Processing Letters, 24(12):1788–1792, 2017.
  • [22] M.-G. Park and K.-J. Yoon. Leveraging stereo matching with learning-based confidence measures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 101–109, 2015.
  • [23] D. Peña and A. Sutherland. Disparity estimation by simultaneous edge drawing. In C.-S. Chen, J. Lu, and K.-K. Ma, editors, Computer Vision – ACCV 2016 Workshops, pages 124–135, Cham, 2017. Springer International Publishing.
  • [24] M. Poggi and S. Mattoccia. Learning a general-purpose confidence measure based on O(1) features and a smarter aggregation strategy for semi global matching. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 509–518. IEEE, 2016.
  • [25] M. Poggi and S. Mattoccia. Learning from scratch a confidence measure. In BMVC, 2016.
  • [26] M. Poggi and S. Mattoccia. Learning to predict stereo reliability enforcing local consistency of confidence maps. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
  • [27] M. Poggi, F. Tosi, and S. Mattoccia. Quantitative evaluation of confidence measures in a machine learning world. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), volume 206, page 17, 2017.
  • [28] E. T. Psota, J. Kowalczuk, M. Mittek, and L. C. Perez. Map disparity estimation using hidden markov trees. In Proceedings of the IEEE International Conference on Computer Vision, pages 2219–2227, 2015.
  • [29] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition, pages 31–42. Springer, 2014.
  • [30] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision, 47(1-3):7–42, 2002.
  • [31] T. Schmidt, R. Newcombe, and D. Fox. Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters, 2(2):420–427, 2017.
  • [32] A. Seki and M. Pollefeys. Patch based confidence prediction for dense disparity map. In E. R. H. Richard C. Wilson and W. A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 23.1–23.13. BMVA Press, September 2016.
  • [33] M. Shahbazi, G. Sohn, J. Théau, and P. Ménard. Revisiting intrinsic curves for efficient dense stereo matching. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, 3(3), 2016.
  • [34] K. Simonyan, A. Vedaldi, and A. Zisserman. Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1573–1585, 2014.
  • [35] F. Tosi, M. Poggi, A. Tonioni, L. Di Stefano, and S. Mattoccia. Learning confidence measures in the wild. In 28th British Machine Vision Conference (BMVC 2017), volume 2, 2017.
  • [36] T. Trzcinski, M. Christoudias, and V. Lepetit. Learning image descriptors with boosting. IEEE transactions on pattern analysis and machine intelligence, 37(3):597–610, 2015.
  • [37] O. Veksler. Extracting dense features for visual correspondence with graph cuts. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2003.
  • [38] C. S. Weerasekera, R. Garg, and I. Reid. Learning deeply supervised visual descriptors for dense monocular reconstruction. arXiv preprint arXiv:1711.05919, 2017.
  • [39] X. Ye, J. Li, H. Wang, H. Huang, and X. Zhang. Efficient stereo matching leveraging deep local and context information. IEEE Access, 5:18745–18755, 2017.
  • [40] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence. In Proceedings of the Third European Conference on Computer Vision, Volume II, ECCV ’94, pages 151–158, London, UK, 1994. Springer-Verlag.
  • [41] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 4353–4361. IEEE, 2015.
  • [42] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17:1–32, 2016.
  • [43] S. Zhang, W. Xie, G. Zhang, H. Bao, and M. Kaess. Robust stereo matching with surface normal prediction. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2540–2547. IEEE, 2017.