1 Introduction
Stereo matching aims to estimate correspondences of all pixels between two rectified images [1, 23, 6]. It is a core problem for many stereo vision tasks and has numerous applications in areas such as autonomous vehicles [29], robotics navigation [25], and augmented reality [33].
Stereo matching has been intensively investigated for several decades, and a popular four-step pipeline has been developed. This pipeline consists of matching cost calculation, matching cost aggregation, disparity calculation and disparity refinement [23]. The four-step pipeline dominates existing stereo matching algorithms [24, 22, 9, 21], and each of its steps is important to the overall stereo matching performance. Due to the powerful representation capability of deep convolutional neural networks (CNNs) for various vision tasks
[14, 31, 15], CNNs have been employed to improve stereo matching performance and significantly outperform traditional methods [28, 32, 12, 17, 16]. Zbontar and LeCun [32] first introduced a CNN to calculate the matching cost, which measures the similarity of two pixels from the two images. At the time, this method achieved the best performance on the KITTI 2012 [3], KITTI 2015 [18] and Middlebury [23, 24, 22, 9, 21] stereo datasets. They argued that it is unreliable to rely only on photometric differences between pixels or hand-crafted image features for the matching cost. In contrast, a CNN can learn more robust and discriminative features from images, and thus produces an improved stereo matching cost. Following this work [32], several methods were proposed to improve computational efficiency [16] or matching accuracy [28]. However, these methods still suffer from a few limitations. First, to calculate the matching cost at all potential disparities, the network has to perform multiple forward passes, resulting in a high computational burden. Second, pixels in occluded regions (i.e., regions visible in only one of the two images) cannot be used for training. It is therefore difficult to obtain reliable disparity estimates in these regions. Third, several heuristic post-processing steps are required to refine the disparity. The performance and generalization ability of these methods are therefore limited, as a number of parameters have to be chosen empirically.
Alternatively, the matching cost calculation, matching cost aggregation and disparity calculation steps can be seamlessly integrated into a CNN to directly estimate the disparity from stereo images [17, 12]. Traditionally, the matching cost aggregation and disparity calculation steps are solved by minimizing an energy function defined upon the matching costs. For example, the Semi-Global Matching (SGM) method [8] uses dynamic programming to optimize a path-wise form of the energy function in several directions. Both the energy function and its optimization process are hand-engineered. Different from these traditional methods, Mayer et al. [17] and Kendall et al. [12] directly stacked several convolution layers upon the matching costs, and trained the whole neural network to minimize the distance between the network output and the ground-truth disparity. These methods achieve higher accuracy and computational efficiency than methods that use a CNN for matching cost calculation only.
If all steps are integrated into a single network for joint optimization, better disparity estimation performance can be expected. However, it is non-trivial to integrate the disparity refinement step with the other three steps. Existing methods [4, 19] use additional networks for disparity refinement. Specifically, once the disparity is calculated by a CNN, one or more networks are introduced to model the joint space of the inputs (including the stereo images and the initial disparity) and the output (i.e., the refined disparity) to refine the disparity.
To bridge the gap between disparity calculation and disparity refinement, we propose to use feature constancy to identify the correctness of the initial disparity, and then to perform disparity refinement based on it. Here, “constancy” is borrowed from the area of optical flow estimation, where “grey value constancy” and “gradient constancy” are used [2]. “Feature constancy” denotes the correspondence of two pixels in feature space. Specifically, feature constancy includes two terms, i.e., feature correlation and reconstruction error. The correlation between features extracted from the left and right images is considered as the first feature constancy term; it measures the correspondence at all possible disparities. The reconstruction error in feature space, estimated using knowledge of the initial disparity, is considered as the second feature constancy term. The disparity refinement task then aims to improve the quality of the initial disparity given the feature constancy, which can be implemented by a small subnetwork. These terms will be further explained in Sec. 3.3. Experiments on the Scene Flow and KITTI datasets have shown the effectiveness of our disparity refinement approach. Our method seamlessly integrates disparity calculation and disparity refinement into one network for joint optimization, and improves the accuracy of the initial disparity by a notable margin. It achieves state-of-the-art performance on the KITTI 2012 and KITTI 2015 benchmarks, and runs with very high efficiency. The contributions of this paper can be summarized as follows: 1) We integrate all steps of stereo matching into one network to improve accuracy and efficiency; 2) We perform disparity refinement with a subnetwork using feature constancy; 3) We achieve state-of-the-art performance on the KITTI benchmarks.
2 Related Work
Over the last few years, CNNs have been introduced to solve various problems in stereo matching. Existing CNN-based methods can broadly be divided into the following three categories.
2.1 CNN for Matching Cost Learning
In this category, a CNN is used to learn the matching cost. Zbontar and LeCun [32] trained a CNN to compute the matching cost between two image patches (e.g., 9×9), which is followed by several post-processing steps, including cross-based cost aggregation, semi-global matching, left-right consistency check, sub-pixel enhancement, median filtering and bilateral filtering. This architecture needs multiple forward passes to calculate the matching cost at all possible disparities. Therefore, this method is computationally expensive. Luo et al. [16] introduced a product layer to compute the inner product between the two representations of a siamese architecture, and trained the network as multi-class classification over all possible disparities to reduce the computation time. Park and Lee [20] introduced a pixel-wise pyramid pooling scheme to enlarge the receptive field during the comparison of two input patches. This method produced a more accurate matching cost than [32]. Shaked and Wolf [28] deepened the network for matching cost calculation using a highway network architecture with multilevel weighted residual shortcuts. It was demonstrated that this architecture outperformed several networks, such as the base network from MC-CNN [32], the conventional highway network [30], ResNets [7], DenseNet [10], and the ResNets of ResNets [34].
2.2 CNN for Disparity Regression
In this category, a CNN is carefully designed to directly estimate the disparity, which enables end-to-end training. Mayer et al. [17] proposed an encoder-decoder architecture for disparity regression, where matching cost calculation is seamlessly integrated into the encoder part. The disparity is directly regressed in a single forward pass. Kendall et al. [12] applied 3D convolutions to the matching costs to incorporate contextual information, and introduced a differentiable “soft argmin” operation to regress the disparity. Both methods run very fast, taking 0.06 s and 0.9 s respectively on a single Nvidia GTX Titan X GPU. However, disparity refinement is not included in these networks, which limits their performance.
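The “soft argmin” of [12] regresses a disparity as the probability-weighted average over all candidate disparities, with probabilities obtained by a softmax over the negated matching costs. The following NumPy sketch illustrates the idea only; it is not the authors' implementation:

```python
import numpy as np

def soft_argmin(cost_volume):
    """Differentiable disparity regression over a cost volume.

    cost_volume: array of shape (D, H, W) holding the matching cost for
    each candidate disparity d = 0..D-1 at every pixel.
    Returns the expected disparity map (H, W): sum_d d * softmax(-cost)_d.
    """
    # Softmax over the disparity axis of the *negative* costs, so that
    # low-cost (well-matching) disparities receive high probability.
    c = -cost_volume
    c = c - c.max(axis=0, keepdims=True)  # subtract max for numerical stability
    p = np.exp(c) / np.exp(c).sum(axis=0, keepdims=True)
    disparities = np.arange(cost_volume.shape[0]).reshape(-1, 1, 1)
    return (p * disparities).sum(axis=0)
```

Because the operation is a weighted sum rather than a hard argmin, gradients flow through it, which is what makes the whole regression network trainable end-to-end.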
2.3 Multiple Subnetworks
In this category, different networks are used to handle the four steps of stereo matching. Shaked and Wolf [28] used their highway network for matching cost calculation and an additional global disparity CNN to replace the “winner-takes-all” strategy used in the conventional matching cost aggregation and disparity calculation steps. This method improves performance in several challenging situations, such as occluded, distorted, highly reflective and sparsely textured regions. Gidaris et al. [4] used the method in [16] to calculate the initial disparity, and then applied three additional neural networks for disparity refinement. Seki and Pollefeys [27] proposed SGM-Nets to learn the SGM parametrization, obtaining better penalties than the hand-tuned method used in MC-CNN [32]. Pang et al. [19] built their work upon [17] by cascading an additional network for disparity refinement.
In our work, we incorporate all steps into one network. As a result, all steps can share the same features and can be optimized jointly. Besides, we introduce feature constancy into our network for improved disparity refinement using both feature correlation and reconstruction error. It is clearly demonstrated that better disparity estimation performance can be achieved by our method.
3 Approach
Different from existing methods that use multiple networks for different steps of stereo matching, we incorporate all steps into a single network to enable end-to-end training. The proposed network consists of three parts: multiscale shared feature extraction, initial disparity estimation and disparity refinement. The framework of the proposed network is shown in Fig. 1, and the network architecture is described in Table 1.
3.1 Stem Block for Multiscale Feature Extraction
The stem block extracts multiscale shared features from the two input images for both the initial disparity estimation and disparity refinement subnetworks. It contains two convolution layers with a stride of 2 to reduce the resolution of the inputs, and two deconvolution layers to upsample the outputs of these convolution layers back to full resolution. The upsampled features are fused through an additional convolution layer. An illustration is shown in Figure 1. The outputs of this stem block can be divided into three types:
The outputs of the second convolution layer (i.e., conv2a for the left image and conv2b for the right image). Correlation with a large maximum displacement (i.e., 40) is performed between conv2a and conv2b to capture the long-range but coarse-grained correspondence between the two images. It is used by the first subnetwork for initial disparity estimation.

The outputs of the first convolution layer (i.e., conv1a and conv1b). They are first compressed to fewer channels through a convolution layer with a kernel size of 3×3 to obtain c_conv1a and c_conv1b, on which correlation with a small maximum displacement (i.e., 20) is performed to capture short-range but fine-grained correspondence, which is complementary to the former correlation. Besides, these features also act as the first feature constancy term used by the second subnetwork.

Multiscale fusion features (i.e., the fused features of the left and right images). They are first used as skip-connection features to bring detailed information into the first subnetwork. They are then used to calculate the second feature constancy term for the second subnetwork.
3.2 Initial Disparity Estimation Subnetwork
This subnetwork generates a disparity map from the correlation results and the left-image features (corr1d and conv_redir in Table 1) through an encoder-decoder architecture, which is inspired by DispNetCorr1D [17]. DispNetCorr1D can only output disparity at half resolution. By using the full-resolution multiscale fusion features as skip-connection features, we are able to estimate an initial disparity at full resolution. The multiscale fusion features are also used to calculate the reconstruction error, as will be described in Sec. 3.3. In this subnetwork, a correlation layer is first introduced to calculate the matching costs in feature space. There is a trade-off between accuracy and computational efficiency in matching cost calculation: if the matching cost is calculated using high-level features, more details are lost and several similar correspondences cannot be distinguished; in contrast, if it is calculated using low-level features, the computational cost is high since the feature maps are large, and the receptive field is too small to capture robust features.
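A 1D correlation layer of this kind can be sketched in NumPy as below. For simplicity, this illustration considers only one-sided displacements 0..max_disp, whereas the corr1d layer in Table 1 outputs 81 channels (i.e., two-sided displacements up to 40); it is a sketch of the operation, not the network's actual implementation:

```python
import numpy as np

def correlation_1d(left, right, max_disp):
    """1D correlation between left/right feature maps of shape (C, H, W).

    For each candidate disparity d in 0..max_disp, shift the right
    feature map by d and take the mean inner product over channels,
    yielding a (max_disp + 1, H, W) matching-cost volume.
    """
    C, H, W = left.shape
    out = np.zeros((max_disp + 1, H, W), dtype=left.dtype)
    for d in range(max_disp + 1):
        # Pixel x in the left image is compared with pixel x - d in the right.
        out[d, :, d:] = (left[:, :, d:] * right[:, :, :W - d]).mean(axis=0)
    return out
```

A single pass over all displacements replaces the multiple forward passes needed by patch-comparison networks, which is the efficiency argument made above.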
The matching cost is then concatenated with features from the left image. By this concatenation, we expect the subnetwork to consider low-level semantic information when performing disparity estimation over the matching costs. This, to some extent, helps aggregate the matching costs and improves disparity estimation.
Disparity estimation is performed in the decoder part at different scales, where a skip connection is introduced at each scale, as illustrated in Table 1. For the sake of computational efficiency, the multiscale fusion features (described in Sec. 3.1) are only skip-connected to the last layer of the decoder to perform full-resolution disparity estimation. This subnetwork is called DES-net.
Type  Name  k  s  c I/O  Input
Stem Block for Multiscale Shared Feature Extraction
Conv  -  7  2  3/64  -
Deconv  -  4  2  64/32  -
Conv  -  5  2  64/128  -
Deconv  -  8  4  128/32  -
Conv  -  1  1  64/32  -
Initial Disparity Estimation Subnetwork  
Corr  corr1d  1  1  128/81  conv2a, conv2b  
Conv  conv_redir  1  1  128/64  conv2a  
Conv  conv3  3  2  145/256  corr1d+conv_redir  
Conv  conv3_1  3  1  256/256  conv3  
Conv  conv4  3  2  256/512  conv3_1  
Conv  conv4_1  3  1  512/512  conv4  
Conv  conv5  3  2  512/512  conv4_1  
Conv  conv5_1  3  1  512/512  conv5  
Conv  conv6  3  2  512/1024  conv5_1  
Conv  conv6_1  3  1  1024/1024  conv6  
Conv  disp6  3  1  1024/1  conv6_1  
Deconv  uconv5  4  2  1024/512  conv6_1  
Conv  iconv5  3  1  1025/512  uconv5+disp6+conv5_1  
Conv  disp5  3  1  512/1  iconv5  
Deconv  uconv4  4  2  512/256  iconv5  
Conv  iconv4  3  1  769/256  uconv4+disp5+conv4_1  
Conv  disp4  3  1  256/1  iconv4  
Deconv  uconv3  4  2  256/128  iconv4  
Conv  iconv3  3  1  385/128  uconv3+disp4+conv3_1  
Conv  disp3  3  1  128/1  iconv3  
Deconv  uconv2  4  2  128/64  iconv3  
Conv  iconv2  3  1  193/64  uconv2+disp3+conv2a  
Conv  disp2  3  1  64/1  iconv2  
Deconv  uconv1  4  2  64/32  iconv2  
Conv  iconv1  3  1  97/32  uconv1+disp2+conv1a  
Conv  disp1  3  1  32/1  iconv1  
Deconv  uconv0  4  2  32/32  iconv1  
Conv  iconv0  3  1  65/32  -
Conv  disp0  3  1  32/1  iconv0  
Disparity Refinement Subnetwork  
Warp  w_up_1b2b      32/32  up_1b2b  
Conv  r_conv0  3  1  65/32  -
Conv  r_conv1  3  2  32/64  r_conv0  
Conv  -  3  1  64/16  -
Corr  r_corr  1  1  16/41  c_conv1a, c_conv1b  
Conv  r_conv1_1  3  1  105/64  r_conv1+r_corr  
Conv  r_conv2  3  2  64/128  r_conv1_1  
Conv  r_conv2_1  3  1  128/128  r_conv2  
Conv  r_res2  3  1  128/1  r_conv2_1  
Deconv  r_uconv1  4  2  128/64  r_conv2_1  
Conv  r_iconv1  3  1  127/64  r_uconv1+r_res2+r_conv1_1  
Conv  r_res1  3  1  64/1  r_iconv1  
Deconv  r_uconv0  4  2  64/32  r_iconv1  
Conv  r_iconv0  3  1  65/32  r_uconv0+r_res1+r_conv0  
Conv  r_res0  3  1  32/1  r_iconv0 
3.3 Disparity Refinement Subnetwork
Although the disparity map estimated in Sec. 3.2 is already good, it still suffers from several challenges, such as depth discontinuities and outliers. Consequently, disparity refinement is required to further improve the depth estimation performance.
In this paper, we perform disparity refinement using feature constancy. Specifically, after obtaining the initial disparity d_0 using DES-net, we calculate the two feature constancy terms, i.e., the feature correlation f_c and the reconstruction error f_e. The task of disparity refinement is then to obtain the refined disparity d_r from these three types of information, i.e.,

d_r = g(d_0, f_c, f_e),    (1)

where g denotes the mapping learned by the refinement subnetwork. Specifically, the first feature constancy term f_c is calculated as the correlation between the feature maps of the left and right images. It measures the correspondence of the two feature maps at all displacements within the considered disparity range, and produces large values at correct disparities. The second feature constancy term f_e is calculated as the reconstruction error of the initial disparity, i.e., the absolute difference between the multiscale fusion features (Sec. 3.1) of the left image and the back-warped features of the right image. Note that, to calculate f_e, only one displacement is applied at each location in the feature maps, determined by the corresponding value of the initial disparity. If the reconstruction error is large, the estimated disparity is more likely to be incorrect or to come from an occluded region.
In practice, given the initial disparity produced by the disparity estimation subnetwork (Sec. 3.2), the disparity refinement subnetwork estimates a residual with respect to the initial disparity. The sum of the residual and the initial disparity is taken as the refined disparity map. Since both the initial disparity and the two feature constancy terms are used to produce the disparity map (as shown in Eq. 1), the disparity estimation performance is expected to improve. This subnetwork is called DRS-net. Note that, since the four steps of stereo matching are integrated into a single CNN, end-to-end training is ensured.
3.4 Iterative Refinement
To extract more information from the multiscale fusion features and to further improve disparity estimation accuracy, an iterative refinement approach is proposed. Specifically, the refined disparity map produced by the second subnetwork (Sec. 3.3) is considered as a new initial disparity map, and the feature constancy calculation and disparity refinement processes are then repeated to obtain a new refined disparity. This procedure is repeated until the improvement between two iterations becomes small. Note that, as the number of iterations increases, the improvement decreases.
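The iteration scheme above can be sketched as follows, where residual_fn is a hypothetical stand-in for the weight-shared refinement forward pass (in the real network it also recomputes the feature constancy terms from the current disparity):

```python
def refine_iteratively(initial_disparity, residual_fn, num_iters=2):
    """Iterative disparity refinement with a shared refinement step.

    The same residual_fn (same weights) is applied at every iteration,
    mirroring the stacking of an identical refinement subnetwork on top
    of the whole network rather than training multiple copies.
    """
    d = initial_disparity
    for _ in range(num_iters):
        d = d + residual_fn(d)  # refined disparity = current + predicted residual
    return d
```

Because the weights are shared, iterating adds no parameters; only the forward cost grows with the number of iterations.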
4 Experiments
In this section, we evaluate our method, iResNet (iterative residual prediction network), on two datasets, i.e., Scene Flow [17] and KITTI [3, 18]. The Scene Flow dataset [17] is a synthetic dataset containing 35,454 training image pairs and 4,370 testing image pairs. Dense ground-truth disparities are provided for both the training and testing sets. Besides, this dataset is sufficiently large to train a model without overfitting. Therefore, the Scene Flow dataset [17] is used to investigate different aspects of our method in Sec. 4.1. The KITTI dataset includes two subsets, i.e., KITTI 2012 and KITTI 2015. The KITTI 2012 dataset consists of 194 training image pairs and 195 test image pairs, while the KITTI 2015 dataset consists of 200 training image pairs and 200 test image pairs. These images were recorded in real scenes under different weather conditions. Our method is further compared to state-of-the-art methods on the KITTI dataset (Sec. 4.2), where it achieves the best results.
Our method was implemented in Caffe [11]. All models were optimized using the Adam method [13] (with β2 = 0.999) and a batch size of 2. A “multistep” learning rate schedule was used for training. Specifically, for training on the Scene Flow dataset, the learning rate was halved at the 20k-th, 35k-th and 50k-th iterations, and training was stopped at the 65k-th iteration. This training procedure was repeated for an additional round to further optimize the model. For fine-tuning on the KITTI dataset, the learning rate was kept fixed for the first 20k iterations and then reduced for the subsequent 120k iterations. Data augmentation was also conducted during training, including spatial and chromatic transformations. The spatial transformations include rotation, translation, cropping and scaling, while the chromatic transformations include color, contrast and brightness transformations. This data augmentation helps to learn a model that is robust against illumination changes and noise.
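The “multistep” schedule for Scene Flow training can be sketched as below. The base learning rate is left as a parameter, since its value is not restated in this section; the milestones and halving factor follow the description above:

```python
def multistep_lr(iteration, base_lr, milestones=(20000, 35000, 50000), gamma=0.5):
    """'Multistep' learning-rate schedule: start at base_lr and multiply
    by gamma (here a halving) at each milestone iteration."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr
```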
4.1 Ablation Experiments
In this section, we present several ablation experiments on the Scene Flow dataset to justify our design choices. For evaluation, we use the end-point error (EPE), calculated as the average Euclidean distance between the estimated and ground-truth disparities. We also use the percentage of disparities whose EPE is larger than t pixels (>t px).
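For reference, both metrics are straightforward to compute with NumPy (assuming dense ground truth, as on Scene Flow):

```python
import numpy as np

def epe(pred, gt):
    """End-point error: mean absolute disparity difference."""
    return np.abs(pred - gt).mean()

def error_rate(pred, gt, threshold):
    """Fraction of pixels whose end-point error exceeds `threshold` px."""
    return (np.abs(pred - gt) > threshold).mean()
```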
4.1.1 Multiscale Skip Connection
In Sec. 3, multiscale skip connections are used to introduce features from different levels to improve disparity estimation and refinement performance. To demonstrate their effectiveness, the multiscale skip connection scheme of our network was replaced by a single-scale skip connection scheme; the comparative results are shown in Table 3. It can be observed that the multiscale skip connection scheme outperforms its single-scale counterpart, with the EPE being reduced from 2.55 to 2.50. This is because the output of the first convolution layer contains high-frequency information, so it produces high reconstruction errors both in regions along object boundaries and in regions with large color changes. Note that regions on an object surface far from boundaries usually have a very accurate initial disparity estimate (i.e., the true reconstruction error is small), even though large color changes occur due to texture variation. Therefore, the reconstruction errors given by the first convolution layer for these regions are inaccurate. In this case, multiscale skip connections improve the reliability of the resulting reconstruction errors. Besides, introducing high-level features is also useful for feature constancy calculation, as higher-level features leverage more context information with a wide field of view.
4.1.2 Feature Constancy for Disparity Refinement
Model  >1 px  >3 px  >5 px  EPE
Single-scale  10.90  5.23  3.74  2.55
Multiscale  10.24  4.93  3.54  2.50
Method  EPE  Params.  Run time (ms)
CRL [19]  1.60  78.77M  162
iResNet  1.40  43.11M  90
To seamlessly integrate the initial disparity estimation subnetwork (DES-net) and the disparity refinement subnetwork (DRS-net) into a single network, the feature constancy calculation and its subsequent subnetwork play an important role. To demonstrate their effectiveness, we first removed all feature constancy terms used in our network (as shown in Table 1) and then retrained the model. The results are shown in Table 2. It can be observed that if no feature constancy is introduced for disparity refinement, the performance improvement is very small, with the EPE being reduced only from 2.81 to 2.72.
Then, we evaluated the importance of the three types of information in Eq. (1), i.e., the initial disparity, the feature correlation, and the reconstruction error produced by the initial disparity, as explained in Sec. 3. The results are shown in Table 2. It is observed that the reconstruction error plays the major role in the performance improvement. If the reconstruction error is removed, the EPE increases from 2.50 to 2.70. This is because this term provides knowledge about the quality of the initial disparity; that is, regions with a poor initial disparity can be identified and then specifically refined. Besides, removing the initial disparity or the feature correlation from the disparity refinement subnetwork slightly degrades the overall performance, with the EPE increasing from 2.50 to 2.56 and 2.61, respectively. If all three parts are incorporated, the disparity refinement network achieves the best performance.
4.1.3 Iterative Refinement
Iterative feature constancy calculation can further improve the overall performance. In practice, we do not train multiple DRS-nets. Instead, we directly stack another DRS-net, whose weights are exactly the same as those of the original DRS-net, on top of the whole network. As the number of iterations increases, the performance improvement decreases rapidly. Specifically, the first iteration reduces the EPE from 2.50 to 2.46, while the second iteration only reduces it from 2.46 to 2.45. Moreover, there is no obvious performance improvement after the third iteration of refinement. It can be concluded that: 1) our framework can efficiently extract useful information for disparity estimation using feature constancy; 2) the information contained in the feature space is still limited. Therefore, it may be possible to improve the disparity refinement performance by introducing more powerful features.
To further demonstrate the effectiveness of iterative refinement, the disparity estimation results for two iterations are shown in Fig. 2. It can be observed that many details cannot be accurately estimated in the initial disparity, e.g., the areas marked by red rectangles. However, after two iterations of refinement, most inaccurate estimates are corrected, and the refined disparity map looks smoother.
4.1.4 Feature Constancy vs. Color Constancy
In this experiment, the superiority of feature constancy over color constancy for disparity refinement is demonstrated. Our method was compared to the cascade residual learning (CRL) method [19], which calculates the reconstruction error in the color space. Intuitively, calculating the reconstruction error in the feature space should produce more robust results, since the learned features are less sensitive to noise and luminance changes. Besides, by sharing features with the first network, the second network can be made shallower, which improves the effectiveness of the features and reduces the running time. CRL uses one network (i.e., DispNetC) for disparity calculation and an additional network (i.e., DispResNet) for disparity refinement. For a fair comparison, we also use DispNetC [17] without any fine-tuning as our disparity estimation subnetwork. Note that, in their experiments on the Scene Flow dataset, disparity images (and their corresponding stereo pairs) with more than 25% of their disparity values greater than 300 were removed. We followed the same protocol in this comparison. Comparative results are shown in Table 4. The EPE result of CRL is taken from the paper [19], and its run time was tested on an Nvidia Titan X (Pascal) using our implementation. It can be seen that our method (using feature constancy) significantly outperforms CRL (using color constancy): the EPE values achieved by our iResNet method and the CRL method are 1.40 and 1.60, respectively. Moreover, our method also requires fewer parameters and less computation time.
4.2 Benchmark Results
We further compared our method to several existing methods on the KITTI 2012 and KITTI 2015 benchmarks, including GC-Net [12], L-ResMatch [28], SGM-Net [27], SsSMnet [35], PBCP [26], Displets [5], and MC-CNN [32].
For the evaluation on KITTI 2012, we used the percentage of erroneous pixels in non-occluded (Out-Noc) and all (Out-All) areas. Here, a pixel is considered erroneous if its disparity EPE is larger than t pixels (>t px). For the evaluation on KITTI 2015, we used the percentage of erroneous pixels in the background (D1-bg), in the foreground (D1-fg), or over all pixels (D1-all), in the non-occluded or all regions. Here, a pixel is considered correct if its disparity EPE is less than 3 pixels or 5% of its ground-truth disparity.
The results are shown in Tables 5 and 6. For our method, the results of both the basic model (without disparity refinement) and the final iResNet model (with two iterations of disparity refinement) are presented. It is clear that the disparity refinement subnetwork consistently improves performance by a notable margin. Moreover, our iResNet model achieves the best disparity estimation performance on both the KITTI 2012 and KITTI 2015 datasets in different scenarios. Note that our method is also highly efficient; it achieves the shortest run time on the KITTI 2012 dataset. The overall run time tested on a single Nvidia Titan X (Pascal) GPU is only 0.12 s.
Figure 3 illustrates a few qualitative results achieved by our method and several state-of-the-art methods on the KITTI 2015 dataset. It can be observed that our method produces smoother disparity estimates, as shown in the rectangles marked in Figure 3. On the one hand, our disparity refinement subnetwork can effectively improve the quality of the initial disparity estimated by DES-net, with many details being recovered. On the other hand, our method gives much better results than other methods in the upper parts of these images, as denoted by the red rectangles in Fig. 3. In fact, the upper parts of these images correspond to sky or regions beyond the working distance of the lidar. In that case, ground-truth disparity cannot be provided for these regions during training, causing learning-based methods to deteriorate. From Fig. 3, we can see that the performance of other methods in these regions is relatively poor. However, our method can still provide acceptable results, with more accurate disparity estimates along boundaries. This also indicates that our method generalizes well to unseen data. To further demonstrate this generalization capability, the iResNet-i2, CRL [19] and DispNetC [17] models trained on the KITTI 2015 training set are tested on the KITTI 2015 and 2012 test sets without fine-tuning. The achieved D1-all errors are shown in Table 7. It is clear that our method exhibits the smallest performance drop.
5 Conclusion
In this work, we propose a network architecture that integrates the four steps of stereo matching for joint training. Our network first estimates an initial disparity, and then uses two feature constancy terms to refine it. The refinement is performed using both feature correlation and reconstruction error, which makes the network easy to optimize. Experimental results show that the proposed method achieves state-of-the-art disparity estimation performance on the KITTI 2012 and KITTI 2015 datasets. Moreover, our method is also computationally highly efficient.
References
 [1] S. T. Barnard and M. A. Fischler. Computational stereo. ACM Computing Surveys, 14(4):553–572, 1982.

 [2] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision, volume 3024, pages 25–36, 2004.
 [3] A. Geiger. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012.
 [4] S. Gidaris and N. Komodakis. Detect, replace, refine: Deep structured prediction for pixel wise labeling. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [5] F. Guney and A. Geiger. Displets: Resolving stereo ambiguities using object knowledge. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4165–4175, 2015.
 [6] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
 [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

 [8] H. Hirschmüller. Stereo processing by semi-global matching and mutual information. IEEE Transactions on Pattern Analysis & Machine Intelligence, 30(2):328, 2008.
 [9] H. Hirschmüller and D. Scharstein. Evaluation of cost functions for stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
 [10] G. Huang, Z. Liu, K. Q. Weinberger, and V. D. M. Laurens. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 [12] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. Endtoend learning of geometry and context for deep stereo regression. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

 [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, pages 1097–1105, 2012.
 [15] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

 [16] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5695–5703, 2016.
 [17] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
 [18] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3061–3070, 2015.
 [19] J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan. Cascade residual learning: A twostage convolutional neural network for stereo matching. In International Conference on Computer Vision Workshop, 2017.
 [20] H. Park and K. M. Lee. Look wider to match image patches with convolutional neural networks. IEEE Signal Processing Letters, PP(99):1–1, 2017.
 [21] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nei, X. Wang, and P. Westling. Highresolution stereo datasets with subpixelaccurate ground truth. In German Conference on Pattern Recognition, pages 31–42, 2014.
 [22] D. Scharstein and C. Pal. Learning conditional random fields for stereo. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
 [23] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense twoframe stereo correspondence algorithms. International Journal of Computer Vision, 47:7–42, 2002.
 [24] D. Scharstein and R. Szeliski. Highaccuracy stereo depth maps using structured light. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 195–202, 2003.
 [25] K. Schmid, T. Tomic, F. Ruess, H. Hirschmüller, and M. Suppa. Stereo vision based indoor/outdoor navigation for flying robots. IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3955–3962, 2013.
 [26] A. Seki and M. Pollefeys. Patch based confidence prediction for dense disparity map. In British Machine Vision Conference, volume 10, 2016.
 [27] A. Seki and M. Pollefeys. SGM-Nets: Semi-global matching with neural networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
 [28] A. Shaked and L. Wolf. Improved stereo matching with constant highway networks and reflective confidence learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [29] S. Sivaraman and M. M. Trivedi. A review of recent developments in visionbased vehicle detection. IEEE Intelligent Vehicles Symposium, pages 310–315, 2013.
 [30] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.
 [31] N. Wang and D. Y. Yeung. Learning a deep compact image representation for visual tracking. Advances in Neural Information Processing Systems, pages 809–817, 2013.

 [32] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(132):2, 2016.
 [33] N. Zenati and N. Zerhouni. Dense stereo matching with application to augmented reality. In IEEE International Conference on Signal Processing and Communications, pages 1503–1506, 2008.
 [34] K. Zhang, M. Sun, X. Han, X. Yuan, L. Guo, and T. Liu. Residual networks of residual networks: Multilevel residual networks. IEEE Transactions on Circuits and Systems for Video Technology, PP(99):1–1, 2016.
 [35] Y. Zhong, Y. Dai, and H. Li. Selfsupervised learning for stereo matching with selfimproving ability. CoRR, abs/1709.00930, 2017.