1 Introduction
Stereo matching is a classic yet important problem for many computer vision tasks (e.g., 3D reconstruction [7] and autonomous vehicles [6]). Given a rectified image pair captured by stereo cameras, one aims to estimate the disparity of each pixel between the two images. Traditionally, a stereo matching pipeline starts with matching cost computation and cost aggregation; further optimization and refinement lead to the output disparity [13]. Recent advances in deep learning have inspired many end-to-end convolutional neural networks (CNNs) for stereo matching, e.g., [18, 23]. Unlike the traditional approach, an end-to-end CNN integrates the stereo matching pipeline into a holistic deep architecture by learning from the training data. Under confined scenarios with proper training data (e.g., the KITTI dataset [6]), end-to-end deep stereo models achieve unprecedented state-of-the-art performance.
However, it remains difficult to generalize a pretrained deep stereo model to a novel scenario. Firstly, the contents in the source domain may have very different characteristics from those in the target domain. Moreover, real stereo pairs collected with different stereo modules suffer from several degradations (e.g., noise corruption, photometric distortions, imperfections in rectification) to different extents. Directly feeding a stereo pair of the target domain to a CNN pretrained on another domain deteriorates its performance significantly. Consequently, state-of-the-art approaches, e.g., [18, 28], train their models with synthetic datasets [23], then perform fine-tuning on a smaller amount of domain-specific data with ground-truths. Unfortunately, apart from a few public research datasets, e.g., the KITTI dataset [6] and the Middlebury dataset [32], it is expensive and troublesome to collect real stereo pairs with accurate ground-truth disparities.
To resolve this dilemma, we propose a self-adaptation approach to generalize deep stereo matching methods to novel domains. We utilize synthetic training data and stereo pairs of the target domain, where only the synthetic data have known disparity maps. Our approach is compatible with end-to-end deep stereo methods, e.g., [23, 28], guiding a pretrained model to gradually adapt to the target scenario. We start our explorations by feeding real stereo pairs from different domains to models pretrained with synthetic data, resulting in two empirical observations:

(i) Generalization glitches: a pretrained model does not generalize well on the target domain; the produced disparity maps can be blurry at object edges and erroneous at ill-posed regions;

(ii) Scale diversity: feeding a properly upsampled stereo pair (the same stereo pair at a finer scale) leads to another disparity map with more meaningful details, e.g., sharper object boundaries and more high-frequency contents of the scene.
To avoid the issues of (i) while exploiting the benefits of (ii), we propose an iterative regularization scheme for fine-tuning deep stereo matching models.
We formulate the CNN training as an iterative optimization problem with graph Laplacian regularization. On one hand, we let the CNN learn its own finer-grain output; on the other hand, a graph Laplacian regularization is imposed to discriminatively retain the useful edges while smoothing out the undesired artifacts. Our formulation, composed of a data term and a smoothness term, is solved iteratively, leading to a model well suited for the novel domain (e.g., Figure 1). The proposed self-adaptation approach is called zoom and learn, or ZOLE for short. We demonstrate the effectiveness of our approach on two different domains: daily scenes collected by smartphone cameras, and street views captured from the perspective of a driving car.
2 Related Works
We first review several stereo matching algorithms based on convolutional neural networks (CNNs). We then turn to related works on graph Laplacian regularization and iterative regularization/filtering.
Deep stereo algorithms: Recent breakthroughs in deep learning have reshaped the paradigms of many computer vision tasks, including stereo matching. Early works employing CNNs for stereo matching focus on learning a robust similarity measure for matching cost computation, e.g., [11, 37]. To produce disparity maps, modules in the traditional stereo matching pipeline are still indispensable. The remarkable work DispNet, proposed by Mayer et al. [23], is the first end-to-end CNN approach for stereo matching, where an encoder-decoder architecture is employed for supervised learning. Other recent works with leading performance include CRL [28], GC-Net [18], DRR [8], etc. These works explore different CNN architectures tailor-made for stereo matching. They achieve superior results on the KITTI 2015 stereo benchmark [6], a benchmark containing driving scenes. Despite the success of these methodologies, to adopt them in a novel domain, it is necessary to fine-tune the models with new domain-specific data. Unfortunately, in practice, it is very difficult to collect accurate disparity maps for training [6, 32].
To mitigate this problem, some recent works proposed semi-supervised or unsupervised approaches to train a CNN model for stereo matching (or its related problem, monocular depth estimation). This category of works is essentially based on left-right consistency/warping, e.g., [10, 19, 38, 39]. For instance, one may synthesize the left (or right) view according to the estimated left (or right) disparity and the right (or left) view for computing a loss function. However, left-right consistency becomes vulnerable when the stereo pairs are imperfect, e.g., when the two views have different photometric distortions. Another line of research, by Tonioni et al. [34], proposes to fine-tune a pretrained model to achieve domain adaptation. Their method relies on the results of other stereo methods and confidence measures. Our work also performs fine-tuning with a pretrained stereo model. In contrast, we do not rely on external models or setups: our self-supervised domain adaptation method lets the CNN discriminatively learn the useful details from its own finer-grain outputs.
Other related works: According to [21, 33], graph Laplacian regularization is particularly useful for the recovery of piecewise smooth signals, e.g., disparity maps. With an appropriate graph, edges can be preserved while undesired defects are suppressed [26, 27]. Hence, we propose to apply graph Laplacian regularization to selectively learn and preserve the meaningful details from the higher-resolution disparity outputs.
Iterative regularization/filtering is an important technique in classic image restoration [17, 24, 25]. To restore a corrupted image, it is regularized iteratively through a variational formulation, so that its quality improves at each iteration. To utilize scale diversity while avoiding generalization glitches (as mentioned in Section 1), we embed iterative regularization into the CNN training process, making the model parameters improve gradually. Different from iterative refinement via a stacked neural network architecture, e.g., [15, 35], our iterative process occurs during training.
3 Observations
We first present two phenomena observed by feeding real-world stereo pairs in different domains to deep stereo models pretrained with synthetic datasets (e.g., FlyingThings3D [23], MPI Sintel [3], Virtual KITTI [5]). Underlying reasons for these phenomena will also be presented. We choose the off-the-shelf DispNet [23] architectures, both the one with explicit correlation (DispNetC) and the one based on convolution only (DispNetS), for our discussions. Their encoder-decoder architectures are representative and also widely used in the deep learning literature, e.g., [1, 22, 31].
3.1 Generalization Glitches
In general, a stereo model pretrained with synthetic data does not perform well on real stereo data in a particular domain. Firstly, the contents of the synthetic data may differ from those of the target domain. Moreover, real stereo pairs inevitably suffer from defects arising from the imaging process. For instance, they are likely corrupted by noise. Besides, the two views may have different photometric distortions due to inconsistency between the two cameras. In some cases, the stereo pair may not even be well rectified, e.g., two corresponding pixels are not on the same scanline. All the above factors deteriorate the performance of a model pretrained with synthetic data.
For illustration, we use smartphones equipped with two rear-facing cameras to collect a few stereo pairs (of size 1024 × 1024), then perform the following tests. We first adopt the released DispNetC model pretrained with the FlyingThings3D dataset [23]. Since stereo pairs of smartphones have small disparity values, we also fine-tune a model from the released model, where we remove those FlyingThings3D stereo pairs with maximum disparity larger than 80. Data augmentation is introduced for the two views individually during training; please refer to Section 5 for more details. The resulting model is called DispNetC80. Both DispNetC and DispNetC80 perform very well on the FlyingThings3D dataset, but are problematic when applied to real smartphone data. Figure 1 shows a few disparity estimates of DispNetC and DispNetC80. As can be seen, the results are blurry at object edges. Moreover, at ill-posed regions, i.e., object occlusions, repeated patterns, and textureless regions, the disparity maps are erroneous. In this work we call these mistakes generalization glitches, meaning the mistakes that a deep stereo model (pretrained with synthetic data) makes when it is applied to real stereo pairs of a certain domain.
3.2 Scale Diversity
In spite of the unpleasant generalization glitches, we find that deep stereo models have an encouraging property. Suppose we have a stereo pair $\mathcal{P} = (L, R)$, where $L$ and $R$ are the left and the right views, respectively. We denote a deep stereo model parameterized by $\Theta$ as $\mathcal{S}(\cdot\,; \Theta)$. Applying it to the stereo pair leads to a disparity map $D = \mathcal{S}(\mathcal{P}; \Theta)$. The operation of upsampling by $r$ times is denoted as $\uparrow(\cdot, r)$, while downsampling by $r$ times is $\downarrow(\cdot, r)$. By passing an upsampled stereo pair to $\mathcal{S}$ and then downsampling the result, we obtain another disparity map, $D_r$, of the same size as $D$,
$$D_r = \frac{1}{r} \downarrow\big(\mathcal{S}(\uparrow(\mathcal{P}, r); \Theta), r\big). \qquad (1)$$
Note that after downsampling, the factor $1/r$ is necessary for giving $D_r$ the correct scaling. Compared to $D$, $D_r$ usually contains more high-frequency details. To see this, we apply the released DispNetC model to a few stereo pairs captured by smartphones, all brought to a common original size. For each of them, we estimate three disparity maps based on (1) with three different values of $r$. Visual results are shown in Figure 2. We see that as $r$ grows, more fine details are produced on the disparity maps.
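The zooming operation of (1) can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: `resize_nearest`, `zoomed_disparity` and `toy_model` are hypothetical names, nearest-neighbour resampling stands in for whatever interpolation is actually used, and `toy_model` is a placeholder for a real network such as DispNetC.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array (assumed resampler)."""
    in_h, in_w = img.shape[:2]
    rows = np.minimum((np.arange(out_h) * in_h / out_h).astype(int), in_h - 1)
    cols = np.minimum((np.arange(out_w) * in_w / out_w).astype(int), in_w - 1)
    return img[rows][:, cols]

def zoomed_disparity(model, left, right, r):
    """Upsample the pair by r, run the model, downsample the result, divide by r."""
    h, w = left.shape[:2]
    up_h, up_w = int(round(h * r)), int(round(w * r))
    disp_up = model(resize_nearest(left, up_h, up_w),
                    resize_nearest(right, up_h, up_w))
    # Disparities are measured in pixels, so they grow by r with the image width;
    # dividing by r restores the scale of the original resolution.
    return resize_nearest(disp_up, h, w) / r

def toy_model(left, right):
    """Placeholder network: returns a disparity proportional to the image width."""
    h, w = left.shape[:2]
    return np.full((h, w), 0.1 * w)

left = np.zeros((8, 10, 3))
print(zoomed_disparity(toy_model, left, left, r=2.0).shape)
```

With a real network, the output at `r > 1` would differ from the output at `r = 1` in its high-frequency content; the toy model only checks that the scaling by `1/r` is consistent.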
Network / resolution  896     1280    1664   2048   2432
DispNetC              14.26%  9.97%   8.81%  9.17%  10.53%
DispNetS              18.95%  11.61%  9.18%  8.64%  9.08%
However, a larger upsampling ratio does not necessarily mean better results. For further inspection, we adopt the released DispNetC and DispNetS models (trained with the FlyingThings3D dataset) and measure their performance on the training set of KITTI stereo 2015 [6] at different resolutions. The results, in terms of the percentage of pixels with a disparity error greater than 3 pixels, or three-pixel error rate (3ER), are listed in Table 1. We see that as the input resolution increases, the performance first improves and then deteriorates. This is because:

(i) Upsampling the stereo pairs enables the model to perform stereo matching in a localized manner with subpixel accuracy. Hence, more details on the stereo pairs are taken into account for computation, leading to disparity estimates with extra high-frequency contents;

(ii) A finer-scale input translates to a smaller effective search range (or receptive field). As a CNN becomes too “short-sighted,” it lacks nonlocal information to estimate a proper disparity map, and its performance starts to decline.
This phenomenon, in which different results are observed with different input scales, is called scale diversity, akin to the concept of transmit diversity in communication [30]. We find that scale diversity also exists in other problems, e.g., optical flow estimation [15, 23] and image segmentation [22]; please refer to the supplementary material for more details.
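For reference, the three-pixel error rate used in Table 1 can be computed as sketched below, together with the end-point error (average absolute disparity error) that appears in Section 5. The function names and the optional `valid` mask (useful for sparse ground truth such as KITTI) are assumptions of this sketch; it follows the plain definition given in the text, i.e., the fraction of pixels whose error exceeds 3 pixels.

```python
import numpy as np

def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error over the valid pixels."""
    err = np.abs(pred - gt)
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    return float(err[valid].mean())

def three_pixel_error_rate(pred, gt, valid=None, thresh=3.0):
    """Fraction of valid pixels whose disparity error exceeds `thresh` pixels."""
    err = np.abs(pred - gt)
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    return float((err[valid] > thresh).mean())

gt = np.zeros((2, 2))
pred = np.array([[0.0, 1.0], [4.0, 10.0]])
print(end_point_error(pred, gt), three_pixel_error_rate(pred, gt))
```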
4 Zoom and Learn
To achieve effective self-adaptation, our approach, zoom and learn (ZOLE), fine-tunes a model pretrained with synthetic data. It iteratively suppresses generalization glitches while utilizing the benefits of scale diversity.
4.1 Graph Laplacian Regularization
Graph Laplacian regularization is employed in a wide range of image restoration literature, e.g., [4, 9, 24]. It has also proven effective for the recovery of piecewise smooth signals [14, 26, 33]. We adopt graph Laplacian regularization (on a patch-by-patch basis) to guide the learning of CNNs. Graph Laplacian regularization assumes the ground-truth signal $\mathbf{s}$ (in our case, a patch on the ground-truth disparity) is smooth with respect to a predefined graph $\mathcal{G}$ with $m$ vertices. Specifically, it imposes that the value of $\mathbf{s}^{\top}\mathbf{L}\mathbf{s}$, i.e., the graph Laplacian regularizer, should be small for the ground-truth patch $\mathbf{s}$, where $\mathbf{L}$ is the graph Laplacian matrix of graph $\mathcal{G}$. Given a disparity map produced by a deep stereo model, we compute the values of the graph Laplacian regularizers for the patches on it. The obtained values are summed up as a graph Laplacian regularization loss for CNN training.
For an effective regularization with graph Laplacian, it is critical to construct the graph properly. We employ the graph structure of [12, 26], which works well for disparity map denoising. For illustration, we first introduce the concept of exemplar patches. Exemplar patches are a set of $K$ patches, $\{\mathbf{f}_k\}_{k=1}^{K}$ where $\mathbf{f}_k \in \mathbb{R}^m$, that are statistically related to the ground-truth patch $\mathbf{s}$. For instance, an exemplar patch can be a rough estimate of $\mathbf{s}$, or the co-located patch on the left image, etc. Our choices of the exemplar patches will be presented in Section 4.2. With the exemplar patches, the edge weight $w_{ij}$ connecting pixel $i$ and pixel $j$ on patch $\mathbf{s}$ is given by
$$w_{ij} = \begin{cases} \exp\!\big(-d_{ij}^2 / (2\epsilon^2)\big), & |d_{ij}| \le \epsilon, \\ 0, & \text{otherwise}, \end{cases}$$
where $\epsilon$ is a threshold and $d_{ij}$ is a distance measure between pixel $i$ and pixel $j$. Hence, the resulting graph is an $\epsilon$-neighborhood graph, i.e., there is no edge connecting two pixels with a distance greater than $\epsilon$. We choose an individual value of $\epsilon$ for each patch, so that every vertex of the graph has at least 4 edges. The distance measure is defined as follows:
$$d_{ij}^2 = \sum_{k=1}^{K} \big(f_k(i) - f_k(j)\big)^2 + \alpha \cdot l_{ij}^2, \qquad (2)$$
where $f_k(i)$ and $f_k(j)$ denote the $i$-th and the $j$-th entries of $\mathbf{f}_k$, respectively, so the first term of (2) measures the Euclidean distance between pixels $i$ and $j$ in a $K$-dimensional space defined by the exemplar patches. $l_{ij}$ is simply the spatial distance (length) between pixels $i$ and $j$, and $\alpha$ is a constant weight, empirically set to a small value.
The adjacency matrix of $\mathcal{G}$ is denoted as $\mathbf{A}$, where the $(i,j)$-th entry of $\mathbf{A}$ is $w_{ij}$. The degree matrix of $\mathcal{G}$ is a diagonal matrix $\mathbf{D}$ whose $i$-th diagonal entry is $\sum_{j=1}^{m} w_{ij}$. Then the graph Laplacian is given by $\mathbf{L} = \mathbf{D} - \mathbf{A}$, leading to the graph Laplacian regularizer $\mathbf{s}^{\top}\mathbf{L}\mathbf{s}$. From the analysis of [26], the graph Laplacian regularizer is an adaptive metric. If the same edge (or gradient) pattern appears in the majority of the exemplar patches, minimizing the graph Laplacian regularizer promotes that very edge pattern; if the exemplar patches are inconsistent, graph Laplacian regularization leads to a smoothed patch. We exploit this property to guide a deep stereo model to selectively learn the desired details.
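To make the construction concrete, the following sketch builds a patch graph from a set of exemplar patches and evaluates the regularizer. It is an illustration under assumptions rather than the authors' implementation: the Gaussian kernel for the edge weights, the default `alpha=0.02`, the rule used to pick the threshold, and the names `patch_laplacian` and `laplacian_regularizer` are all choices made here for the example.

```python
import numpy as np

def patch_laplacian(exemplars, side, alpha=0.02, min_edges=4):
    """Graph Laplacian of a side-by-side patch from (K, m) exemplar patches.

    Assumptions of this sketch: Gaussian edge weights, and a threshold eps
    chosen as the smallest value giving every vertex at least `min_edges` edges.
    """
    _, m = exemplars.shape
    # Squared feature distance between every pair of pixels, summed over exemplars.
    diff = exemplars[:, :, None] - exemplars[:, None, :]
    feat = (diff ** 2).sum(axis=0)
    # Squared spatial distance between pixel positions (row-major ordering).
    ys, xs = np.divmod(np.arange(m), side)
    spatial = (ys[:, None] - ys[None, :]) ** 2 + (xs[:, None] - xs[None, :]) ** 2
    d = np.sqrt(feat + alpha * spatial)
    # Exclude self-loops, then pick eps from the min_edges-th nearest neighbour.
    d_no_self = d + np.diag(np.full(m, np.inf))
    eps = np.sort(d_no_self, axis=1)[:, min_edges - 1].max()
    W = np.where(d_no_self <= eps, np.exp(-d ** 2 / (2 * eps ** 2)), 0.0)
    return np.diag(W.sum(axis=1)) - W  # degree matrix minus adjacency matrix

def laplacian_regularizer(patch, L):
    """Value of s^T L s for a vectorized patch s."""
    return float(patch @ L @ patch)

side = 4
vertical_edge = np.tile(np.array([0.0, 0.0, 1.0, 1.0]), (side, 1)).ravel()
L = patch_laplacian(np.stack([vertical_edge] * 3), side)
horizontal_edge = vertical_edge.reshape(side, side).T.ravel()
print(laplacian_regularizer(vertical_edge, L))    # edge shared with the exemplars
print(laplacian_regularizer(horizontal_edge, L))  # edge absent from the exemplars
```

In this toy case all exemplars share a vertical edge, so a patch with the same edge incurs essentially no penalty while a patch with a different edge is penalized, which mirrors the adaptive behavior described above.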
4.2 Training by Iterative Regularization
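We borrow the notion of iterative regularization [24] for generalizing deep stereo models to novel domains, giving rise to the proposed zoom and learn approach. Suppose we have a deep stereo model $\mathcal{S}(\cdot\,; \Theta^{(0)})$ (parameterized by $\Theta^{(0)}$) pretrained with synthetic data. We also have a set of $N$ stereo pairs, $\{\mathcal{P}_n\}_{n=1}^{N}$, where the first $N_{\rm dom}$ of them are real stereo pairs of the target domain while the remaining $N - N_{\rm dom}$ pairs are synthetic data; only the synthetic pairs have ground-truth disparities ($D_n$ for $N_{\rm dom} < n \le N$).
We solve for a new set of model parameters $\Theta^{(k+1)}$ at iteration $k$. For a constant $r > 1$, we first create a set of “ground-truths” for the $N_{\rm dom}$ real stereo pairs by zooming (upsampling), i.e.,
$$D_n^{(k)} = \frac{1}{r} \downarrow\big(\mathcal{S}(\uparrow(\mathcal{P}_n, r); \Theta^{(k)}), r\big), \quad 1 \le n \le N_{\rm dom}, \qquad (3)$$
where $\uparrow(\cdot, r)$ and $\downarrow(\cdot, r)$ denote upsampling and downsampling by $r$ times, as in (1). From Section 3.2, $D_n^{(k)}$ contains more details than $\mathcal{S}(\mathcal{P}_n; \Theta^{(k)})$. We divide a disparity map into $M$ square patches tiling it, where each patch is a vector of length $m$. The vectorization operator is denoted as $\mathrm{vec}(\cdot)$, so that $\mathrm{vec}(D_n^{(k)}) \in \mathbb{R}^{Mm}$. The $m$-by-$Mm$ matrix extracting the $i$-th patch from a vectorized disparity map is denoted as $\mathbf{R}_i$. With these settings, we formulate the following iterative optimization problem,
$$\Theta^{(k+1)} = \arg\min_{\Theta} \sum_{n=1}^{N_{\rm dom}} \sum_{i=1}^{M} \Big( \big\|\mathbf{s}_{n,i} - \mathbf{d}_{n,i}^{(k)}\big\|_1 + \lambda\, \mathbf{s}_{n,i}^{\top} \mathbf{L}_{n,i}^{(k)} \mathbf{s}_{n,i} \Big) + \tau \sum_{n=N_{\rm dom}+1}^{N} \big\|\mathcal{S}(\mathcal{P}_n; \Theta) - D_n\big\|_1. \qquad (4)$$
Here $\mathbf{s}_{n,i} = \mathbf{R}_i\,\mathrm{vec}(\mathcal{S}(\mathcal{P}_n; \Theta))$ and $\mathbf{d}_{n,i}^{(k)} = \mathbf{R}_i\,\mathrm{vec}(D_n^{(k)})$ are the $i$-th patches of $\mathcal{S}(\mathcal{P}_n; \Theta)$ and $D_n^{(k)}$, respectively, and $\lambda$ and $\tau$ are positive constants. Our optimization problem first minimizes over each patch on the $N_{\rm dom}$ target-domain stereo pairs: the first term (data term) drives $\mathbf{s}_{n,i}$ to be similar to $\mathbf{d}_{n,i}^{(k)}$, and the second term (smoothness term) is a graph Laplacian regularizer induced from the matrix $\mathbf{L}_{n,i}^{(k)}$. The third term of (4) lets $\Theta^{(k+1)}$ be a feasible deep stereo model; it literally means that a deep stereo model that works well for the target domain should also have reasonable performance on the synthetic data.
At iteration $k$, a graph $\mathcal{G}_{n,i}^{(k)}$ ($1 \le n \le N_{\rm dom}$, $1 \le i \le M$), and hence the corresponding graph Laplacian $\mathbf{L}_{n,i}^{(k)}$, are precomputed for calculating the loss. We choose three exemplar patches for building $\mathcal{G}_{n,i}^{(k)}$: the $i$-th patches of the left image of $\mathcal{P}_n$, the current prediction $\mathcal{S}(\mathcal{P}_n; \Theta^{(k)})$, and the finer-grain prediction $D_n^{(k)}$ of (3), each scaled by a constant.
Our chosen exemplar patches lead to a graph Laplacian regularizer that discriminatively retains the desired details from the finer-grain prediction whilst smoothing out possible artifacts on both the current and the finer-grain predictions. We analyze how the three exemplar patches affect the behavior of the graph Laplacian:

(i) Suppose a desired object boundary does not appear in the current predicted patch but has appeared in the finer-grain patch by virtue of scale diversity (Section 3.2). Then the boundary should also appear in the left-image patch; otherwise the CNN could not have generated it in the finer-grain patch. In this case, both the left-image patch and the finer-grain patch contain the boundary, resulting in a Laplacian that promotes the boundary on the output patch.

(ii) Suppose that, due to generalization glitches, an undesired pattern is produced in one exemplar patch, either the current prediction or the finer-grain prediction. Since the pattern is absent in the other exemplar patches, the corresponding graph Laplacian penalizes it on the output patch.
Hence, our graph Laplacian regularizer guides the CNN to only learn the meaningful details.
4.3 Practical Algorithm
Iteratively solving the optimization problem (4) can be achieved by training the model with standard backpropagation [20]. We hereby present how to use the proposed formulation for fine-tuning a pretrained model in practice. Since a disparity map is tiled by $M$ non-overlapping patches, $\sum_{i=1}^{M} \|\mathbf{s}_{n,i} - \mathbf{d}_{n,i}^{(k)}\|_1 = \|\mathcal{S}(\mathcal{P}_n; \Theta) - D_n^{(k)}\|_1$, so the first term in (4) equals $\sum_{n=1}^{N_{\rm dom}} \|\mathcal{S}(\mathcal{P}_n; \Theta) - D_n^{(k)}\|_1$. Hence, the objective of (4) can be rewritten as:
$$\sum_{n=1}^{N_{\rm dom}} \big\|\mathcal{S}(\mathcal{P}_n; \Theta) - D_n^{(k)}\big\|_1 + \tau \sum_{n=N_{\rm dom}+1}^{N} \big\|\mathcal{S}(\mathcal{P}_n; \Theta) - D_n\big\|_1 + \lambda \sum_{n=1}^{N_{\rm dom}} \sum_{i=1}^{M} \mathbf{s}_{n,i}^{\top} \mathbf{L}_{n,i}^{(k)} \mathbf{s}_{n,i}. \qquad (5)$$
We see that the first two terms of (5) are simply L1 losses with different weightings for the target domain and the synthetic data. The third term is the proposed graph Laplacian regularization loss; we discuss its backpropagation in the supplementary material.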
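The loss terms described above can be sketched for a single training example as follows. This is a minimal illustration, not the authors' Caffe layer: `tile_patches`, `target_domain_loss` and `synthetic_loss` are hypothetical names, the per-patch Laplacians are assumed to be precomputed and passed in as a list, and the weights `lam` and `tau` are free parameters of the sketch.

```python
import numpy as np

def tile_patches(disp, side):
    """Split an (H, W) map into non-overlapping side-by-side patches, one per row."""
    h, w = disp.shape
    assert h % side == 0 and w % side == 0, "map must be tiled exactly"
    tiles = disp.reshape(h // side, side, w // side, side).swapaxes(1, 2)
    return tiles.reshape(-1, side * side)

def target_domain_loss(pred, finer, laplacians, side, lam):
    """L1 data term against the finer-grain estimate plus the Laplacian term."""
    l1 = np.abs(pred - finer).sum()
    patches = tile_patches(pred, side)
    reg = sum(float(s @ L @ s) for s, L in zip(patches, laplacians))
    return l1 + lam * reg

def synthetic_loss(pred, gt, tau):
    """Weighted L1 loss for a synthetic pair with known ground truth."""
    return tau * np.abs(pred - gt).sum()

rng = np.random.default_rng(0)
pred, finer = rng.random((4, 4)), rng.random((4, 4))
# Because the patches tile the map, the patch-wise L1 sum equals the map-wise L1.
patchwise = np.abs(tile_patches(pred, 2) - tile_patches(finer, 2)).sum()
print(np.isclose(patchwise, np.abs(pred - finer).sum()))
```

The final check illustrates why the data term of the patch-wise formulation reduces to an ordinary L1 loss over the whole disparity map.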
In general, there are a lot of training examples, yet in practice every training iteration can only take in a small batch of training examples and perform stochastic gradient descent. As a result, we shuffle all the stereo pairs and sequentially take a batch of them for the current training iteration. For a synthetic stereo pair in the batch, we directly use its L1 loss for backpropagation since its ground-truth is known. Otherwise, for a stereo pair of the target domain in the batch, we first feed its upsampled version to the CNN to compute the finer-grain “ground-truth”; we also compute the current estimate and hence the graph Laplacian matrices for each patch. With the finer-grain “ground-truth” and the precomputed graph Laplacians, both the L1 loss and the graph Laplacian regularization loss are employed for backpropagation.
At regular intervals during training, we perform a validation procedure with left-right consistency, using another set of stereo pairs in the target domain. We first estimate the disparity maps with the up-to-date model, then synthesize left images with the estimated disparity maps and the right images. Then we compute the peak signal-to-noise ratios (PSNRs) between the synthesized left images and the genuine ones. The average PSNR reflects the performance of the current model. During the training process, we keep track of the best PSNR value and its corresponding model. After a preset number of training iterations, we terminate the training and output the best model. Algorithm 1 summarizes the key steps of our self-adaptation approach.
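The validation step can be illustrated with a small warping routine. The sketch below assumes single-channel images, the usual rectified-stereo convention that a left pixel at column x corresponds to the right pixel at column x minus the disparity, and linear interpolation along the scanline; `synthesize_left` and `psnr` are hypothetical helper names.

```python
import numpy as np

def synthesize_left(right, disp):
    """Warp the right view to the left view using the left-view disparity map."""
    h, w = disp.shape
    src = np.clip(np.arange(w)[None, :] - disp, 0, w - 1)  # source columns in the right view
    x0 = np.floor(src).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    a = src - x0                                            # linear interpolation weight
    rows = np.arange(h)[:, None]
    return (1 - a) * right[rows, x0] + a * right[rows, x1]

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB for images with intensities in [0, peak]."""
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    return float("inf") if mse == 0 else float(10 * np.log10(peak ** 2 / mse))

rng = np.random.default_rng(0)
left = rng.uniform(0, 255, size=(6, 12))
right = np.roll(left, -2, axis=1)          # toy pair with a constant disparity of 2
good = synthesize_left(right, np.full((6, 12), 2.0))
bad = synthesize_left(right, np.zeros((6, 12)))
print(psnr(good, left) > psnr(bad, left))
```

A model whose disparities are closer to the truth yields a synthesized left image closer to the genuine one, so the average PSNR can serve as a proxy for model selection when no ground-truth disparity is available.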
5 Experimentation
In this section, we generalize deep stereo matching for two different domains in the real world: daily scenes captured by smartphone cameras, and street views from the perspective of a driving car (the KITTI dataset [6]). We again choose the representative DispNetC [23] architecture for our experiments.
5.1 Daily Scenes from Smartphones
Recently, many companies (e.g., Apple, Samsung) have equipped their smartphones with two rear-facing cameras. Stereo pairs collected by these cameras have small disparity and are possibly contaminated by noise due to the small area of their image sensors. With two views of the same scene, stereo matching is applied to estimate a dense disparity map for subsequent applications, e.g., synthetic bokeh [2] and segmentation [22].
We aim at generalizing the released DispNetC model (pretrained with the FlyingThings3D dataset [23]) for daily scenes captured by smartphone cameras. For this purpose, we used various models of smartphones to collect separate sets of stereo pairs for training, validation, and testing. These stereo pairs contain daily scenes like human portraits and objects taken in various indoor and outdoor environments (e.g., library, office, playground, park). All the collected images are rectified and resized to a common resolution; their ground-truth disparity maps are unknown. Besides, we use the FlyingThings3D dataset for synthetic training examples in our method, and they are resized as well. Since their original size is 960 × 540, their disparity maps need to be rescaled by a factor of 0.8. To cater for the small disparity values of the smartphone data, we only keep those synthetic examples with maximum disparity no greater than 80 after rescaling, leading to 9619 available examples. Among them, a subset is used for training and the rest are withheld for testing. We call this set of data FlyingThings3D80. In our experiments, all stereo pairs have intensities ranging from 0 to 255.
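The rescaling and filtering of the synthetic disparity maps can be sketched as below. The helper names are hypothetical, nearest-neighbour resampling is an assumption, and the key point is only that disparity values, being horizontal pixel offsets, must be multiplied by the horizontal resize factor (0.8 in the setting above) before the threshold of 80 is applied.

```python
import numpy as np

def resize_disparity(disp, out_h, out_w):
    """Resize a disparity map and rescale its values by the horizontal factor."""
    in_h, in_w = disp.shape
    rows = np.minimum((np.arange(out_h) * in_h / out_h).astype(int), in_h - 1)
    cols = np.minimum((np.arange(out_w) * in_w / out_w).astype(int), in_w - 1)
    # Disparity is a horizontal offset in pixels, so it scales with the width only.
    return disp[rows][:, cols] * (out_w / in_w)

def keep_example(disp, max_disp=80.0):
    """Keep a synthetic example only if its largest disparity is within the limit."""
    return bool(disp.max() <= max_disp)

disp = np.full((10, 20), 50.0)           # toy map; width shrinks by a factor of 0.8
small = resize_disparity(disp, 8, 16)
print(small.shape, keep_example(small))
```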
Dataset           Metric  Tonioni [34]  DispNetC  DispNetC80  ZOLES   ZOLE
Smartphone        PSNR    22.92         21.99     22.39       22.84   23.12
Smartphone        SSIM    0.845         0.790     0.817       0.851   0.855
FlyingThings3D80  EPE     1.08          1.03      0.93        1.10    1.11
FlyingThings3D80  3ER     6.79%         5.63%     5.11%       6.88%   6.54%
The Caffe framework [16] is employed to implement our method. During training, we randomly crop the images before passing them to the CNN, and use a fixed patch size for building the graphs. We modify the L1 loss layer of [23] to capture the first two terms of (5): for a synthetic pair, its L1 loss is weighted by 1.2; otherwise the weight is 1. The remaining constants are set empirically; all the computed graph Laplacian regularizers are averaged and then weighted by 1.5 to form a loss (the third term in (5)). We have tried different upsampling ratios ranging from 1.2 to 2 for computing the finer-grain “ground-truths”, and found that the obtained CNNs have similar performance; our experiments use one fixed ratio. Data augmentation is introduced to the synthetic stereo pairs. For each individual view in a synthetic pair, Gaussian noise is randomly added. The brightness of each image channel is also randomly adjusted. We let the batch size be 6 and fine-tune the model, performing validation every 500 iterations.
We first study the following models:

ZOLE: Generalize the pretrained model for smartphone stereo pairs with our method;

ZOLES: Remove graph regularization and simply let the CNN iteratively learn its own finergrain outputs;

DispNetC80: Fine-tune the pretrained model on the FlyingThings3D80 examples;

DispNetC: Released model pretrained with FlyingThings3D [23].
The very recent method [34] by Tonioni et al. also fine-tunes a pretrained model using stereo pairs from the target domain. They first estimate disparity maps for the target domain with AD-CENSUS [36]. To fine-tune the model, they treat the obtained disparity maps as “ground-truths” while taking a confidence measure [29] into account. For comparison, we fine-tune a model with their released code under their recommended settings.
Since the stereo pairs of smartphones do not have ground-truth disparities, we evaluate the performance of a model in a way similar to the validation process presented in Section 4.3. We synthesize the left images with the estimated disparities and the right images, then measure the difference between the synthesized left images and the genuine ones, using both PSNR and SSIM as the difference metrics. For testing or validation, all the stereo pairs are fed to the CNN at a fixed resolution. Figure 3 plots the performance of ZOLE, ZOLES and DispNetC80 on the validation set of the smartphone data during training (measured in terms of average PSNR of the synthesized left images). Besides, Table 2 presents the performance of all the aforementioned models on the test sets of both the smartphone data and FlyingThings3D80. We use end-point error (EPE) and three-pixel error rate (3ER) as the evaluation metrics for the FlyingThings3D80 dataset. Compared to the models trained only with the synthetic data (DispNetC and DispNetC80), the one obtained with our method (ZOLE) achieves the best PSNR and SSIM performance. Figure 4 shows visual comparisons of four models on the test sets of the smartphone data. One can clearly see that our approach leads to smooth disparities with very sharp details, while disparity maps produced by other models may be blurry or contain artifacts.
Metric  Tonioni [34]  DispNetC  ZOLES   ZOLE
EPE     1.27          1.64      1.34    1.25
3ER     7.06%         11.41%    7.56%   6.76%
5.2 Driving Scenes of KITTI
Our self-adaptation method is also applied to generalize the pretrained DispNetC model to the KITTI stereo 2015 dataset [6], which contains dynamic street views from the perspective of a driving car. The KITTI stereo 2015 dataset has 800 stereo pairs. Among them, 200 examples have publicly available (sparse) ground-truth disparity maps. They are employed for testing, while the remaining 600 pairs are used for validation. For training, we first randomly gather stereo pairs from the FlyingThings3D dataset. Since the KITTI 3D object 2017 dataset [6] has more than 10k stereo pairs of the same characteristics as KITTI stereo 2015, we also randomly pick stereo pairs from it for training. During training, we adopt settings similar to those presented in Section 5.1. However, in this scenario, all images are resized and then randomly cropped before being passed to the CNN for training.
We hereby compare our approach, ZOLE, with models obtained with ZOLES and [34], while the original DispNetC model is adopted as a baseline. For a fair comparison, all the images are resized to the same resolution before being fed to the network. Table 3 presents the objective metrics of ZOLE, along with those of the competing methods. We see that our method has the best objective performance, while the method of Tonioni et al. also provides a reasonable gain. Figure 5 shows several fragments of the resulting disparity images. One can see that our method provides accurate edges even for very fine details.
More results and discussions are provided in the supplementary material. Our method is essentially different from those deep stereo algorithms relying on left-right consistency for backpropagation [38, 39]. Hence, it is possible to combine our rationale, namely discriminatively learning from the finer-grain outputs, with these methods to achieve further improvements. Moreover, the same rationale can be applied to other pixel-wise regression/classification problems, e.g., optical flow estimation [15, 23] and segmentation [22]. We leave these research directions for future exploration.
6 Conclusion
Due to the scarcity of ground-truth data, it is difficult to generalize a pretrained deep stereo model to a novel domain. To tackle this problem, we propose a self-adaptation approach for CNN training without ground-truth disparity maps of the target domain. We first observe and analyze two phenomena, namely, generalization glitches and scale diversity. To exploit scale diversity while avoiding generalization glitches, we let the model learn from its own finer-grain output, while a graph Laplacian regularization is imposed to selectively keep the desired edges and smooth out the artifacts. We call our method zoom and learn, or ZOLE for short. It is applied to two domains: daily scenes collected by smartphone cameras and street views captured from the perspective of a driving car.
References
 [1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for scene segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [2] J. T. Barron, A. Adams, Y. Shih, and C. Hernández. Fast bilateral-space stereo for synthetic defocus. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4466–4474, 2015.
 [3] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pages 611–625. Springer, 2012.
 [4] A. Elmoataz, O. Lezoray, and S. Bougleux. Nonlocal discrete regularization on weighted graphs: A framework for image and manifold processing. IEEE Transactions on Image Processing, 17(7):1047–1060, 2008.
 [5] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4340–4349, 2016.
 [6] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361. IEEE, 2012.
 [7] A. Geiger, J. Ziegler, and C. Stiller. Stereoscan: Dense 3d reconstruction in real-time. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pages 963–968. IEEE, 2011.
 [8] S. Gidaris and N. Komodakis. Detect, replace, refine: Deep structured prediction for pixel wise labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5248–5257, 2017.
 [9] G. Gilboa and S. Osher. Nonlocal linear image regularization and supervised segmentation. Multiscale Modeling & Simulation, 6(2):595–630, 2007.
 [10] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [11] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3279–3286, 2015.

 [12] M. Hein, J.-Y. Audibert, and U. v. Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8(Jun):1325–1368, 2007.
 [13] H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, 2008.
 [14] W. Hu, G. Cheung, and M. Kazui. Graph-based dequantization of block-compressed piecewise smooth images. IEEE Signal Processing Letters, 23(2):242–246, 2016.
 [15] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
 [17] A. K. Katsaggelos. Iterative image restoration algorithms. Optical Engineering, 28(7):735–748, 1989.
 [18] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [19] Y. Kuznietsov, J. Stückler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [20] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 [21] X. Liu, G. Cheung, X. Wu, and D. Zhao. Random walk graph Laplacian-based smoothness prior for soft decoding of JPEG images. IEEE Transactions on Image Processing, 26(2):509–524, 2017.
 [22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
 [23] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4040–4048, 2016.
 [24] P. Milanfar. A tour of modern image filtering: New insights and methods, both practical and theoretical. IEEE Signal Processing Magazine, 30(1):106–128, 2013.
 [25] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation, 4(2):460–489, 2005.
 [26] J. Pang and G. Cheung. Graph Laplacian regularization for image denoising: Analysis in the continuous domain. IEEE Transactions on Image Processing, 26(4):1770–1785, 2017.
 [27] J. Pang, G. Cheung, A. Ortega, and O. C. Au. Optimal graph Laplacian regularization for natural image denoising. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2294–2298. IEEE, 2015.
 [28] J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In ICCV Workshop on Geometry Meets Deep Learning, Oct 2017.
 [29] M. Poggi and S. Mattoccia. Learning from scratch a confidence measure. In BMVC, 2016.
 [30] T. S. Rappaport et al. Wireless communications: Principles and practice, volume 2. Prentice Hall PTR, New Jersey, 1996.
 [31] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
 [32] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition, pages 31–42. Springer, 2014.

 [33] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
 [34] A. Tonioni, M. Poggi, S. Mattoccia, and L. Di Stefano. Unsupervised adaptation for deep stereo. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [35] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [36] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence. In European Conference on Computer Vision, pages 151–158. Springer, 1994.
 [37] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(132):2, 2016.
 [38] Y. Zhong, Y. Dai, and H. Li. Self-supervised learning for stereo matching with self-improving ability. arXiv preprint arXiv:1709.00930, 2017.
 [39] C. Zhou, H. Zhang, X. Shen, and J. Jia. Unsupervised learning of stereo matching. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.