The use of deep convolutional networks has recently advanced the accuracy of stereo matching algorithms considerably [1, 2, 3, 4]. This improvement has been facilitated by the emergence of sizeable training sets, such as the KITTI datasets for autonomous driving, the new version of the Middlebury dataset, and, most recently, high-quality synthetic datasets. The use of machine learning makes it possible to tune the stereo matching process to characteristic image patterns, resolving various stereo ambiguities with semantic cues and surpassing the accuracy of more traditional approaches that use low-level cues and priors.
Despite this success of deep learning methods in stereo, designing real-time algorithms, as required by the majority of applications, has proved challenging. The initial approach required over a minute to process a KITTI stereo pair. A more recent ("fast") variant discussed in the follow-up work, and a similar approach, have brought this time down to about one second, which is still excessive for many applications.
The computational bottleneck of methods [1, 3, 2] is the matching of high-dimensional descriptors of local appearance, which has to be performed for all pairs of potentially matching pixels across the two views. The most recent method sidesteps this bottleneck by proposing a deep architecture that directly outputs the disparity map given the stereo pair as input. While high-dimensional descriptors still have to be implicitly matched within that architecture, the matching happens only at low resolution, and subsequent processing performs efficient upsampling.
Here, we propose a new way to apply deep learning to improve the accuracy of stereo matching. To achieve real-time frame rates, we avoid learning and matching high-dimensional descriptors, and instead focus the learning-based effort on the cost-aggregation process. We thus use a simple linear combination of two classical and very fast similarity measures, the census transform and sum-of-absolute-differences matching, to define the overall matching costs for various pixels and disparities.
To perform cost aggregation, we smooth the obtained noisy matching costs using one of the fastest edge-preserving smoothing techniques, namely the domain transform [9, 10], applied across four directions. Crucially, we make the parameters of this cost-aggregation process spatially varying and use a deep convolutional network to predict them on a per-pixel basis. Such prediction facilitates smoothing across parts belonging to the same object and prevents smoothing across object boundaries. At test time, the deep learning module processes only one of the input images, so its complexity is independent of the disparity range.
Our experiments demonstrate that the combination of a simple matching process and trainable domain-transform-based cost aggregation achieves a unique combination of a high frame rate (e.g. 29 fps on the KITTI 2015 dataset) and high matching accuracy (state-of-the-art for real-time methods). The high accuracy is obtained via an end-to-end learning process that takes into account the pixel-level matching, the cost aggregation, and the final winner-takes-all disparity selection. The ultimate accuracy greatly benefits from initializing the weights of our deep network with those of a network trained to detect natural boundaries in images.
II Related work
Our work is related to a large body of work on fast stereo matching that investigates the use of efficient algorithms, such as a few rounds of message passing, bilateral filtering, guided filtering, or the domain transform, to achieve smoothness in the reconstructed depth maps. The domain transform used in our approach can be regarded as a fast approximation to bilateral filtering and is overall the fastest of the global aggregation methods employed within this class of stereo methods.
Our approach has been inspired by recent work that establishes the connection between the domain transform and gated recurrent neural networks (such as LSTM and GRU) and then uses this connection to discriminatively train domain transform parameters for the task of semantic segmentation. Our approach is also reminiscent of other works on semantic segmentation that draw connections between recurrent neural networks and conditional random fields, and use these connections to impose spatial smoothness. Very recently (and concurrently with our work), several groups have considered the use of learnable edge-aware smoothing for several image processing operations, including post-processing of depth maps [20, 21].
As discussed above, our work is also related to preceding approaches that use deep learning for stereo. Our approach differs markedly from [1, 3, 2] in that we use deep learning within the cost aggregation rather than to compute the matching costs themselves. Unlike [1, 3, 2], we also use end-to-end learning that encompasses all stages of depth-map computation within our method. Unlike the approach that trains a rather generic feed-forward convolutional network on a massive amount of synthetic stereo pairs, our method employs classical stereo matching algorithms, such as the census transform, as modules within a more specific architecture that combines convolutional networks with a gated recurrent neural network module (equivalent to the domain transform operation).
We consider the dense stereo correspondence problem, where a rectified pair of images $I^L$ and $I^R$ (the left and right views, respectively) is given as input, and the goal is to assign each pixel in the left image a disparity label from the set $\{0, 1, \ldots, d_{\max}\}$, where $d_{\max}$ is the maximum allowed disparity. Our approach consists of three steps: constructing the cost volume, cost-volume aggregation, and winner-takes-all label selection. As post-processing, we also apply a left-right consistency check to identify and fill in the occluded parts. We now discuss these steps in detail.
III-A Computing stereo-matching costs
In our method we operate on a cost volume explicitly stored as a three-dimensional array with dimensions $W \times H \times (d_{\max}+1)$, where $W$ and $H$ are the image width and height.
Following the pipeline of a typical local stereo method, we compute a stereo matching cost based on two terms. The first term is based on the sum of absolute differences (SAD). We use 1x1 patches (i.e. individual pixels) to preserve the maximum amount of texture information, and rely on the subsequent cost-aggregation scheme for smoothing:
$$C_{\mathrm{SAD}}(x, y, d) = \sum_{c} \left| I^L_c(x, y) - I^R_c(x - d, y) \right|,$$
where the sum runs over the color channels $c$.
The second term is based on matching local census transform descriptors; the census transform is a non-parametric local transform that relies on the relative order of intensity values. We convert the images to gray-scale to compute this term. The local image structure at each patch is summarized by binary bits: each bit is set according to the comparison of the patch's central pixel intensity with one of the remaining pixels of the patch. The obtained descriptors are then matched with the Hamming distance when computing the stereo matching costs. The census transform computation maps naturally onto a data-parallel pipeline, making it very efficient on modern GPU architectures: each thread is assigned to a single pixel and runs in constant time. Finally, we combine the two terms into the final matching cost:
$$C(x, y, d) = C_{\mathrm{census}}(x, y, d) + \lambda\, C_{\mathrm{SAD}}(x, y, d),$$
where $\lambda$ is a constant coefficient that controls the ratio between the two cost values.
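The two-term cost above can be sketched in NumPy as follows. This is an illustrative implementation, not our CUDA kernels: the 5x5 census window, the mixing coefficient `lam`, and the use of `np.roll` for the horizontal shift (which ignores border handling) are all simplifying assumptions.

```python
import numpy as np

def census_transform(gray, win=2):
    """Census transform over a (2*win+1)^2 window: each bit records
    whether a neighbour is darker than the central pixel."""
    bits = []
    for dy in range(-win, win + 1):
        for dx in range(-win, win + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(gray, dy, axis=0), dx, axis=1)
            bits.append(shifted < gray)
    return np.stack(bits, axis=-1)  # H x W x 24 boolean descriptor

def matching_cost(left, right, gray_l, gray_r, d_max, lam=0.3):
    """Cost volume C[y, x, d] = census Hamming distance + lam * SAD,
    with SAD computed on 1x1 patches (individual pixels)."""
    H, W, _ = left.shape
    cl, cr = census_transform(gray_l), census_transform(gray_r)
    cost = np.zeros((H, W, d_max + 1), dtype=np.float32)
    for d in range(d_max + 1):
        r_shift = np.roll(right, d, axis=1)   # approximates I_R(x - d, y)
        cr_shift = np.roll(cr, d, axis=1)
        sad = np.abs(left - r_shift).sum(axis=-1)  # sum over channels
        ham = (cl != cr_shift).sum(axis=-1)        # Hamming distance
        cost[:, :, d] = ham + lam * sad
    return cost
```

A production version would compute one thread per pixel on the GPU, as described above, rather than looping over disparities on the CPU.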
III-B Spatial boundary detector
Similarly to prior work, we perform cost-volume aggregation using a spatially varying smoothing process. We rely on a deep convolutional network and end-to-end training to set the smoothing in an optimal and problem-specific way. Our task thus goes beyond simple edge detection, since not all image edges are aligned with disparity transitions; supervised training lets the network learn which edges matter, increasing the accuracy of the method.
Still, because of the inherent connection between object boundaries and depth discontinuities, our method for computing the smoothing weights builds on a CNN-based architecture for object boundary detection, which is embedded into the end-to-end learning process. That method addresses the challenging ambiguity in edge and object boundary detection by learning a rich hierarchical representation. It combines the edges detected at different image scales into a final edge map; to achieve this combination, the architecture includes skip connections merging edges predicted at multiple scales (see fig. 1). The final loss is thus informed by edge predictions from five scales.
Since our stereo method requires the prediction of two edge maps (see the following sections for details), we slightly modify the original architecture: the number of feature maps at each side output is increased from one to eight, and each is further upsampled, so that the last 1x1 convolutional layer produces a two-channel output.
III-C Smoothing with a domain transform
Our cost-volume aggregation scheme is based on the domain transform method, originally proposed for edge-preserving image filtering and further used in a neural network pipeline in the context of semantic segmentation.
The domain transform was originally introduced to perform fast edge-aware filtering of an image guided by gradient information, and can be regarded as an instance of a fast bilateral filter. It takes two inputs: (1) the input signal $x$, and (2) a map of weights $w$. The output of the domain transform is a filtered signal $y$. For a 1-D signal of length $N$, the output is given by the following recursive relation: after setting $y_1 = x_1$,
$$y_i = (1 - w_i)\, x_i + w_i\, y_{i-1}, \quad i = 2, \ldots, N. \qquad (3)$$
The set of weights $w$ controls the amount of smoothing along the signal, yielding a way to preserve edges by controlling the magnitude of $w_i$. Indeed, in regions where $w_i$ is close to one, the maximum amount of information propagates from the previous output $y_{i-1}$ to the current output $y_i$. On the contrary, when $w_i$ is small (e.g. in regions of large signal gradient), the output equals the input, i.e. $y_i = x_i$.
Domain transform filtering for 2-D images works in a separable way, using 1-D filtering sequentially along each dimension: a horizontal pass (left-to-right and right-to-left) along each row is followed by a vertical pass (top-to-bottom and bottom-to-top) along each column. For the reasons described in subsection III-D, we use distinct weight maps $w^h$ and $w^v$ for the horizontal and vertical passes, respectively. We denote by $\mathrm{DT}(I, w^h, w^v)$ the 2-D domain transform that takes an image $I$ and the two weight maps and computes the filtered image.
The procedure can be formally described as a sequence of four recursive passes (left-right, right-left, top-bottom, bottom-top), computed in that order.
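The four passes above can be sketched as follows. This is a minimal reference implementation of recursion (3) applied separably in both directions; the exact alignment of weights in the reversed passes and at image borders is a simplifying assumption relative to the full formulation.

```python
import numpy as np

def dt_1d(x, w):
    """One left-to-right pass of the 1-D domain transform:
    y[0] = x[0]; y[i] = (1 - w[i]) * x[i] + w[i] * y[i-1]."""
    y = x.copy()
    for i in range(1, len(x)):
        y[i] = (1.0 - w[i]) * x[i] + w[i] * y[i - 1]
    return y

def domain_transform_2d(img, w_h, w_v):
    """Four recursive passes (left-right, right-left, top-bottom,
    bottom-top) with separate horizontal/vertical weight maps."""
    out = img.astype(np.float64).copy()
    H, W = out.shape
    for y in range(H):  # horizontal passes along each row
        out[y] = dt_1d(out[y], w_h[y])
        out[y] = dt_1d(out[y][::-1], w_h[y][::-1])[::-1]
    for x in range(W):  # vertical passes along each column
        out[:, x] = dt_1d(out[:, x], w_v[:, x])
        out[:, x] = dt_1d(out[:, x][::-1], w_v[:, x][::-1])[::-1]
    return out
```

With all weights equal to zero the filter is the identity, and a constant image is a fixed point for any weights, which matches the interpretation of $w_i$ given above.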
To achieve our goal, the weights must be predicted from the input image. A piecewise-differentiable version of the domain transform has been proposed previously in order to backpropagate errors within a semantic segmentation pipeline; we extend this approach to the task of cost-volume filtering as described in the following section.
To explain how backpropagation works for the 1-D filtering process of (3), assume the output $y$ is given as input to a subsequent layer, so that each sample $y_i$ of the output signal receives a gradient contribution $g_i = \partial L / \partial y_i$. To compute the gradients of the inputs, we unroll the recurrence (3) in reverse order, i.e. for $i = N, \ldots, 2$:
$$\frac{\partial L}{\partial x_i} = (1 - w_i)\, g_i, \qquad \frac{\partial L}{\partial w_i} = (y_{i-1} - x_i)\, g_i, \qquad g_{i-1} \leftarrow g_{i-1} + w_i\, g_i.$$
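The reverse unrolling can be sketched directly. This is a hand-derived backward pass for recursion (3), shown for illustration; an actual implementation would live inside the training framework's autodiff machinery.

```python
import numpy as np

def dt_1d(x, w):
    """Forward 1-D domain transform recursion (3)."""
    y = x.copy()
    for i in range(1, len(x)):
        y[i] = (1.0 - w[i]) * x[i] + w[i] * y[i - 1]
    return y

def dt_1d_backward(x, y, w, grad_y):
    """Backpropagation through (3) by unrolling in reverse order:
    each y[i] contributes to x[i], w[i], and (recursively) y[i-1]."""
    g = grad_y.astype(np.float64).copy()  # accumulated dL/dy[i]
    grad_x = np.zeros_like(g)
    grad_w = np.zeros_like(g)
    for i in range(len(x) - 1, 0, -1):
        grad_x[i] += (1.0 - w[i]) * g[i]
        grad_w[i] += (y[i - 1] - x[i]) * g[i]
        g[i - 1] += w[i] * g[i]           # gradient flows back to y[i-1]
    grad_x[0] += g[0]                     # since y[0] = x[0]
    return grad_x, grad_w
```

For the small example $x = (1, 2, 3)$, $w = (0, 0.5, 0.25)$ and loss $L = \sum_i y_i$, the returned gradients agree with the analytic derivatives of the unrolled recursion.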
Thus, the four passes of the domain transform can be combined into a learning pipeline where the recursive relation (3) can be considered an instance of a gated recurrent unit. In fact, there is a precise connection to the GRU, which was recently proposed for modelling sequential data: the weight $w_i$ is related to the GRU's "update gate", while the input $x_i$ plays the role of the "candidate activation".
III-D Cost-volume filtering
We now describe how the domain transform is embedded into a machine learning pipeline in which the filter weights are predicted by the convolutional neural network.
The overall scheme of our algorithm is given in Figure 1. The cost-volume filtering is performed as four directional passes of the domain transform. To apply the two-dimensional domain transform weights to the three-dimensional array, we simply replicate the 2-D edge maps for each slice of the cost tensor: for each slice $C_d$ of the cost volume, $d \in \{0, \ldots, d_{\max}\}$, the domain transform is computed independently.
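The per-slice filtering and the subsequent label selection can be sketched as below. The 2-D domain transform is taken as a callable argument (`dt_2d`) so the sketch stays self-contained; the shared weight maps are simply reused for every disparity slice, as described above.

```python
import numpy as np

def filter_cost_volume(cost, w_h, w_v, dt_2d):
    """Apply the 2-D domain transform independently to every disparity
    slice C[:, :, d], replicating the same edge-derived weight maps."""
    out = np.empty_like(cost)
    for d in range(cost.shape[-1]):
        out[:, :, d] = dt_2d(cost[:, :, d], w_h, w_v)
    return out

def winner_takes_all(filtered_cost):
    """Select, per pixel, the disparity with minimum aggregated cost."""
    return np.argmin(filtered_cost, axis=-1)
```

Note that the weight-predicting network runs once per image, while the filtering cost scales linearly with the number of disparity slices.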
The cost-volume filtering is then followed by the classical winner-takes-all strategy for disparity label selection.
Previous work uses shared weights for each of the four directional passes. We, however, use separate weight maps for the vertical and the horizontal passes, which is beneficial for our task because the rate and the statistics of disparity changes along the horizontal and vertical directions can be quite different. Using two weight maps instead of one adds very little computational overhead at test time while increasing the accuracy.
To reduce the computational complexity of our algorithm, we train the edge detector at half the resolution of the original image and then use bilinear interpolation to upsample the edge maps to the original size, while the cost volume is computed at the original image size. Indeed, real-world imagery at half resolution contains sufficient information to extract the edges relevant to disparity discontinuities. Some additional run-time reduction is thus gained at the cost of little or no increase in the disparity error.
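The upsampling step is standard bilinear interpolation; a plain NumPy sketch for a single-channel map follows (in practice a library routine or a GPU kernel would be used, and the half-pixel sample alignment chosen here is one of several common conventions).

```python
import numpy as np

def upsample_bilinear(m, factor=2):
    """Bilinearly upsample a 2-D map (e.g. a half-resolution edge map)
    by an integer factor, using half-pixel-centre alignment."""
    H, W = m.shape
    ys = (np.arange(H * factor) + 0.5) / factor - 0.5
    xs = (np.arange(W * factor) + 0.5) / factor - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None]
    wx = np.clip(xs - x0, 0, 1)[None, :]
    top = m[y0][:, x0] * (1 - wx) + m[y0][:, x1] * wx
    bot = m[y1][:, x0] * (1 - wx) + m[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```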
We use an exponential non-linearity of the form
$$w_i = \exp(-\alpha\, s_i)$$
to map the output $s_i$ of the convolutional network to the weights of the domain transform, where $\alpha$ is, once again, a tunable parameter that affects the convergence speed of the training process.
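This mapping is a one-liner; both the exact sign convention and the assumption of a non-negative network response (e.g. after a ReLU, which keeps the weights in (0, 1]) are illustrative choices here.

```python
import numpy as np

def dt_weights(net_output, alpha=1.0):
    """Map CNN responses s to domain transform weights w = exp(-alpha*s).
    Strong boundary responses give small w and stop propagation;
    assumes non-negative responses so that w stays in (0, 1]."""
    return np.exp(-alpha * net_output)
```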
To match the filtered cost volume with the ground truth disparity field, we represent each ground truth label as a one-hot vector and use the soft-max cross-entropy loss to evaluate the final disparity field error, thus maximizing the log-probability of the correct displacement at every pixel where the disparity is known at training time. We did not observe a practical benefit from using a narrow Gaussian distribution instead of a one-hot vector, as proposed previously.
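The loss can be sketched as below. Treating the negated filtered costs as soft-max scores is an assumption consistent with winner-takes-all selection (lower cost should mean higher probability); the training code would use the framework's own soft-max cross-entropy.

```python
import numpy as np

def disparity_xent_loss(filtered_cost, gt_disp, valid):
    """Soft-max cross-entropy over the disparity dimension, averaged
    over pixels where the ground truth disparity is known (valid mask).
    Scores are the negated filtered costs (lower cost = higher score)."""
    scores = -filtered_cost                        # H x W x (d_max+1)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    log_p = scores - np.log(np.exp(scores).sum(axis=-1, keepdims=True))
    yy, xx = np.nonzero(valid)
    return -log_p[yy, xx, gt_disp[yy, xx]].mean()
```

With a uniform cost volume the loss equals $\log(d_{\max}+1)$, and lowering the cost of the true disparity lowers the loss, as expected.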
The training in our method is end-to-end (fig. 4): the disparity field error is backpropagated through the recurrent part (composed of the four differentiable domain transform passes) to the weight-computing convolutional network, so that the network learns to find disparity edges suitable for cost-volume filtering.
IV-A Data set
We evaluated our method on the public KITTI 2015 data set, a collection of color image pairs taken from a car roof while driving in a European city. Each pair is rectified, and the ground truth is given in the form of a sparse disparity field obtained with a LIDAR. The disparity regions corresponding to cars were further refined into dense fields using geometrically accurate CAD models fit to the point clouds. The images are relatively large and the displacements span a wide disparity range. Computing the stereo correspondence is especially challenging around reflective surfaces (car windshields and windows), textureless regions (homogeneous car bodies), and thin structures surrounded by large disparity discontinuities. Most of the remaining disparity field consists of slanted surfaces corresponding to the road.
IV-B Left-right disparity check
Since the ground truth disparity fields include occluded regions, a standard left-right check procedure is necessary to interpolate the disparity fields within those regions and obtain accurate results. Following a standard algorithm, we compute the disparity fields for both the left-right and right-left image pairs and then interpolate the occluded and mismatched pixels.
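The consistency test itself can be sketched as below: a pixel is kept when its left-view disparity agrees with the disparity of the matched pixel in the right view. The tolerance `tol` and the border clipping are illustrative choices; the interpolation of rejected pixels is omitted.

```python
import numpy as np

def left_right_check(disp_l, disp_r, tol=1):
    """Return a boolean mask that is True where the left disparity is
    consistent with the right view: |d_L(x) - d_R(x - d_L(x))| <= tol."""
    H, W = disp_l.shape
    xs = np.arange(W)[None, :].repeat(H, axis=0)
    ys = np.arange(H)[:, None].repeat(W, axis=1)
    match = np.clip(xs - disp_l, 0, W - 1)  # matched column in right view
    return np.abs(disp_l - disp_r[ys, match]) <= tol
```

Pixels failing the check are treated as occluded or mismatched and are filled by interpolation from their consistent neighbours.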
IV-C Details of learning
The data set was split into a training set (160 image pairs) and a validation set (40 image pairs). The pre-processing, including mean subtraction, was adopted from the edge detection network.
The data term is the linear combination of the per-pixel SAD and census transform costs; the coefficient balancing the two costs was chosen experimentally, while the remaining parameter does not noticeably affect the final accuracy.
The original architecture was modified in two ways. First, the number of feature maps in the convolutional layers was halved to decrease the run-time. Second, the number of feature maps used for the final linear combination was changed from 1 to 8 (fig. 1).
We train the network to minimize the cross-entropy loss using the ADAM method with a fixed learning rate. We benefit from initializing with the edge detection model pre-trained on the BSDS data set, where ground truth edge labels come from manual annotation. The combination of cross-entropy losses across different scales is performed as in the original edge detection work.
An example of the original HED edge detection output compared to the learned domain transform weights can be seen in fig. 2. Although the learned weights do not lend themselves to a simple interpretation, one can observe that some of the edges were suppressed during training, and that the degree of smoothing for the horizontal and vertical passes differs after training.
The learning was implemented using a combination of the Theano framework and fast CUDA kernels for the AD-census cost volume computation. The test-time evaluation framework combines the edge detector, implemented with the cuDNN library, with a CUDA implementation of the domain transform. We measured the runtime of our implementation on a PC with an NVIDIA GeForce GTX Titan X GPU. Training takes about 4-5 hours.
Table I: runtime per pipeline stage (stage, number of calls, total runtime in msec).
The per-stage runtime of our method is shown in Table I. The total run-time is 34 msec per image pair (29 frames per second), including the overhead associated with the left-right check. The largest fractions of the run-time are spent on the domain transform and on the forward computation of the edge detection CNN.
IV-E Quantitative results
Our method achieves competitive error rates on the KITTI 2015 data set. The predicted disparity fields for the first images of the test set can be seen in fig. 6. The most notable challenges for the method are arguably the dense ground truth labels on car bodies. The method also fails to predict the disparity of some thin curved regions that are not aligned with the vertical or horizontal axes. Overall, it produces correct labels for most pixels, including those on slanted surfaces such as the road.
We proposed a new method for computing dense stereo correspondences using a convolutional neural network trained to aggregate the cost volume. The method is based on a multi-scale edge detector used to extract relevant edges, and on a domain transform trained to perform the cost-volume aggregation. The method was evaluated on the KITTI 2015 data set, where it achieves competitive accuracy at 29 frames per second (see fig. 6).
Our approach can be extended to incorporate other cost aggregation approaches, such as semi-global matching, that can also be unrolled into recurrent neural networks for training. We are currently investigating this direction, as it can potentially lead to more accurate aggregation, especially at slanted surfaces.
-  J. Zbontar and Y. LeCun, “Computing the stereo matching cost with a convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5695–5703.
-  J. Zbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, vol. 17, pp. 1–32, 2016.
-  N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition CVPR, 2016.
-  A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
-  D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling, “High-resolution stereo datasets with subpixel-accurate ground truth,” in German Conference on Pattern Recognition. Springer, 2014, pp. 31–42.
-  G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez, “The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition CVPR, 2016.
-  R. Zabih and J. Woodfill, “Non-parametric local transforms for computing visual correspondence,” in European conference on computer vision. Springer, 1994, pp. 151–158.
-  E. S. Gastal and M. M. Oliveira, “Domain transform for edge-aware image and video processing,” in ACM Transactions on Graphics (TOG), vol. 30, no. 4. ACM, 2011, p. 69.
-  C. C. Pham and J. W. Jeon, “Domain transformation-based efficient cost aggregation for local stereo matching,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 7, pp. 1119–1130, 2013.
-  S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1395–1403.
-  H. Hirschmuller, “Accurate and efficient stereo processing by semi-global matching and mutual information,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2. IEEE, 2005, pp. 807–814.
-  K.-J. Yoon and I. S. Kweon, “Adaptive support-weight approach for correspondence search.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 650–656, 2006.
-  C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz, “Fast cost-volume filtering for visual correspondence and beyond,” in Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. IEEE, 2011, pp. 3017–3024.
-  L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille, “Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform,” arXiv preprint arXiv:1511.03328, 2015.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
-  S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1529–1537.
-  A. G. Schwing and R. Urtasun, “Fully connected deep structured networks,” CoRR, vol. abs/1503.02351, 2015.
-  J. T. Barron and B. Poole, “The fast bilateral solver,” in European conference on computer vision (ECCV), 2016.
-  S. Liu, J. Pan, and M.-H. Yang, “Learning recursive filters for low-level vision via a hybrid neural network,” in European Conference on Computer Vision. Springer, 2016, pp. 560–576.
-  K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
-  C. Cigla, “Recursive edge-aware filters for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 27–34.
-  M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th Int’l Conf. Computer Vision, vol. 2, July 2001, pp. 416–423.
-  T. T. D. Team, R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, et al., “Theano: A python framework for fast computation of mathematical expressions,” arXiv preprint arXiv:1605.02688, 2016.