Disparity estimation is an important problem in low-level vision. Given two stereo rectified images, disparity refers to the relative horizontal displacement of two corresponding pixels in the left and right images. From dense disparity maps, we can estimate three dimensional geometry, which is critical for many computer vision applications, including autonomous vehicle navigation and 3D model reconstruction.
Traditionally, dense disparity has been estimated using window-based correlation, with smoothing, occlusion and globally-optimal matching constraints applied [jang2011efficient, adhyapak2007stereo, kanade1991stereo, hirschmuller2008stereo]. However, these constraints are difficult to hand-craft, and global optimization is not practical for real-time applications. Recently, stereo matching has advanced greatly with the help of CNNs: meaningful learned features have proven more effective than hand-crafted ones, and more sophisticated architectures can estimate dense disparity through end-to-end training. Such end-to-end disparity regression from stereo pairs requires a large number of image pairs with ground truth disparities during training, but no sufficiently large real-world dataset is currently available. Instead, models are pretrained on large synthetic datasets [mayer2016large, dosovitskiy2015flownet] and then fine-tuned on the real-world target dataset. With this training pipeline, recent papers [kendall2017end, yang2018segstereo, chang2018pyramid, liang2018learning] achieve an impressive error rate below 2% on the KITTI stereo matching benchmark [geiger2012we, Menze2015object]. Still, training on synthetic data and testing on real data remains challenging. In this paper, we develop an unsupervised stereo matching method for dense disparity estimation to help overcome these challenges.
Despite the advances in disparity estimation since the application of DNNs, finding correspondences in regions of high specularity, occlusion or low texture remains challenging. These areas manifest as noise or missing regions in the resulting disparity map. For example, in Fig. 1, the disparity at the center of the road is incorrect because it is a low-texture area where correspondence is hard to find. We argue that more contextual semantic information is needed to determine accurate disparity in these challenging regions.
With the rise and success of object classification [he2016deep], a new task known as semantic segmentation has also gained popularity and benefited from access to large amounts of labeled data [lin2014microsoft, Cordts2016Cityscapes]. This problem moves beyond simple bounding boxes and attempts to assign every pixel in an image a semantic label. The dense nature of this problem is complementary to the disparity estimation task. Moreover, segment embedding learned from semantic segmentation can provide further cues for estimating disparity within ill-posed regions, because disparity tends to be smooth within an object or segment. From this perspective, models for disparity estimation need a high-level understanding of objects, or at least segments, so stereo matching is no longer a purely low-level vision problem. Here, we set out to exploit the connection between these two pixel labeling tasks, disparity estimation and semantic segmentation, to improve the performance of disparity estimation.
In this paper, we focus on unsupervised stereo matching guided with the semantic segmentation task. The main contributions of this paper are:
(1) We propose a model which outputs a disparity map and semantic segments simultaneously, and then both can be used to acquire 3D semantic information.
(2) We propose a structure which better fuses segment embeddings learned from the semantic segmentation task into the process of disparity estimation. Experiments show that this is helpful for disparity estimation.
(3) Our unsupervised model is able to achieve state-of-the-art results in the KITTI stereo vision benchmark dataset, and can also beat some supervised methods in certain regions.
II Related Work
Typical stereo matching pipelines consist of four steps: matching cost computation, cost aggregation, disparity estimation and refinement. Traditional methods either use local descriptors to find the matching points within a predefined window [joglekar2014image], or they minimize an energy function globally to get an optimal solution [hirschmuller2008stereo].
Supervised Disparity Estimation. Stereo matching has greatly advanced since CNNs were applied to this task in [zbontar2016stereo]. That method was supervised, requiring large datasets with stereo images and ground truth disparity. With this supervised approach, after meaningful features are extracted by a deep Siamese architecture, a cost volume can be computed by simply concatenating features from both sides [zbontar2016stereo], by dot products [luo2016efficient, zbontar2016stereo], by a correlation function [mayer2016large], or by concatenating all potential corresponding feature vectors from both sides [kendall2017end]. Several other papers have also focused on using information from cost volumes, proposing different methods and structures, including simple convolutional layers [mayer2016large], learning context with 3D convolution [kendall2017end], a spatial pyramid pooling module to incorporate more global context [chang2018pyramid], a two-stage refinement structure [pang2017cascade] and two separate branches for small and large disparities [ilg2017flownet, liang2018learning]. In line with these suggestions, we form a five-dimensional cost volume by concatenating features from both sides and extract information from it using 3D convolution. We then refine the initial disparity using extra information from the segment embedding.
Although some large synthetic datasets are now available for training stereo matching, the available real datasets remain relatively small compared to popular datasets for classification and detection. For example, KITTI 2012 and KITTI 2015, the most popular datasets for the stereo matching task, contain no more than 400 stereo images for training. For this reason, unsupervised stereo matching, which does not require ground truth disparity for training, has gained attention, and we adopt it for the stereo branch of our network. This maximizes the flexibility of the training sources, which is important because stereo ground truth is difficult to obtain.
Unsupervised Disparity Estimation. Deep unsupervised stereo matching relies heavily on warping error [garg2016unsupervised, godard2017unsupervised, zhou2017unsupervised, luo2018unsupervised]. This error is measured as the visual difference between a warped image from one half of a stereo pair and the real image from the other camera in the stereo setup. End-to-end training has become popular recently thanks to differentiable bilinear sampling, which can be used to warp images [godard2017unsupervised]. Additionally, a smoothness loss and left-right consistency loss also help improve the quality of results [godard2017unsupervised, zhong2017self]. Although results of these unsupervised methods are reasonable, a large performance gap still exists between these approaches and supervised methods. In this paper, we mainly focus on unsupervised stereo matching, and seek to use supervised semantic segmentation to help narrow this gap.
Guided Disparity Estimation. Both supervised and unsupervised stereo matching methods still have difficulty estimating correct disparity in flat, reflective and occluded regions. Thus, recent papers have sought to leverage extra information such as object-level knowledge [guney2015displets] and segment embedding [yang2018segstereo]. Their results show that exploiting available high-level information is useful for improving performance on the task of dense disparity estimation.
In this paper, we propose a fused model for semantic segmentation and disparity estimation that does not require ground truth disparity maps. Our proposed method is most similar to SegStereo [yang2018segstereo], which was developed concurrently with our approach. However, our methods differ in several important ways. We focus on unsupervised stereo matching, where the segment embedding is not only fused into disparity estimation, as in SegStereo, but also used to regularize disparity in the loss. Additionally, SegStereo computes a correlation layer, which may lose information, whereas we form a cost volume retaining all features, enabling the network to learn more complete feature representations. With additional refinement of the initial disparity, our model outperforms SegStereo by over 2% in error rate.
We present a joint model for disparity estimation and semantic segmentation. These two tasks are highly coupled in the network, with the semantic segment embedding being directly fused into the refinement process for disparity estimation. The whole architecture of our model is illustrated in Fig. 2.
III-A Architecture for Disparity Estimation
A Siamese structure is used to process the left and right images and generate high-level features for stereo matching and semantic segmentation. Features for the segmentation task come from deeper layers of the network than those used for stereo matching, as the former requires more contextual information than the latter. A ResNet [he2016deep] backbone is used in the Siamese structure. Each task corresponds to a branch in the network. In the disparity branch, the spatial size of the input features is 1/4 that of the original stereo images. We concatenate the stereo-matching features from the left and right viewpoints, which produces a five-dimensional cost volume. 3D convolutions are used to extract an initial disparity map from this volume. Finally, the segment embedding is fused in to refine the initial disparity map.
III-B Cost Volume and Learning Context
After calculating the left and right features for stereo matching, we form a cost volume by concatenating them: every feature vector from one side is concatenated with all potential corresponding feature vectors from the other side. This results in a cost volume with dimensions Batchsize x (Max_disparity + 1) x Height x Width x Feature_size. We form both left and right cost volumes in order to calculate a disparity map for each view. Unlike other methods that use a dot product or another fixed metric to measure correlation between feature vectors, the five-dimensional cost volume enables the network itself to learn a better correlation metric during training.
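The construction above can be sketched as follows. This is a minimal NumPy illustration for a single image pair (the actual model is implemented in TensorFlow and operates on batches); the function and variable names are ours, not the paper's:

```python
import numpy as np

def build_cost_volume(feat_l, feat_r, max_disp):
    """Concatenation cost volume (illustrative sketch, not the paper's code).

    feat_l, feat_r: (H, W, C) feature maps from the Siamese encoder.
    Returns a (max_disp + 1, H, W, 2C) volume: at disparity d, each left
    feature is paired with the right feature shifted d pixels to the left.
    Columns with no valid right-view counterpart are left as zeros.
    """
    H, W, C = feat_l.shape
    volume = np.zeros((max_disp + 1, H, W, 2 * C), dtype=feat_l.dtype)
    for d in range(max_disp + 1):
        volume[d, :, :, :C] = feat_l
        # the right-view feature at column x - d corresponds to left column x
        volume[d, :, d:, C:] = feat_r[:, : W - d]
    return volume
```

Because both feature vectors are kept intact (rather than collapsed into a scalar correlation score), the subsequent 3D convolutions are free to learn their own matching metric.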
To extract information from the cost volume, a 3D convolution filter loops over all three dimensions of height, width and potential disparity values. This step captures broader contextual information. Since 3D convolution is memory intensive, an encoder-decoder structure is used to reduce the memory footprint. We apply bilinear upsampling to the output to match the shape of the input images. Soft argmin is used to produce the initial disparity map from this intermediate result.
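The soft argmin step can be illustrated as below: a softmax over negated costs along the disparity axis, followed by an expectation, yielding a differentiable, sub-pixel disparity estimate (a sketch in the spirit of [kendall2017end]; names are ours):

```python
import numpy as np

def soft_argmin(cost):
    """Soft argmin over the disparity axis (illustrative sketch).

    cost: matching cost of shape (D, H, W); lower cost = better match.
    Returns a sub-pixel disparity map of shape (H, W), computed as the
    expectation of the disparity index under softmax(-cost).
    """
    d = np.arange(cost.shape[0], dtype=np.float64)
    neg = -cost
    neg = neg - neg.max(axis=0, keepdims=True)  # numerical stabilization
    p = np.exp(neg)
    p /= p.sum(axis=0, keepdims=True)           # per-pixel softmax over D
    # expectation over disparity: contract d (D,) with p (D, H, W)
    return np.tensordot(d, p, axes=([0], [0]))
```

Unlike a hard argmin, this keeps the disparity regression differentiable, which is what allows end-to-end training through the cost volume.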
III-C Disparity Refinement
The initial disparity estimate contains considerable noise, and its accuracy is limited by poor matching in ill-posed regions, such as occluded, reflective and texture-less areas. However, the semantic segment embedding can be used to improve correspondence in those regions: disparity in an ill-posed region should have values similar to those of regions from the same semantic segment. Essentially, the smoothness constraint that is often applied globally can more accurately be applied within object boundaries. To this end, after producing the initial disparity map, we use semantic segment information to refine the disparity. The residual structure of the refinement process is shown in Fig. 2.
After convergence, we assume the initial disparity is reasonable in most regions, so the refinement stage focuses on ill-posed regions. A residual structure is used here, forcing the model to learn the highly non-linear relationship in such regions. The initial disparity and the semantic segment embedding are concatenated as the input to the subsequent layers, and the output is summed with the initial disparity to obtain the final estimate.
III-D Architecture for Semantic Segmentation
In both the KITTI and Cityscapes [Cordts2016Cityscapes] datasets, only the left image of each stereo pair is labeled with ground truth semantic segments. However, we perform semantic segmentation on both images in the pair: the left disparity is used to warp the right semantic segmentation to the left view, which is in turn regularized by the left labels during training. Similar to PSPNet [zhao2017pyramid], a PSP module is used to incorporate contextual information from different scales. The spatial size of the input features to the PSP module is 1/8 that of the original stereo images. In the PSP module, the input features are downsampled to three different sizes using average pooling, at scales of 1/2, 1/4 and 1/8 of the input size, followed by convolution with a 1x1 filter to reduce the feature dimension. The different scales of features are then upsampled to the shape of the input feature map through bilinear interpolation and concatenated.
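The pyramid pooling idea can be sketched as follows. This simplified NumPy version omits the 1x1 dimension-reducing convolutions and uses nearest-neighbor instead of bilinear upsampling, purely to show the pool-upsample-concatenate structure; all names are ours:

```python
import numpy as np

def psp_pool(feat, scales=(2, 4, 8)):
    """Simplified spatial-pyramid pooling sketch (after PSPNet).

    feat: (H, W, C) feature map with H and W divisible by every scale.
    Each branch average-pools the input by a factor s, upsamples back to
    (H, W), and all branches are concatenated with the input channel-wise.
    """
    H, W, C = feat.shape
    branches = [feat]
    for s in scales:
        # block-wise average pooling by factor s
        pooled = feat.reshape(H // s, s, W // s, s, C).mean(axis=(1, 3))
        # nearest-neighbor upsampling back to the input resolution
        up = pooled.repeat(s, axis=0).repeat(s, axis=1)
        branches.append(up)
    return np.concatenate(branches, axis=-1)
```

The coarser branches summarize progressively larger neighborhoods, which is how the module injects global context into the per-pixel features.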
III-E Loss Function
For our approach, we pose stereo matching as an unsupervised problem. The objective consists of three terms:

$\mathcal{L} = \mathcal{L}_{init} + \mathcal{L}_{ref} + \mathcal{L}_{seg}$

where $\mathcal{L}_{init}$ supervises the initial estimated disparity, $\mathcal{L}_{ref}$ supervises the refined estimated disparity and $\mathcal{L}_{seg}$ supervises the predicted semantic segments. The first two terms are weighted combinations of the losses defined below; the weighting coefficients are fixed empirically during training. The individual losses are defined as follows:
Photometric loss ($\mathcal{L}_p$): Let $I^l$ and $I^r$ be the input left and right images, and $d^l$ and $d^r$ be the predicted left and right disparity maps. The warping function $f_w$ warps an image to the other view based on a disparity map using bilinear sampling. The reconstructed left image is $\tilde{I}^l = f_w(I^r, d^l)$, and the reconstructed right image is $\tilde{I}^r = f_w(I^l, d^r)$. Each reconstructed image should be very similar to the corresponding original input image. We use both a Euclidean distance term and a structure similarity term SSIM to improve robustness in ill-posed regions [godard2017unsupervised]. For the left image, the photometric loss is defined as follows:

$\mathcal{L}_p^l = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}(I^l_{ij},\tilde{I}^l_{ij})}{2} + (1-\alpha)\,\big\|I^l_{ij}-\tilde{I}^l_{ij}\big\|\right]$

where $N$ is the number of pixels and $\alpha$ balances the SSIM and distance terms; the weighting values were selected through experimentation.
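The warping function at the heart of this loss can be sketched as a horizontal bilinear sampler. The NumPy version below is illustrative (the real model uses TensorFlow's differentiable bilinear sampling); names are ours, and only the L1 part of the photometric comparison is shown, since a full SSIM implementation is longer:

```python
import numpy as np

def warp(image, disp):
    """Horizontal bilinear warp: sample the source view at x - disp(x).

    image: (H, W) source view (e.g. the right image when reconstructing
    the left view with the left disparity). disp: (H, W) non-negative map.
    Sampling positions are clamped at the image border.
    """
    H, W = image.shape
    xs = np.arange(W)[None, :] - disp            # sub-pixel sampling positions
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    w1 = np.clip(xs - x0, 0.0, 1.0)              # fractional interpolation weight
    rows = np.arange(H)[:, None]
    return (1.0 - w1) * image[rows, x0] + w1 * image[rows, x1]

def photometric_l1(img, recon):
    """Mean absolute reconstruction error (the distance part of the loss)."""
    return np.mean(np.abs(img - recon))
```

For example, warping a right image with a constant disparity of 1 simply shifts its columns one pixel to the right, which is exactly the reconstruction the photometric loss compares against the left image.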
Regularization loss ($\mathcal{L}_r$): The regularization loss smooths local disparity using information directly from the input images, and we only use it when estimating the initial disparity. We assume that disparity in a local region tends to be smooth, so we add this loss to suppress high-frequency noise introduced by the photometric loss term. The regularization loss is the sum of the weighted second derivatives of the disparity map, where the weight is a negative exponential of the second derivative of the input image: the higher the second derivative of the input image, the higher the probability of a genuine change in disparity, and thus the lower the penalty. For the left side, the regularization loss is defined as follows:

$\mathcal{L}_r^l = \frac{1}{N}\sum_{i,j}\left[\big|\partial_x^2 d^l_{ij}\big|\,e^{-\left|\partial_x^2 I^l_{ij}\right|} + \big|\partial_y^2 d^l_{ij}\big|\,e^{-\left|\partial_y^2 I^l_{ij}\right|}\right]$

where $N$ is the number of pixels, and $\partial_x^2$ and $\partial_y^2$ are second derivatives along the X and Y axes.
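A sketch of this edge-aware second-order penalty, using finite differences over the valid interior (an illustrative NumPy version; names are ours):

```python
import numpy as np

def regularization_loss(disp, image):
    """Edge-aware second-order smoothness (illustrative sketch).

    Penalizes the second finite difference of the disparity, down-weighted
    (via a negative exponential) wherever the image itself has a strong
    second derivative, i.e. where a real depth discontinuity is likely.
    """
    loss = 0.0
    for axis in (0, 1):
        dd = np.abs(np.diff(disp, n=2, axis=axis))   # |d'' of disparity|
        di = np.abs(np.diff(image, n=2, axis=axis))  # |d'' of image|
        loss += np.mean(dd * np.exp(-di))            # weight shrinks at edges
    return loss
```

A disparity map that is linear along each axis (a fronto-parallel or uniformly slanted surface) incurs zero penalty, while isolated spikes are penalized unless the image shows a matching intensity discontinuity.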
Consistency loss ($\mathcal{L}_c$): We can also synthesize a left image $\tilde{I}^{l\prime} = f_w(\tilde{I}^r, d^l)$ from the reconstructed right image $\tilde{I}^r$, and a right image $\tilde{I}^{r\prime} = f_w(\tilde{I}^l, d^r)$ from the reconstructed left image $\tilde{I}^l$. The consistency loss is defined as follows:

$\mathcal{L}_c = \frac{1}{N}\sum_{i,j}\left[\big\|\tilde{I}^{l\prime}_{ij} - I^l_{ij}\big\| + \big\|\tilde{I}^{r\prime}_{ij} - I^r_{ij}\big\|\right]$
This consistency forces the left and right branches to be consistent with one another [zhong2017self].
Smoothness loss ($\mathcal{L}_s$): For difficult regions, we argue that the network should be able to infer the disparity from its neighbors within a segment. Assuming the initial disparity is reasonable, we use a left-right consistency check to find these regions, so we only include this loss in the refinement stage. We warp the right disparity $d^r$ using the left disparity $d^l$ to form a reconstructed disparity map $\tilde{d}^l = f_w(d^r, d^l)$. Then we threshold the absolute difference between $d^l$ and $\tilde{d}^l$:

$M_{ij} = \begin{cases} 1 & \text{if } \big|d^l_{ij} - \tilde{d}^l_{ij}\big| > t \\ 0 & \text{otherwise} \end{cases}$

where $t$ is the threshold and is set to 3 in the experiments. Too large a threshold will result in a trivial solution. In addition, the disparity should be smooth inside a segment. These segments are learned from the semantic segmentation task; shallower layers are used here instead of the final semantic segmentation layer, biasing the model toward learning smaller segments. We apply a cost to enforce smoothness within a segment. For the left side,

$\mathcal{L}_s^l = \frac{1}{N}\sum_{i,j} M_{ij}\left[\big|\partial_x d^l_{ij}\big|\,e^{-\left|\partial_x F^l_{ij}\right|} + \big|\partial_y d^l_{ij}\big|\,e^{-\left|\partial_y F^l_{ij}\right|}\right]$

where $F^l$ denotes the feature vectors from the left view. This loss is only applied during refinement because it is conditioned on a relatively good initial disparity.
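The mask-and-smooth mechanism can be sketched as below. For simplicity the segment features are a single scalar channel and the mask is applied by cropping to match the finite-difference shapes; this is an illustrative NumPy version with our own names, not the paper's implementation:

```python
import numpy as np

def lr_check_mask(disp_l, disp_l_warped, t=3.0):
    """Mask of pixels failing the left-right consistency check.

    disp_l_warped is the right disparity warped into the left view with
    the left disparity; a large mismatch flags an unreliable pixel.
    """
    return (np.abs(disp_l - disp_l_warped) > t).astype(np.float64)

def masked_smoothness(disp, feat, mask):
    """First-order smoothness inside flagged regions, gated by features.

    disp, feat, mask: (H, W). Small feature differences (same segment)
    give weights near 1, pulling the disparity smooth there; strong
    feature edges (segment boundaries) suppress the penalty.
    """
    loss = 0.0
    # vertical term
    loss += np.mean(mask[:-1, :] * np.abs(np.diff(disp, axis=0))
                    * np.exp(-np.abs(np.diff(feat, axis=0))))
    # horizontal term
    loss += np.mean(mask[:, :-1] * np.abs(np.diff(disp, axis=1))
                    * np.exp(-np.abs(np.diff(feat, axis=1))))
    return loss
```

Pixels that pass the consistency check (mask = 0) contribute nothing, so the loss only pulls the disparity smooth inside the unreliable regions it is meant to repair.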
Segmentation loss ($\mathcal{L}_{seg}$): Conventional softmax cross-entropy loss is used to measure the difference between the logits map and the ground truth segment labels. For a stereo image dataset, only images from one side are labeled; for example, KITTI and Cityscapes only have segment labels for the left images. However, the left disparity map relates the left and right images, so we can use it to warp the right output segments to the left view and then supervise them with the left ground truth labels.
III-F Post Processing
Simple post processing can be used to improve the final results. Although the smoothness loss can reduce the effects of occlusion, our model is still prone to error in occluded regions. Our post processing consists of two steps: a left-right consistency check and interpolation.
After calculating both the left and right disparity, we perform a left-right consistency check. For the left view, a pixel fails the check if the difference between its disparity value and that of the corresponding pixel in the right view is greater than a certain threshold. We set this threshold to 1 and obtain a boolean mask, to which we apply a median filter because it contains a fair degree of noise. Then, in the failure regions, we assign disparity values from the background: as proposed in [zbontar2016stereo], we interpolate each invalid pixel by moving left until finding a position with a valid disparity and use that as its value. No further global optimization is applied.
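The background interpolation step can be sketched as follows (an illustrative NumPy version of the fill rule described above; the fallback for a leading run of invalid pixels is our own simple choice and names are ours):

```python
import numpy as np

def fill_from_left(disp, valid):
    """Fill invalid pixels with the nearest valid disparity to their left.

    disp: (H, W) disparity map; valid: (H, W) boolean mask from the
    left-right consistency check. Occluded pixels usually belong to the
    background, whose disparity lies to their left in the left view.
    Invalid pixels before the first valid one in a row take the first
    valid value in that row (a simple fallback assumption).
    """
    out = disp.copy()
    H, W = disp.shape
    for row in range(H):
        cols = np.nonzero(valid[row])[0]
        if cols.size == 0:
            continue                       # nothing reliable in this row
        last = out[row, cols[0]]           # fallback for leading invalid run
        for col in range(W):
            if valid[row, col]:
                last = out[row, col]
            else:
                out[row, col] = last
    return out
```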
IV Experimental Evaluation
In this section, we explain our implementation details and present qualitative and quantitative results.
IV-A Datasets

KITTI: KITTI 2012 and KITTI 2015 are two benchmark real-world driving datasets. They provide ground truth disparity computed from a calibrated high-resolution 3D LiDAR. Both KITTI 2012 and KITTI 2015 contain approximately 200 rectified stereo images with ground truth disparity for evaluation. We primarily focus on the KITTI 2015 benchmark. Compared to KITTI 2012, challenging regions (e.g. car windshields) in KITTI 2015 are more correctly represented in the ground truth because it uses CAD models to produce disparity values for evaluation. Additionally, only KITTI 2015 contains ground truth for semantic segmentation. For evaluation, pixels are divided into two overlapping categories: strictly non-occluded regions (NOC) and all pixel regions (ALL). The KITTI 2015 benchmark considers a pixel "correct" if its disparity error is less than 3 pixels or less than 5% of the ground truth disparity.
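The KITTI outlier criterion can be written down directly: a pixel is counted as erroneous only when its disparity error exceeds both the absolute and the relative tolerance. A small sketch (function name ours):

```python
import numpy as np

def d1_error(disp_est, disp_gt, valid):
    """KITTI D1 outlier rate in percent (illustrative sketch).

    A pixel is an outlier only if its disparity error exceeds BOTH
    3 px and 5% of the ground-truth disparity; `valid` masks pixels
    that have ground truth (LiDAR coverage is sparse).
    """
    err = np.abs(disp_est - disp_gt)
    bad = (err > 3.0) & (err > 0.05 * disp_gt) & valid
    return 100.0 * bad.sum() / max(valid.sum(), 1)
```

Note the dual threshold: a 4 px error on a 100 px disparity is only 4% relative error and therefore still counts as correct.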
Cityscapes: Cityscapes is a dataset for semantic urban scene understanding. It contains 5,000 stereo color images collected in 50 cities, with high-quality pixel-level ground truth semantic labels for the left view of each pair. These images are split into 2,975 for training, 500 for validation and 1,525 for testing. There are no ground truth disparity maps in the Cityscapes dataset, but precomputed disparity maps generated with the SGM [hirschmuller2008stereo] algorithm are provided.
IV-B Implementation Details
In the experiments, we implement our architecture in TensorFlow. All experiments are run on a single NVIDIA Titan-X GPU. The original stereo images are normalized to values ranging from -1 to 1. Due to GPU memory limitations, we use a maximum batch size of 1, set the maximum disparity to 192, and randomly crop images down to 256x512 patches before feeding them into the network. During optimization, we use the Adam optimizer [kinga2015method]. The learning rate is set separately for pre-training on Cityscapes and for fine-tuning on KITTI, and is halved at regular intervals. We pre-train on Cityscapes and then fine-tune the model on KITTI; the fine-tuning process takes approximately 1 day. No data augmentation is performed in the experiments.
IV-C Results

Here, we report the results of our model on the KITTI and Cityscapes datasets and compare our approach to other state-of-the-art methods.
IV-C1 KITTI Benchmark
We report results on 40 validation images split from the 200 training stereo images of KITTI 2015 to evaluate our model. We compare our model with other unsupervised learning methods in Table I. Note that our model outperforms the other unsupervised methods by a notable margin. In the table, 'CS' refers to training the model on the Cityscapes dataset, 'K' refers to training the model on KITTI and 'PP' refers to refining the disparity with post processing. With pre-training on the Cityscapes dataset and simple post processing, the results of our model are further improved. In addition, Table II compares our method to other supervised approaches on the KITTI 2015 leaderboard. Although a gap remains between our performance and that of current state-of-the-art supervised methods, our model achieves comparable results and even beats DispNet, a supervised method, in background regions. Sample results are shown in Fig. 3 (a).
We only show qualitative results from the Cityscapes dataset because it does not provide ground truth disparity maps. The results are shown in Fig. 3. Note that compared with the SGM approach, our model is able to generate much more complete and visually accurate disparity maps.
Model                                     NOC pixels    All pixels
Zhou et al. [zhou2017unsupervised]        8.61          9.91
Godard et al. [godard2017unsupervised]    -             9.19
Luo et al. [luo2018unsupervised]          6.31          6.63
Ours (CS & K)                             5.84          6.29
Ours (K & PP)                             5.29          5.69
Ours (CS & K & PP)                        5.20          5.67
IV-D Ablation Study on Loss Components
We perform ablation experiments to evaluate the different components of our loss function. The results of the ablation study are shown in Table IV. Models are trained and evaluated on KITTI 2015 without pretraining or any post processing. The results of our model are improved by the two-stage refinement, the proposed smoothness loss and the incorporation of semantic segmentation supervision. Specifically, the error rate is reduced from 7.04 to 6.53 with the proposed smoothness loss, and further from 6.53 to 5.93 with segment supervision. Fig. 1 shows a qualitative result: with semantic segmentation supervision, the model corrects the wrongly estimated disparity at the center of the road, a highly reflective region.
IV-E Performance Analysis
In Table III, we present detailed error rates for the regions of each semantic class before and after adding the smoothness loss and fusing the segment embedding, to examine how segment embedding learned from semantic segmentation benefits disparity estimation. In the table, 'smo' refers to the smoothness loss and 'seg' refers to the segmentation loss. The first row lists the semantic class names; the second row shows error rates for the model with all loss components except the smoothness and segmentation losses; the third row shows error rates with all losses except the segmentation loss; the fourth row shows error rates with all losses; and the final row shows the percentage reduction in error rate after adding the smoothness and segmentation losses. As shown in the table, the smoothness loss helps improve disparity estimation for large semantic classes but not for small ones. For example, error rates for regions of large semantic classes such as roads, cars and buses decrease substantially, but error rates for regions of small semantic classes, such as poles, traffic lights and traffic signs, actually increase after imposing the smoothness loss. This is because, without the guidance of semantic segmentation, the smoothness loss blindly forces local disparity to be smooth, so the disparities of small objects are smoothed into their neighbors, which results in more errors.
However, with supervision of the semantic segmentation task, the model is able to learn semantic features. In this case, disparity smoothness loss will force the disparity to be smooth within segments with the same semantic meanings rather than blindly with neighboring segments. Thus, disparities for small objects will remain coherent. It is shown in the table that error rates on regions of poles, traffic lights, traffic signs and other small semantic classes decrease to the lowest level after supervision of the semantic segmentation task.
The focus of this work is on improving the state of the art for unsupervised disparity estimation guided by semantic segmentation. We also evaluate our method on semantic segmentation performance. Our baseline IoU is 47.6%; after disparity refinement, segmentation performance decreases slightly to 46.9% when evaluating on the 40 validation images from KITTI 2015. This suggests that the disparity loss forces features to differ even within a semantic class. Future work will focus on improving performance on semantic segmentation.
IV-F Qualitative Results: 3D Models
We triangulate the disparity maps into 3D point clouds with semantic labels using the camera parameters, as shown in Fig. 4. We only consider pixels whose disparity is above 5. Note that simultaneously computing both disparity and semantic class enables us to efficiently produce semantic 3D models, which can be used more directly for driving tasks than independently produced outputs.
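The back-projection from disparity to 3D points follows the standard pinhole stereo model: depth is Z = f * B / d, and pixel coordinates are lifted through the intrinsics. A minimal sketch, assuming a known focal length f, baseline B and principal point (cx, cy) (names ours):

```python
import numpy as np

def disparity_to_points(disp, f, baseline, cx, cy, min_disp=5.0):
    """Back-project a disparity map to a 3D point cloud (pinhole sketch).

    disp: (H, W) disparity map in pixels. f: focal length in pixels,
    baseline: stereo baseline in meters, (cx, cy): principal point.
    Pixels with disparity <= min_disp are dropped, as in the paper.
    Returns an (M, 3) array of (X, Y, Z) points in camera coordinates.
    """
    v, u = np.nonzero(disp > min_disp)   # pixel rows (v) and columns (u)
    d = disp[v, u]
    Z = f * baseline / d                 # depth from disparity
    X = (u - cx) * Z / f                 # lift through the intrinsics
    Y = (v - cy) * Z / f
    return np.stack([X, Y, Z], axis=1)
```

Attaching the per-pixel semantic label predicted by the segmentation branch to each returned point then yields the semantic 3D model directly.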
V Conclusion

We propose a model in which segment embedding learned from semantic segmentation is fused into the disparity estimation process. This segment embedding is helpful for estimating disparity in ill-posed regions. We demonstrate the efficacy of our method on both the KITTI and Cityscapes datasets. Our unsupervised method achieves results comparable to supervised methods on KITTI and even outperforms some of them in background regions. Outputting disparities and semantic segments simultaneously enables us to efficiently produce semantic 3D models. For future work, we plan to exploit instance segment labels, as instance segments have the potential to provide further cues for object boundaries and finer details.
This work was supported by a grant from Ford Motor Company via the Ford-UM Alliance under award N022884.