A robust solution for semi-dense stereo matching is presented. It utilizes two CNN models for computing stereo matching cost and performing confidence-based filtering, respectively. Compared to existing CNN-based matching cost generation approaches, our method feeds additional global information into the network so that the learned model can better handle challenging cases, such as lighting changes and lack of textures. By utilizing nonparametric transforms, our method is also more self-reliant than most existing semi-dense stereo approaches, which rely heavily on parameter adjustment. The experimental results on the Middlebury Stereo dataset demonstrate that the proposed approach outperforms the state-of-the-art semi-dense stereo approaches.
Humans rely on binocular vision to perceive 3D environments. Even though it is a passive system, our brains can still estimate 3D information more rapidly and robustly than many active or passive sensors that have been developed. One of the reasons is that brains can utilize prior knowledge to understand the scene and to infer the most reasonable depth hypothesis even when the visual cues are lacking. Recent advances in machine learning have shown that the brain’s discrimination power can be mimicked using deep convolutional neural networks (CNNs). Hence, one has to wonder how CNNs can be used to enhance traditional stereo matching algorithms.
Approaches have been proposed for generating matching cost volumes (a.k.a. disparity space images) using CNNs [21, 39, 42]. While these approaches produce inspiring results, they are not robust enough to handle challenging and ambiguous cases, such as lighting changes and lack of textures. Heuristically defined postprocessing steps are often applied to correct mismatches. Our hypothesis is that the performance of CNNs can be noticeably improved if more information is fed into the network. Hence, instead of trying to correct mismatches in postprocessing, we introduce, in the preprocessing step, image transforms that are robust against lighting changes and can add distinguishable patterns to textureless areas. The outputs of these transforms are used as additional information channels, together with grayscale images, for training a matching CNN model.
The experimental results show that the learned model can effectively separate correct stereo matches from mismatches, so that accurate disparity maps can be generated using the simplest Winner-Take-All (WTA) optimization.
Learning-based approaches were also proposed to compute confidence measures for generated disparity values so that mismatches can be filtered out [4, 32, 39]. Following this idea, a second CNN model is designed to evaluate the disparity map generated through WTA. Trained with only one input image and the disparity map, this evaluation CNN model can effectively filter out mismatches and produce accurate semi-dense (a.k.a. sparse) disparity maps.
Figure 1 shows the pipeline of the whole process. Since both matching cost generation and disparity confidence evaluation are performed using learning-based approaches, the algorithm contains very few handcrafted parameters. The experimental results on the Middlebury 2014 stereo dataset [29] demonstrate that the present dual-CNN algorithm outperforms most existing sparse stereo techniques.
Stereo matching algorithms can be broadly categorized into two classes: dense and sparse stereo matching [18, 30]. Dense stereo matching algorithms assign disparity values to all pixels, whereas sparse matching approaches only output disparities for pixels with sufficient visual cues. A typical pipeline implemented in most dense and sparse stereo matching algorithms consists of matching cost computation, cost aggregation, disparity computation and optimization, and refinement steps [30]. Here, we focus our discussion on sparse stereo matching.
An early work on sparse stereo matching calculated disparities for distinctive points first and gradually generated disparity values for the remaining pixels [19]. Graph cuts were later introduced in [37] to detect textured areas as an alternative to unambiguous points and to generate corresponding semi-dense results. By design, this approach can filter out mismatches caused by lack of textures, but not those caused by occlusions. This limitation is addressed in Semi-Global Matching (SGM) [11], in which multiple 1D constraints are used to generate accurate semi-dense results based on peak removal and consistency checks. Gong and Yang [7] proposed a reliability measure to detect potential mismatches from disparity maps generated using Dynamic Programming (DP). This work was later extended and implemented on graphics hardware for real-time performance [8]. Psota et al. [28] utilized Hidden Markov Trees (HMT) to create minimum spanning trees based on color information, which allows aggregated costs to be passed along the tree branches; isolated mismatches were then removed by median filtering.
Generally, the above algorithms utilize additional constraints in the cost aggregation and/or disparity computation step to improve the accuracy of sparse stereo matching. Instead of designing new constraints or assumptions, we train CNNs for both generating aggregated cost volumes and detecting potential mismatches. The disparity computation step, on the other hand, is performed using the simplest WTA approach.
Traditionally, sum of absolute differences (SAD), sum of squared differences (SSD), and normalized cross-correlation (NCC) have been commonly used for calculating and aggregating matching costs [30]. These window-based matching techniques rely on local intensity values and may not behave well near discontinuities in disparity. Zabih and Woodfill [40] therefore proposed two nonparametric local transforms, referred to as rank and census transforms, to address the correspondences at the boundaries of objects. A recent attempt combined different window-based matching techniques for stereo matching [1].
In the past few years, various works were proposed to generate matching cost volumes that can better differentiate correct matches from mismatches. The most encouraging direction is using ground-truth data [10, 34, 36, 41] to train various neural networks to learn local image features. More recent works mostly opted for CNNs trained on ground-truth data to predict the likeness of each potential match based on fixed windows, as in [42]. The matching costs were set using the output of the CNNs directly.
Owing to the success of CNNs, stereo matching algorithms have been progressively improved over the past three years. Zhang et al. [43] used CNNs and SGM to generate initial disparity maps and further combined Left-Right Difference (LRD) [12] with disparity distance regarding local planes to perform a confidence check. In addition, they adopted segmentation and surface normals within the postprocessing to enhance the reliability of disparity estimation. To fully utilize the ability of CNNs in terms of feature extraction, Park and Lee [21] proposed a revised CNN model based on a large pooling window between convolutional layers for wider receptive fields to compute the matching cost, and they applied a postprocessing pipeline similar to the one introduced in [42]. Another model revision, similar to Park and Lee’s work [21], was introduced by Ye et al. [39], which used a multi-size and multi-layer pooling scheme to take wider neighboring information into consideration. Moreover, a disparity refinement CNN model was later demonstrated in their postprocessing to blend the optimal and suboptimal disparity values. Both of the above revisions presented solid results in image areas with little or no texture, disparity discontinuities, and occlusions.
Attempts were also made to train end-to-end deep learning architectures for predicting disparity maps from input images directly, without the need to explicitly compute the matching cost volume [3, 14, 20]. As a result, these end-to-end models are efficient but require a larger amount of GPU memory than the previous patch-based approaches. More importantly, these models were often trained on stereo datasets with specific image resolutions and disparity ranges and hence cannot be applied to other input data. They also restrict the feasibility of training CNNs to concurrently preserve geometric and semantic similarity proposed in [5, 31, 38].
Once dense disparity results are generated, confidence measures can be applied to filter out inaccurate disparity values in the disparity refinement step. Quantitative evaluations of traditional confidence measures were presented by Hu and Mordohai [12], and the most recent review was given by Poggi et al. [27]. Here we review only a few significant works.
Haeusler et al. [9] proposed a random decision forest framework, which combines multivariate confidence measures to improve error detection. Park and Yoon [22] suggested a regression forest framework for selecting effective confidence measures. Based on SGM [11], Poggi and Mattoccia [24] used O(1) features and machine learning to implement an improved scanline aggregation strategy, which performs streaking detection on each path in [11] to compute a confidence measure. They further proposed using a CNN model to enforce local consistency on confidence maps [26]. Recently, CNNs have been applied to confidence measures. Our approach is similar to these recent works [4, 32, 39], which compute confidences by training 2D CNNs on 2D image and/or disparity patches. A key difference, however, is that only the left image and its raw disparity map generated by WTA are used to train our confidence CNN model, whereas existing approaches require generating both left and right disparity maps.
We aim to develop a robust, learning-based stereo matching approach. We observed that, for many applications, it is more important to ensure the accuracy of the output than to generate disparity values for all pixels. Hence, we focus here on assigning disparity values only to pixels with sufficient visual cues.
As shown in Figure 1, two CNN models, referred to as matchingNet and evaluationNet, are utilized in our approach: matchingNet substitutes for the matching cost computation and aggregation steps and outputs a matching similarity for each pixel pair; evaluationNet computes a confidence measure on the raw disparity maps generated by WTA based on the similarity scores.
Our matching CNN model serves the same purpose as “MC-CNN-acrt” in [42], but there are several key differences; see Figure 2. First, we choose to feed the neural network with additional global information (i.e., results of nonparametric transformations) that is difficult to generate through convolutions. Second, 3D convolution networks are employed, which we found can improve the performance. It is worth noting that our approach also differs from other attempts to improve “MC-CNN-acrt”, which use very large image patches and multiple pooling sizes [21, 39]. These approaches require an extensive amount of GPU memory, which limits their usage. To feed global information into a network trained on small patches, our strategy is to perform nonparametric transforms.
For robust stereo matching, lighting difference as an external factor cannot be neglected. To address this factor, “MC-CNN-acrt” manually adjusted the brightness and contrast of image pairs to generate extra data for training. However, lighting differences may vary from one dataset to another, making it hard to train a model that is robust against all cases.
Aiming for an approach with less human intervention, here we propose using rank transform to tolerate lighting variations between image pairs. As a nonparametric local transform, rank transform was first introduced by Zabih [40] to achieve better visual correspondence near discontinuities in disparity. This endows stereo algorithms based on rank transform with the capability to perform similarity estimation for image pairs with different lighting conditions.
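The rank transform for pixel $p$ in image $I$ is computed as:
$$T_R(p) = \frac{1}{|S|} \sum_{q \in S} \mathbb{1}\left[ I(q) < I(p) \right] \qquad (1)$$
where $I(\cdot)$ denotes pixel intensity, $S$ is a set containing pixels within a square window centered at $p$, and $|S|$ is the size of set $S$. Figure 3 shows the results of the rank transform under different window sizes.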
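As a rough illustration, the rank transform described above can be sketched in NumPy. This is a minimal sketch and not the authors' implementation: the function name `rank_transform`, the `radius` parameter, the edge padding at image borders, and the normalization by the window size are all assumptions made for the example.

```python
import numpy as np

def rank_transform(image: np.ndarray, radius: int = 1) -> np.ndarray:
    """Fraction of pixels in a square window that are darker than the centre.

    Assumptions (not taken from the paper): edge-replicated borders and
    normalization by the window size (2 * radius + 1) ** 2.
    """
    img = image.astype(np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    count = np.zeros((h, w), dtype=np.float64)
    # Slide over every offset in the window and compare with the centre pixel.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            neighbour = padded[dy:dy + h, dx:dx + w]
            count += neighbour < img
    return count / (2 * radius + 1) ** 2
```

Because only the ordering of intensities inside the window matters, any monotonically increasing change of brightness or contrast leaves the output unchanged, which is the property the text relies on for robustness to lighting differences.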
Besides lighting variations, regions with little or no texture pose another challenge for stereo matching. For a given pixel within a textureless region, the best way to estimate its disparity is based on neighbors that have similar depth but lie in texture-rich areas (i.e., have sufficient visual cues for accurate disparity estimation). Traditional stereo algorithms [30] mostly utilize cost aggregation, segmentation-based matching, or global optimization for disparity computation to handle ambiguous regions. As mentioned above, our intention is to feed the neural networks with global information. Hence, a novel companion transform is designed and applied in the preprocessing step.
The idea of the companion transform is inspired by SGM [11], which suggests performing smoothness control by minimizing an energy function along 16 directions. In our case, we want to design a transformation that can add distinguishable features to textureless areas. Hence, for a given pixel $p$, we choose to count the number of pixels that: 1) have the same intensity as $p$ and 2) lie on one of the rays starting from $p$. We refer to these pixels as $p$’s companions and to the transform as the companion transform. In practice, we found that 8 ray directions (left, right, up, down, and 4 diagonal directions) work well, though other settings (4 or 16 directions) can also be used.
$$T_C(p) = \sum_{q \in \Omega} \mathbb{1}\left[ I(q) = I(p) \right] \qquad (2)$$
where $I(\cdot)$ denotes pixel intensity and $\Omega$ is a set containing pixels on the rays starting from $p$.
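The counting rule above can be sketched as follows. This is a hedged illustration only: the name `companion_transform`, the `DIRECTIONS_8` constant, and the choice to let every ray run to the image border (counting all equal-intensity pixels along it, rather than stopping at the first different one) are assumptions, since the text does not spell these details out.

```python
import numpy as np

# Eight ray directions as (dy, dx): up, down, left, right and the 4 diagonals.
DIRECTIONS_8 = [(-1, 0), (1, 0), (0, -1), (0, 1),
                (-1, -1), (-1, 1), (1, -1), (1, 1)]

def companion_transform(image: np.ndarray, directions=DIRECTIONS_8) -> np.ndarray:
    """Count, per pixel, the pixels on its rays that share its intensity.

    Assumption: rays extend to the image border and every equal-intensity
    pixel on a ray is counted.
    """
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.int64)
    for dy, dx in directions:
        for step in range(1, max(h, w)):
            oy, ox = dy * step, dx * step
            # Pixels (y, x) for which (y + oy, x + ox) is still inside the image.
            y0, y1 = max(0, -oy), min(h, h - oy)
            x0, x1 = max(0, -ox), min(w, w - ox)
            if y0 >= y1 or x0 >= x1:
                break  # the ray has left the image for every pixel
            same = image[y0:y1, x0:x1] == image[y0 + oy:y1 + oy, x0 + ox:x1 + ox]
            out[y0:y1, x0:x1] += same
    return out
```

Passing `directions=DIRECTIONS_8[:4]` gives the 4-direction variant mentioned in the text. In a uniform region the count depends mainly on how far the region extends along each ray, so pixels at different positions inside a textureless area receive different values.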
To train our CNN model, the 15 image pairs from the Middlebury 2014 stereo training dataset [29], which contains examples of lighting variations and textureless areas, are utilized. Each input image is first converted to grayscale before applying the rank and companion transforms. The outputs of the two transforms, together with the grayscale images, form multichannel images. Each training sample contains a pair of image patches centered at pixel $(x, y)$ in the left image and $(x - d, y)$ in the right image, respectively. The input sample is assembled into a 3D matrix whose dimensions are determined by the patch size and the number of channels in the multichannel image. The ground-truth disparity values provided by the dataset are used to select matched samples, and random disparity values other than the ground truth are used to generate mismatched samples. Similar to [42], we sample matching hypotheses so that the same number of matches and mismatches are used for training. The proposed matchingNet is then trained to output value “0” for correct matches and “1” for mismatches.
For each novel stereo image pair, the matchingNet trained above is used to generate a 3D cost volume $C$, where the value at location $(x, y, d)$ stores the cost of matching pixel $(x, y)$ in the left image with pixel $(x - d, y)$ in the right image. The higher the value, the more likely the corresponding pair of pixels is a mismatch, since the network is trained to output “1” for mismatches. Unlike many existing approaches that resort to complex and heuristically designed cost aggregation and disparity optimization approaches [30], here we rely on the learning network to distinguish correct matches from mismatches. Expecting the correct matches to have the smallest values in the cost volume $C$, the simplest WTA optimization is applied to compute the raw disparity map:
$$D(x, y) = \arg\min_{d} C(x, y, d) \qquad (3)$$
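The WTA step is simple enough to show directly. The sketch below assumes the cost volume is stored as a NumPy array with layout height x width x disparity; that layout and the name `winner_take_all` are illustrative choices, not details from the paper.

```python
import numpy as np

def winner_take_all(cost_volume: np.ndarray) -> np.ndarray:
    """Pick, for each pixel, the disparity index with the lowest matching cost.

    Assumes cost_volume has shape (height, width, num_disparities).
    """
    return np.argmin(cost_volume, axis=2)

# Toy 1 x 2 image with three disparity hypotheses per pixel.
cost = np.array([[[0.9, 0.1, 0.8],
                  [0.2, 0.7, 0.6]]])
raw_disparity = winner_take_all(cost)
```

No smoothness term or neighbourhood information enters this step, so any consistency between neighbouring pixels has to come from the learned costs themselves.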
The matchingNet is trained to measure how well two image patches, one from each stereo image, match. It makes decisions locally and does not check the consistency among the best matches found for neighboring pixels. When the raw disparity maps are computed by local WTA, they inevitably contain mismatches, especially in occluded and low-textured areas. To filter out these mismatches, we construct another CNN model, evaluationNet, to implement a consistency check and compute a confidence measure.
Learning-based confidence measures have been successfully applied to detecting mismatches and further improving the accuracy of stereo matching [27]. Similar to the 2D CNN model for error detection proposed in [39], only left images and their disparity maps are selected to train our model. A key difference, however, is that no handcrafted operation is involved in our approach to fuse left and right disparity maps. In addition, the network contains both 2D and 3D convolutional layers to effectively identify mismatches from disparity maps; see Figure 6. 3D convolution is adopted here to allow the network to learn from the correlation between pixels’ intensities and disparity values.
The evaluationNet is trained using both matches and mismatches in the estimated disparity maps $D$ for all training images. Mismatches are identified by comparing $D$ with the ground-truth disparity maps $G$. Here, a pixel $p$ is considered mismatched iff
$$\left| D(p) - G(p) \right| > \tau \qquad (4)$$
where $\tau$ is a threshold value given in pixels; see Figure 7(b, c).
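The labeling rule can be sketched as below. This is a minimal sketch under stated assumptions: the name `label_matches` is hypothetical, pixels without ground truth are assumed to be marked with non-finite values and are excluded from both classes, and the value of the threshold `tau` is left to the caller.

```python
import numpy as np

def label_matches(disparity: np.ndarray, ground_truth: np.ndarray, tau: float):
    """Split pixels into mismatches and accurate matches by disparity error.

    Assumption: pixels without ground truth hold a non-finite value
    (inf or NaN) and are left out of both masks.
    """
    known = np.isfinite(ground_truth)
    # Replace unknown ground truth by 0 so the subtraction stays finite.
    error = np.abs(disparity - np.where(known, ground_truth, 0.0))
    mismatch = known & (error > tau)
    match = known & ~mismatch
    return mismatch, match
```

The two boolean masks play the role of panels (b) and (c) of the figure referenced in the text.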
In the estimated disparity map $D$, the majority of pixels have correct disparity values, resulting in many more positive (accurately matched) samples than negative (mismatched) samples. Hence, we collect and use all negative samples and randomly select the same number of positive samples. For each selected sample $p$, we extract grayscale and estimated disparity values from patches centered at $p$ to form a matrix. The evaluationNet is then trained to output value “0” for negative samples and “1” for positive samples. The output of the evaluationNet can then be used to filter out potential mismatches, i.e., pixels whose scores are lower than a confidence threshold.
Training samples. Pixels in a given disparity map (a) are classified into mismatches (b) and accurate matches (c) using ground-truth disparity.
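The balancing and filtering steps described above can be illustrated with the following sketch. The helper names `balanced_samples` and `filter_by_confidence`, the use of NaN to mark filtered-out pixels, and the fallback of capping the sample count at the smaller class are assumptions for the example.

```python
import numpy as np

def balanced_samples(mismatch: np.ndarray, match: np.ndarray,
                     rng: np.random.Generator):
    """Keep all negative (mismatch) pixels and draw as many positives at random.

    Returns two arrays of (row, col) coordinates: positives, negatives.
    """
    negatives = np.argwhere(mismatch)
    positives = np.argwhere(match)
    k = min(len(negatives), len(positives))
    chosen = rng.choice(len(positives), size=k, replace=False)
    return positives[chosen], negatives

def filter_by_confidence(disparity: np.ndarray, confidence: np.ndarray,
                         threshold: float) -> np.ndarray:
    """Drop disparities whose confidence score falls below the threshold.

    Assumption: removed pixels are marked with NaN in the sparse output.
    """
    sparse = disparity.astype(np.float64)  # astype returns a copy
    sparse[confidence < threshold] = np.nan
    return sparse
```

Raising the threshold trades output density for accuracy: fewer pixels keep a disparity, but those that remain are the ones the network scored as more reliable.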
In this section, we present the “hyperparameters” [42] for both of the proposed CNN models, followed by a set of performance evaluations. The goal of the evaluations is to find out: 1) whether the nonparametric transforms help improve the accuracy of the disparity maps generated using the matchingNet; and 2) how well the overall dual-CNN approach performs compared to the state-of-the-art sparse stereo matching techniques.
matchingNet | | | evaluationNet | |
Attributes | Kernel size, quantity | Stride size | Attributes | Kernel size, quantity | Stride size
Input | , 1 | | Input | , 1 |
Conv1(2D) | , 32 | | Conv1(2D) | , 16 |
Conv2(3D) | , 128 | | Mp1 | , 16 |
Conv3(3D) | , 64 | | Conv2(2D) | , 32 |
FC1 | 1600 | | Mp2 | , 32 |
FC2 | 128 | | Conv3(2D) | , 64 |
Output | 2 | | Mp3 | , 64 |
 | | | Conv4(3D) | , 128 |
 | | | Mp4 | , 128 |
 | | | FC1 | 128 |
 | | | Output | 2 |
Table 1: Hyperparameters of the matchingNet and evaluationNet. Here, “Conv”, “Mp” and “FC” denote convolutional, max pooling, and fully connected layers, respectively.
Hyperparameters and implementations: The input of the matchingNet is a 3D matrix of stacked layers. The left and right images each contribute one layer for the grayscale image, one for the rank transform, and one for the companion transform. Layers from the left and right images are stored in the matrix in alternating order. For the evaluationNet, the input contains only two layers of data: the grayscale image and the raw disparity map, both for the left image. Table 1 shows the hyperparameters of our experimental models.
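The alternating layer layout described above can be sketched as follows. This is an illustration under assumptions: the helper names, the `half` patch radius, the height x width x channels array layout, and the convention that the right patch is centered at column `x - d` are not taken from the paper, and the actual patch size is left as a parameter.

```python
import numpy as np

def extract_patch(channels: np.ndarray, y: int, x: int, half: int) -> np.ndarray:
    """Cut a (2*half+1) x (2*half+1) patch from an H x W x C multichannel image."""
    return channels[y - half:y + half + 1, x - half:x + half + 1, :]

def assemble_sample(left: np.ndarray, right: np.ndarray,
                    y: int, x: int, d: int, half: int) -> np.ndarray:
    """Interleave left and right patch layers: L0, R0, L1, R1, ...

    `left` and `right` are H x W x C stacks (for example grayscale, rank
    transform, companion transform). Assumes the patch lies fully inside
    both images and that the right patch is centred at column x - d.
    """
    lp = extract_patch(left, y, x, half)
    rp = extract_patch(right, y, x - d, half)
    n, _, c = lp.shape
    sample = np.empty((n, n, 2 * c), dtype=lp.dtype)
    sample[..., 0::2] = lp  # even layers come from the left image
    sample[..., 1::2] = rp  # odd layers come from the right image
    return sample
```

Interleaving places each left layer next to its right counterpart along the third axis, which is the arrangement a 3D convolution over that axis can exploit.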
The implementation of our CNN models is based on TensorFlow using the classification cross-entropy loss, $-\left[ t \log(s) + (1 - t) \log(1 - s) \right]$, where $s$ denotes the output value. Here, we set $t = 1$ for mismatches and $t = 0$ for matches to train the matchingNet as in “MC-CNN-acrt”, but $t = 1$ for positive samples and $t = 0$ for negative samples to compute the confidence measure through the evaluationNet. Both models utilize a gradually decreasing learning rate from 0.002 to 0.0001, and reach a stable state after running 20 epochs on the full training data.
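For reference, the usual binary form of the cross-entropy loss can be written in a few lines of NumPy. This is a generic sketch, not the authors' TensorFlow code: the function name, the clipping constant `eps`, and the averaging over the batch are assumptions, and the label convention follows the text (1 for mismatches when training the matchingNet, 1 for positive samples when training the evaluationNet).

```python
import numpy as np

def binary_cross_entropy(s, t, eps: float = 1e-7) -> float:
    """Mean of -[t*log(s) + (1-t)*log(1-s)] over a batch.

    s: predicted values in (0, 1); t: target labels in {0, 1}.
    Predictions are clipped to avoid log(0).
    """
    s = np.clip(np.asarray(s, dtype=np.float64), eps, 1.0 - eps)
    t = np.asarray(t, dtype=np.float64)
    return float(np.mean(-(t * np.log(s) + (1.0 - t) * np.log(1.0 - s))))

# Confident, correct predictions give a lower loss than hesitant ones.
loss_confident = binary_cross_entropy([0.99, 0.01], [1, 0])
loss_hesitant = binary_cross_entropy([0.60, 0.40], [1, 0])
```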
Effectiveness of nonparametric transforms: The overall structures of “MC-CNN-acrt” and matchingNet are quite similar. The key difference is that the input patches of “MC-CNN-acrt” are grayscale images only, whereas our matchingNet uses additional nonparametric transforms. Hence, to evaluate the effectiveness of nonparametric transforms, we here compare the raw disparity maps generated by the two approaches. Based on the same training dataset from Middlebury [29], Figure 8 visually compares the raw disparity maps generated by “MC-CNN-acrt” and matchingNet. It suggests that the additional transforms allow the network to better handle challenging cases. Our raw disparity maps achieve compared to of “MC-CNN-acrt” regarding the mean percentage error (MPE) (over pixel difference for half resolution) of non-occlusion areas.
Comparison with sparse stereo matching approaches: Almost all state-of-the-art sparse stereo matching approaches have submitted their results to the Middlebury evaluation site [29]. Our approach (referred to as “DCNN”) on “test sparse” currently ranks the under the “bad 2.0” category. We would like to emphasize that simply comparing error rates of sparse disparity maps does not offer the whole picture of algorithm performance, as it favors approaches that output fewer disparity values (i.e., more invalid pixels). For a fair comparison, a plot of non-occlusion error rate vs. invalid pixel rate is used to show the performance of different approaches on both the training and testing datasets; see Figure 9. The comparison suggests that our approach under setting provides a very good balance between output disparity density and disparity accuracy. In addition, the plot on the training dataset also shows that, under the same output disparity density, our approach provides lower non-occlusion error rates than existing approaches. Figure 10 further visually compares the disparity maps generated by different approaches.
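The two quantities plotted against each other can be computed as in the sketch below. It is an illustration under assumptions rather than the Middlebury evaluation code: the name `sparse_metrics`, NaN as the marker for pixels without output, the default threshold of 2.0 pixels (mirroring the name of the “bad 2.0” category), and computing the error rate over valid output pixels only are all choices made for the example.

```python
import numpy as np

def sparse_metrics(disparity: np.ndarray, ground_truth: np.ndarray,
                   nonocc: np.ndarray, bad_threshold: float = 2.0):
    """Return (error_rate, invalid_rate) over non-occluded pixels.

    disparity: sparse map with NaN where no value is output (assumption).
    nonocc: boolean mask of non-occluded pixels.
    """
    evaluated = nonocc & np.isfinite(ground_truth)
    valid = evaluated & np.isfinite(disparity)
    invalid_rate = 1.0 - valid.sum() / evaluated.sum()
    if not valid.any():
        return 0.0, float(invalid_rate)
    error = np.abs(disparity[valid] - ground_truth[valid])
    error_rate = float(np.mean(error > bad_threshold))
    return error_rate, float(invalid_rate)
```

Reporting both numbers together makes the trade-off visible: a method can lower its error rate simply by outputting fewer pixels, which shows up as a higher invalid rate.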
The root-mean-square (RMS) metric [29] is also used here for evaluation. Since squared errors are used, the RMS metric penalizes large disparity errors more strongly than the average absolute error (“avgerr”) metric. Our approach on the testing dataset currently ranks at the top under the “rms” category; see Table 2.
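A small worked example shows why RMS penalizes large errors more than the average absolute error. The error values below are made up for illustration, and the helper names are hypothetical.

```python
import numpy as np

def avg_err(errors: np.ndarray) -> float:
    """Average absolute disparity error."""
    return float(np.mean(np.abs(errors)))

def rms_err(errors: np.ndarray) -> float:
    """Root-mean-square disparity error."""
    return float(np.sqrt(np.mean(np.square(errors))))

# Two made-up error sets with the same average absolute error of 3 pixels.
uniform = np.array([3.0, 3.0, 3.0, 3.0])
outlier = np.array([1.0, 1.0, 1.0, 9.0])
```

Both sets have an average error of 3 pixels, but the RMS of the second set is sqrt(21), roughly 4.58, because the single 9-pixel error dominates once the errors are squared.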
Name | RMS | Name | RMS
DCNN | | MPSV [2] |
RNCC (unpublished) | | INTS [13] |
IDR [17] | | SGM [11] |
AUC evaluation: The Area Under the Curve (AUC) metric introduced by Hu and Mordohai [12] has been used as a metric for evaluating various confidence measures over the past few years. It measures how effectively the confidence measures can filter out mismatches under different parameter settings, rather than only checking the performance under one set of parameters. Since a large set of sparse disparity maps needs to be evaluated, this measure can only be computed on datasets with published ground truth. Following the practice in [35], we train our dual-CNN approach only on the 13 additional image pairs with ground truth from Middlebury [29] and then test it on the 15 training image pairs. Our approach achieves a competitive mean AUC value compared to 0.0728, 0.0680 and 0.0637 attained respectively by the state-of-the-art approaches APKR [16], O1 [24] and CCNN [25] reported in [35], which compares various confidence measures on the raw disparity maps from [42].
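The idea behind the AUC metric can be sketched as follows: pixels are ranked by confidence, the error rate is measured at every output density, and the resulting curve is integrated. This is a simplified illustration, not the evaluation protocol of [12] or [35]: the name `sparsification_auc`, sampling the curve at every pixel rather than at fixed density steps, and the trapezoidal integration are assumptions.

```python
import numpy as np

def sparsification_auc(confidence: np.ndarray, is_error: np.ndarray) -> float:
    """Area under the error-rate vs. density curve (lower is better).

    confidence: 1D array of confidence scores, one per pixel.
    is_error:   1D boolean array, True where the disparity is wrong.
    """
    order = np.argsort(-confidence, kind="stable")  # most confident first
    errors_sorted = is_error[order].astype(np.float64)
    n = len(errors_sorted)
    kept = np.arange(1, n + 1)
    density = kept / n                               # fraction of pixels kept
    error_rate = np.cumsum(errors_sorted) / kept     # error among kept pixels
    # Trapezoidal rule over the sampled curve.
    widths = np.diff(density)
    heights = (error_rate[1:] + error_rate[:-1]) / 2.0
    return float(np.sum(widths * heights))
```

A confidence measure that assigns its lowest scores to the wrong pixels keeps the error rate near zero until the highest densities, which yields a small area; a measure that ranks wrong pixels highly yields a large one.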
A novel learning-based semi-dense stereo matching algorithm is presented in this paper. The algorithm employs two CNN models. The first CNN model evaluates how well two image patches match. It serves the same purpose as “MC-CNN-acrt”, but takes additional rank and companion transforms as input. These two transforms introduce global information and distinguishable patterns into the network; hence, areas with lighting changes and/or lack of textures can be more accurately matched. As a result, the optimal disparity values can be computed using the simplest WTA optimization. No complicated global disparity optimization algorithms or additional postprocessing steps are required. The second CNN model is used for evaluating the generated disparity values and filtering out mismatches. Taking only one of the stereo images and the disparity map as input, the evaluationNet can effectively label mismatches, without the need for heuristically designed processes such as left-right consistency checks and median filtering.
Our work suggests that, once sufficient information is fed to the network, CNN-based models can effectively predict the correct matches and detect mismatches. For future work, we plan to investigate how to reduce the training and labeling costs so that the algorithm can be applied to real-time applications. We also plan to apply the algorithm to multi-view stereo matching for 3D reconstruction applications.