Cost aggregation is a key component of stereo matching [scharstein2002taxonomy]: it filters the cost volume to rectify mismatched pixels using the context information within the support window of each pixel. Most existing aggregation methods [marr1979computational, jen2011adaptive, hu2013comparisons, kendall2017end, chang2018pyramid, guo2019group, duggal2019deeppruner, wu2019semantic, zhang2014cross] incorporate multi-scale processing to adjust the receptive field of the filters, so that appropriate context information is available for areas of different sizes. Under multi-scale processing, a cost volume filtered at a coarse scale needs to be upsampled to a fine scale. These methods simply upsample the cost volume through nearest-neighbor/bilinear interpolation or deconvolution, all of which apply fixed mapping weights over the entire volume. However, fixed mapping weights can hardly recover, in an adaptive manner, the details that are poorly represented at coarse scales.
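To make the notion of fixed mapping weights concrete, the following minimal sketch upsamples a toy cost volume by nearest-neighbor copying. The function name `nearest_upsample_cost` and the `(D, H, W)` array layout are assumptions made for illustration only; they are not taken from any of the cited implementations.

```python
import numpy as np

def nearest_upsample_cost(cost, r=2):
    """Upsample a (D, H, W) cost volume by ratio r with nearest-neighbor copying.

    Every fine-scale site copies exactly one coarse-scale site, so the mapping
    weights are the same everywhere and never depend on the image content.
    """
    up = cost
    for axis in range(3):
        up = np.repeat(up, r, axis=axis)
    return up

coarse = np.arange(8, dtype=np.float32).reshape(2, 2, 2)
fine = nearest_upsample_cost(coarse, r=2)
```

A thin structure that occupies less than one coarse-scale site cannot be restored by such a mapping, which is the limitation that content-aware weights are meant to address.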
In this paper, we present a novel content-aware inter-scale cost aggregation method that adaptively aggregates and upsamples the cost volume from a coarse scale to a fine scale by learning dynamic filter weights according to the content of the left and right views at the two scales. Our method achieves reliable detail recovery during upsampling by aggregating information across scales. Content-aware filter weights are learned for each site in the fine-scale cost volume to adaptively aggregate information and to amplify the details remaining in the coarse-scale cost volume; see Figure 1 for results achieved by our method.
Our method uses a novel strategy for effective content-aware weight learning that is built on the following ideas. (i) We use the left and right views, which we call stereo views for convenience, to learn content-aware filter weights for cost aggregation. In contrast, existing content-aware filters for cost aggregation use only single-view information (i.e., features from the left view) to learn the weights of guided filters [zhang2019ga] or affinity kernels [cheng2019learning]. A cost volume encodes 3D information from stereo views, as shown in Figure 2(a). Compared with guidance based on 2D textural information from a single view, guidance based on 3D information from stereo views can yield better cost aggregation results. However, it is not straightforward to exploit stereo views for weight learning. Directly concatenating the feature maps from the stereo views does not work well, since their unaligned contents undermine each other's guidance. We will explain later how we use stereo views. (ii) Each content-aware weight of a filter represents the similarity between two points with a specific spatial relationship. Existing methods apply a series of convolutions with a large receptive field to a unitary feature input to learn the content-aware weights [jia2016dynamic, jo2018deep, cheng2019learning, wang2019carafe, liu2017learning, zhang2019ga, mazzini2018guided], which we call implicit spatial relationship encoding, as shown in Figure 2(b). In contrast, we explicitly encode the spatial relationship to learn the content-aware weights, through the shifting of feature maps from two scales and the construction of a location map; we call this explicit spatial relationship encoding. In more detail, the feature maps from the two scales are shifted relative to each other according to each spatial relationship and then concatenated as the input for learning the corresponding weights.
Location maps carrying the relative position information are also added to the input. As each content-aware weight is tied to a specific spatial relationship between two pixels, it is better to learn this pairwise relationship from paired feature inputs than from a unitary feature input alone. Moreover, it becomes possible to use only convolutions to generate the content-aware weights, which is more efficient than previous methods. The effectiveness of our strategy is shown in the ablation study.
Our method also uses a decomposition strategy to efficiently construct the 3D filter weights and aggregate the 3D cost volume. This mechanism avoids content-aware weight learning and aggregation in the full 3D spatial-disparity space, which would incur a huge computation cost. Instead of directly learning 3D filter weights for aggregation on the 3D cost volume, we only learn 2D similarities between adjacent scales for the left and right views. The 3D content-aware weights are then constructed on-the-spot from those 2D similarities via warping with the corresponding disparity. Both memory and computation are reduced significantly in this way. To further reduce the computation cost, we split the inter-scale aggregation in the full 3D spatial-disparity space into an aggregation in the 1D disparity space followed by an aggregation in the 2D spatial space. As a result, our method can efficiently construct and use 3D content-aware weights for inter-scale cost aggregation.
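As a rough illustration of why the decomposition helps, the sketch below counts multiply-accumulate operations per fine-scale voxel and the number of learned weights for a window of side `k`. The function names, the cubic `k x k x k` window, and the assumption of one `k x k` weight window per pixel and per view are ours; the counts are illustrative and are not the FLOPs reported in the experiments.

```python
def macs_full_3d(k):
    """Multiply-accumulates per output voxel for one k x k x k weighted sum."""
    return k ** 3

def macs_split(k):
    """Multiply-accumulates per output voxel for a 1D disparity pass (k taps)
    followed by a 2D spatial pass (k * k taps)."""
    return k + k ** 2

def learned_weights_full_3d(k, D, H, W):
    """Weights to predict if every voxel of a (D, H, W) volume had its own 3D window."""
    return (k ** 3) * D * H * W

def learned_weights_2d(k, H, W):
    """Weights to predict if only one 2D window per pixel is learned for each
    of the two views (assumption made for this sketch)."""
    return 2 * (k ** 2) * H * W

ratio_macs = macs_full_3d(3) / macs_split(3)
ratio_weights = learned_weights_full_3d(3, 48, 64, 128) / learned_weights_2d(3, 64, 128)
```

For `k = 3` the split needs 12 instead of 27 multiply-accumulates per voxel, and the number of predicted weights no longer grows with the disparity range `D`.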
Our contributions are summarized as follows:
We present a novel method of content-aware inter-scale cost aggregation that can not only upsample the cost volume for reliable detail recovery, but also aggregate the information across scales to improve the quality of disparity estimation.
We present a novel strategy of effective content-aware weight learning for inter-scale aggregation on 3D cost volume, by using 3D information from stereo views and by explicit spatial relationship encoding between two scales.
We present a novel mechanism of 3D filter weight construction and 3D cost aggregation decomposition, to efficiently construct and use 3D content-aware weights for inter-scale cost aggregation. Both memory consumption and computation cost can be reduced significantly.
2 Related Work
2.1 Multi-Scale Cost Aggregation
Most existing multi-scale cost aggregation methods improve stereo matching results using local context information with a coarse-to-fine strategy [marr1979computational, hu2013comparisons, jen2011adaptive]. Zhang et al. [zhang2014cross] argue that the information from different scales is complementary, and they present cross-scale cost aggregation, which fuses the cost volumes from all scales simultaneously to achieve inter-scale consistency. Recently, it has become popular to embed cost aggregation into deep stereo networks for end-to-end learning. Kendall et al. [kendall2017end] first propose to incorporate cost aggregation into an end-to-end deep stereo network via multi-scale 3D convolution. Chang and Chen [chang2018pyramid] use three stacked hourglass networks to enlarge the receptive field for multi-scale cost aggregation. Yu et al. [yu2018deep] propose an explicit cost aggregation to select the best cost proposal. To relieve the influence of challenging regions, Zhang et al. [zhang2019ga] formulate traditional SGM as a deep network and propose a semi-global aggregation layer and a local guided aggregation layer. For better details and less noise, Cheng et al. [cheng2019learning] propose to propagate information within the disparity space and the scale space. The above methods use fixed kernels/weights to map the cost volume to the next scale. However, fixed kernels/weights cause a serious loss of details, as it is difficult for a fixed mapping to selectively enhance the details that are poorly represented at coarser scales. In contrast, our work focuses on content awareness, which naturally adapts the mapping to different areas and dynamically recovers the details.
2.2 Guided Upsampling
In one-stage guided upsampling, Shi et al. [shi2016real] learn convolution filters to replace the handcrafted bicubic filter. Tian et al. [tian2019decoders] design a data-dependent upsampling to replace the bilinear operation, which learns an inverse linear projection via reconstruction error. Mazzini et al. [mazzini2018guided] point out that it is easy to predict wrong labels on object boundaries using vanilla upsampling. Thus they predict a high-resolution pixel-specific guidance offset table for upsampling. Wang et al. [wang2019carafe] argue that nearest neighbor and bilinear interpolation cannot capture the rich semantic information and propose to generate the mask from input to achieve instance-specific awareness.
In our work, we address the loss of details caused by fixed weights in cost volume mapping, and build adaptive kernels to fit the mapping in different areas. Our work differs from the aforementioned methods in the following respects. (i) We explicitly encode the spatial relationship to extract the mapping between fine-scale and coarse-scale feature maps as guidance, while previous guidance generation methods depend only on the filtering effects over feature maps from a single scale or on the direct concatenation of feature maps. The superiority of the explicit spatial encoding scheme is demonstrated in our experiments. (ii) We further consider the stereo views in stereo matching and leverage the guidance from both views to achieve content awareness, while prior methods use only one view. The effectiveness of stereo views is also verified in our ablation study. (iii) Previous guided upsampling methods are based on 2D feature maps, and applying them directly to a 3D cost volume would consume huge computation resources. In contrast, our method is much more efficient, as we build the 3D kernels on-the-spot from 2D guidance and split the 3D spatial-disparity space into a 1D disparity space and a 2D spatial space.
2.3 Content-Aware Filtering
Content-aware filters are usually built from a series of convolutions on the reference feature maps. Jia et al. [jia2016dynamic] propose to learn the filter conditioned on the inputs. Zhang et al. [zhang2019ga] and Cheng et al. [cheng2019learning] learn the affinity along each direction via filtering effects, as shown in Figure 2(b). Methods for other tasks, such as super resolution [jo2018deep], segmentation [he2019adaptive] and depth upsampling [hui2016depth], also depend on filtering effects to learn the weights along a specific direction. In contrast, we explicitly encode pairwise spatial relationships into the input, as presented in Figure 2(b). The explicit scheme has a lower computation cost than the above methods, as we use shared convolutions with a small kernel size, while the above methods need a different series of convolutions with larger kernels. Also, instead of using only the left view, we learn the content-aware filter weights for cost aggregation from the stereo views, which are inherent in the cost volume, as shown in Figure 2(a).
As shown in Figure 3(b), our method includes a guidance generation module and an inter-scale cost aggregation module, which upsamples the cost volume for reliable detail recovery and aggregates cross-scale information to improve the quality of disparity estimation.
3.1 Guidance Generation
For content awareness, we embed our method in a stereo network such as PSMNet [chang2018pyramid] and extract the mapping between features from two scales as guidance, where the features come from shallow layers of the network. We also design an explicit spatial encoding scheme, consisting of the shifting of feature maps from two scales and the construction of a location map, to learn the guidance efficiently for the subsequent inter-scale cost aggregation.
Suppose there are a feature map $F^{c}$ at the coarse scale and a feature map $F^{f}$ at the fine scale, where $c$ denotes the coarse scale, $f$ denotes the fine scale, and the ratio between the two scales is $s$. We first expand $F^{c}$ to $\hat{F}^{c}$ by nearest upsampling, which is an identity mapping. Then, $\hat{F}^{c}$ and $F^{f}$ are concatenated after being shifted with respect to the current spatial relationship. More specifically, we achieve the shifting by applying zero-padding to $\hat{F}^{c}$ and $F^{f}$ in opposite directions. For example, as shown in Figure 4, if the current window size is $k \times k$ and the target position of the weight in the window is the upper-right corner, we add zero-padding of size $p$ to $\hat{F}^{c}$ along the upper-right and to $F^{f}$ along the lower-left. An adaptive location map $L$, explained in detail below, is also concatenated to introduce the relative position information. The guidance $G_{t}$ in direction $t$ is finally computed through three convolutions, denoted $\mathrm{Convs}$, on the concatenation $X_{t}$ of the shifted $\hat{F}^{c}$, the shifted $F^{f}$ and $L$:

$$G_{t} = \mathrm{Convs}(X_{t}). \tag{1}$$
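The shifting by opposite zero-padding can be sketched in a few lines of NumPy. This is a minimal illustration for the upper-right direction only; the function name `paired_input_upper_right`, the `(C, H, W)` layout and the padding size `p = 1` are assumptions for the example, and the location map and the convolutions are omitted.

```python
import numpy as np

def paired_input_upper_right(coarse_up, fine, p=1):
    """Pair every fine-scale pixel with its upper-right coarse-scale neighbor.

    coarse_up: (C, H, W) coarse feature map already expanded to the fine size.
    fine:      (C, H, W) fine feature map.
    The coarse map is zero-padded at the top and right, the fine map at the
    bottom and left, so the two maps end up displaced by p pixels diagonally.
    """
    c = np.pad(coarse_up, ((0, 0), (p, 0), (0, p)))  # rows move down, columns stay
    f = np.pad(fine, ((0, 0), (0, p), (p, 0)))       # rows stay, columns move right
    return np.concatenate([c, f], axis=0)            # stacked along channels

coarse_up = np.arange(9, dtype=np.float32).reshape(1, 3, 3)
fine = 10.0 * np.arange(9, dtype=np.float32).reshape(1, 3, 3)
pair = paired_input_upper_right(coarse_up, fine, p=1)
```

At padded position `(1, 1)` the second channel holds the fine pixel `(y=1, x=0)` and the first channel holds the coarse pixel `(y=0, x=1)`, i.e. its upper-right neighbor, so a convolution that looks at a single location already sees the pair it has to compare.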
Taking ratio value and center direction where as an example, is designed as
where is a process repeating along rows and columns for times. Furthermore, the final guidance from different directions are regularized by softmax along direction dimension where represents the mapping from position at coarse scale to position at fine scale. Then, we define the entire guidance generation module as :
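The softmax regularization along the direction dimension can be illustrated as follows. The tensor layout `(K, H, W)`, with one channel per direction in the window, and the function name are assumptions made for this sketch.

```python
import numpy as np

def softmax_over_directions(logits):
    """Normalize raw guidance of shape (K, H, W) so that, at every fine-scale
    position, the K directional weights are positive and sum to one."""
    z = logits - logits.max(axis=0, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
raw_guidance = rng.normal(size=(9, 4, 6))   # 9 directions of a 3 x 3 window
guidance = softmax_over_directions(raw_guidance)
```

With this normalization, each fine-scale cost becomes a convex combination of the coarse-scale costs in its window.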
3.2 Inter-Scale Cost Aggregation
3.2.1 Full 3D Spatial-Disparity Space
In most stereo networks, the cost volume mapping is formulated as a weighted sum within the support window:

$$C^{f}(p) = \sum_{q \in W(p)} w_{q}\, C^{c}(q), \tag{4}$$

where $p$ and $q$ are 3D positions (in spatial-disparity space) in the cost volume, $C^{c}$ and $C^{f}$ are the cost volumes at the coarse scale and the fine scale respectively, $W(p)$ is the 3D window centered at $p$, and $w_{q}$ is the weight of neighbor $q$. We use feature maps from two scales to explicitly determine the weights in different areas, instead of applying the same weights over the entire cost volume (as bilinear interpolation and deconvolution do). We replace the weight in Eq. 4 with the learned guidance $G$:

$$C^{f}(x, y, d) = \sum_{(x', y', d') \in W(x, y, d)} G\big((x', y', d') \to (x, y, d)\big)\, C^{c}(x', y', d'), \tag{5}$$

where the 3D positions $p$ and $q$ in Eq. 4 are unfolded as $(x, y, d)$ and $(x', y', d')$. In deconvolution and bilinear interpolation, $C^{c}$ is filled with zeros to change its size before the weighted summation, so many zeros appear in the receptive field and provide no information. In contrast, we sample directly from $C^{c}$, which means that, for a window of fixed size, our method achieves a much larger receptive field than deconvolution and bilinear interpolation.
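The weighted sum with per-site weights, sampled directly from the coarse volume without zero insertion, can be sketched as below. For brevity this is a 2D spatial simplification of the 3D formulation; the function name, the `(k*k, r*H, r*W)` guidance layout, the zero border and the default values `r = 2`, `k = 3` are assumptions for illustration.

```python
import numpy as np

def content_aware_upsample(coarse, guidance, r=2, k=3):
    """Upsample a (H, W) slice to (r*H, r*W) with one k x k weight window per
    fine-scale site.

    guidance[i, y, x] is the weight that fine site (y, x) assigns to the i-th
    coarse neighbor (row-major order) around the coarse site (y // r, x // r).
    """
    H, W = coarse.shape
    pad = k // 2
    padded = np.pad(coarse, pad)                      # zero border
    ys, xs = np.meshgrid(np.arange(r * H), np.arange(r * W), indexing="ij")
    cy, cx = ys // r, xs // r                         # coarse site under each fine site
    fine = np.zeros((r * H, r * W), dtype=np.float64)
    i = 0
    for dy in range(k):
        for dx in range(k):
            fine += guidance[i] * padded[cy + dy, cx + dx]
            i += 1
    return fine

coarse = np.array([[1.0, 2.0], [3.0, 4.0]])
center_only = np.zeros((9, 4, 4))
center_only[4] = 1.0                                  # all weight on the center tap
fine = content_aware_upsample(coarse, center_only)
```

Putting all the weight on the center tap reproduces nearest-neighbor upsampling, which shows that fixed interpolation is a special case of this mapping; learned weights can instead vary from site to site. Every tap reads a real coarse-scale cost, so a `k x k` window covers `k x k` coarse sites.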
As mentioned above, we do not directly learn the content-aware 3D kernel. Instead, we build the kernel on-the-spot from the 2D guidance $G^{l}$ and $G^{r}$ ($l$: left view, $r$: right view):
where the guidance of each view is estimated from the coarse-scale and fine-scale features of that view by a shared guidance generation module. We leverage the property of the cost volume that the disparity dimension corresponds to a pixel shift. In other words, the position $(x, y, d)$ in the cost volume represents $(x, y)$ in the left image and $(x - d, y)$ in the right image. Through this strategy, we improve the efficiency and reduce the computation cost compared with directly learning the 3D content-aware kernel.
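The pixel-shift property can be sketched as a warping of a right-view guidance map into the left-view cost-volume coordinates. The function names and the `(H, W)` layout are assumptions; in particular, combining the two views by an element-wise product in `build_3d_weights` is only one plausible choice made for this illustration, not the combination rule of the method.

```python
import numpy as np

def warp_right_to_left(g_right, max_disp):
    """Return a (max_disp, H, W) volume with out[d, y, x] = g_right[y, x - d],
    and zero where x - d falls outside the right image."""
    H, W = g_right.shape
    out = np.zeros((max_disp, H, W), dtype=g_right.dtype)
    for d in range(min(max_disp, W)):
        out[d, :, d:] = g_right[:, :W - d]
    return out

def build_3d_weights(g_left, g_right, max_disp):
    """Build a 3D weight volume on the fly from two 2D maps.
    Assumption for this sketch: the views are combined by a product."""
    return g_left[None, :, :] * warp_right_to_left(g_right, max_disp)

g_left = np.ones((3, 4))
g_right = np.arange(12, dtype=np.float64).reshape(3, 4)
warped = warp_right_to_left(g_right, max_disp=3)
weights = build_3d_weights(g_left, g_right, max_disp=3)
```

Only the two 2D maps need to be stored; the disparity dimension is obtained by shifting, which is where the memory saving comes from.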
3.2.2 1D Disparity Space
In the first step which we call as 1D disparity inter-scale cost aggregation, the positions in cost volume along disparity dimension correspond to in left image and in right image, where is the size of window . For the second step which we call as 2D spatial inter-scale cost aggregation, each in cost volume corresponds to in left image. More clearly, the 1D disparity inter-scale cost aggregation is formulated as
where , , , and are the feature maps respectively from left and right views.
3.2.3 2D Spatial Space
As to the 2D spatial inter-scale cost aggregation, it is formulated as
where . Through the above two-step splitting scheme, we achieve the transformation from to and then from to . We also greatly reduce the computation complexity in terms of FLOPs (floating-point operations) even compared to 3D de-convolution.
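The two-step splitting can be sketched as a 1D pass along the disparity dimension followed by a 2D pass over the spatial dimensions. The function names, the weight layouts, the zero borders and the defaults `r = 2`, `k = 3` are assumptions for illustration; the weights are taken as given inputs rather than learned.

```python
import numpy as np

def aggregate_disparity(cost, w_disp, r=2, k=3):
    """Step 1: (D, H, W) -> (r*D, H, W) using k taps along disparity.
    w_disp has shape (k, r*D, H, W)."""
    D, H, W = cost.shape
    pad = k // 2
    padded = np.pad(cost, ((pad, pad), (0, 0), (0, 0)))
    cd = np.arange(r * D) // r                 # coarse disparity under each fine one
    out = np.zeros((r * D, H, W))
    for i in range(k):
        out += w_disp[i] * padded[cd + i]
    return out

def aggregate_spatial(cost, w_spat, r=2, k=3):
    """Step 2: (D2, H, W) -> (D2, r*H, r*W) using k*k spatial taps.
    w_spat has shape (k*k, D2, r*H, r*W)."""
    D2, H, W = cost.shape
    pad = k // 2
    padded = np.pad(cost, ((0, 0), (pad, pad), (pad, pad)))
    cy = (np.arange(r * H) // r)[:, None]
    cx = (np.arange(r * W) // r)[None, :]
    out = np.zeros((D2, r * H, r * W))
    i = 0
    for dy in range(k):
        for dx in range(k):
            out += w_spat[i] * padded[:, cy + dy, cx + dx]
            i += 1
    return out

rng = np.random.default_rng(0)
coarse = rng.normal(size=(2, 2, 2))
w_disp = np.zeros((3, 4, 2, 2)); w_disp[1] = 1.0      # center tap only
w_spat = np.zeros((9, 4, 4, 4)); w_spat[4] = 1.0      # center tap only
fine = aggregate_spatial(aggregate_disparity(coarse, w_disp), w_spat)
```

With center-only weights the two passes reduce to nearest-neighbor upsampling in all three dimensions, a convenient sanity check before plugging in learned weights.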
In this section, we evaluate our method on a synthetic dataset (Scene Flow [guney2015displets]), an outdoor dataset (KITTI 2015 [menze2015object]), and an indoor dataset (Middlebury-v3 [scharstein2014high]). We use PSMNet [chang2018pyramid] as the baseline network to demonstrate the effectiveness of our method, replacing the bilinear interpolation and deconvolution in PSMNet with our method. All models are implemented in PyTorch and optimized using the Adam algorithm with $\beta_1$ of 0.9, $\beta_2$ of 0.999 and a batch size of 4. We use four NVIDIA GTX 1080Ti GPUs for training. During training, color normalization is applied to each input image, which is then cropped to a fixed resolution. We train our network on Scene Flow for 10 epochs with a learning rate of 0.001. We then fine-tune the network on KITTI 2015 and set the learning rate to 0.001, 0.0001 and 0.00003 for the first 200 epochs, the next 400 epochs and an additional 600 epochs respectively. For Middlebury-v3, we also fine-tune the model pre-trained on Scene Flow. The learning rate is set to 0.001 for 300 epochs and then changed to 0.0001 for the remaining 600 epochs. In all the experiments, no post-processing or unsupervised learning is used.
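The fine-tuning schedules described above are piecewise constant and can be written down as a small lookup. The helper `piecewise_lr` and the constant names are hypothetical, introduced here only to restate the schedules; epochs are assumed to be counted from zero.

```python
def piecewise_lr(epoch, stages):
    """Return the learning rate for a 0-indexed epoch.

    stages is a list of (number_of_epochs, learning_rate) pairs applied in order.
    """
    start = 0
    for num_epochs, lr in stages:
        if epoch < start + num_epochs:
            return lr
        start += num_epochs
    raise ValueError("epoch %d is beyond the %d scheduled epochs" % (epoch, start))

# Schedules as stated in the text.
SCENE_FLOW_STAGES = [(10, 1e-3)]
KITTI_STAGES = [(200, 1e-3), (400, 1e-4), (600, 3e-5)]
MIDDLEBURY_STAGES = [(300, 1e-3), (600, 1e-4)]

kitti_total_epochs = sum(n for n, _ in KITTI_STAGES)
middlebury_total_epochs = sum(n for n, _ in MIDDLEBURY_STAGES)
```

Under this reading, the KITTI 2015 fine-tuning runs for 1,200 epochs in total and the Middlebury-v3 fine-tuning for 900.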
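Table 1: Ablation on Scene Flow. Bilinear interpolation and/or deconvolution in PSMNet are replaced with our method (EPE, lower is better).

| Model | Bilinear → Ours | Deconv → Ours | EPE |
|---|---|---|---|
| PSMNet (Bilinear → Ours) | ✓ | | 0.61 |
| PSMNet (Deconv → Ours) | | ✓ | 0.76 |
| PSMNet (All → Ours) | ✓ | ✓ | 0.57 |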
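Table 2: Ablation of the content-aware weight learning strategy on Scene Flow (EPE, lower is better).

| Setting | EPE |
|---|---|
| w/o Spatial Relationship Encoding | 0.67 |
| w/o Stereo Views | 0.63 |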
4.1 Ablation Study
An ablation study is conducted on the Scene Flow dataset to show the benefit of each component of our method. We test the superiority of our method over bilinear interpolation and deconvolution. The effectiveness of our strategy for content-aware weight learning is verified by identifying the importance of explicit spatial encoding and of the use of stereo views for weight learning. We also test a more challenging situation to demonstrate the power of our method.
4.1.1 Superiority Compared to Bilinear Interpolation
To show the superiority of our method over bilinear interpolation, we compare the result of PSMNet with that of a variant in which the bilinear interpolation is replaced with our method. As shown in Table 1, PSMNet (Bilinear → Ours) improves the result of PSMNet by 0.48 in EPE, which shows the superiority of the proposed method over bilinear interpolation for cost volume upsampling.
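For reference, the end-point error used in these comparisons is the mean absolute difference between predicted and ground-truth disparities, in pixels. A minimal sketch follows; the function name and the optional validity mask are assumptions for illustration.

```python
import numpy as np

def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error in pixels over the valid pixels."""
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    return float(np.abs(pred - gt)[valid].mean())

gt = np.array([[10.0, 20.0], [30.0, 40.0]])
pred = np.array([[10.5, 19.0], [30.0, 42.0]])
epe_all = end_point_error(pred, gt)
epe_masked = end_point_error(pred, gt, valid=gt < 35.0)
```

A reduction of 0.48 in EPE therefore means that the predicted disparities are, on average, 0.48 pixels closer to the ground truth.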
4.1.2 Superiority Compared to Deconvolution
We also analyze the superiority of our method over deconvolution by replacing the two deconvolutions in PSMNet with our method. As shown in Table 1, PSMNet (Deconv → Ours) outperforms PSMNet by 0.33 in EPE, which shows the superiority of the proposed method over deconvolution for inter-scale cost aggregation. We further test the performance of PSMNet (All → Ours), which is used in the following benchmark performance comparison. An additional improvement can be observed when both bilinear interpolation and deconvolution are replaced.
4.1.3 Importance of Explicit Spatial Relationship Encoding
We verify the importance of using explicit spatial relationship encoding to learn the content-aware similarity/affinity weights. As shown in Table 2, the performance drops sharply if the input is a direct concatenation of the feature maps from the two scales without explicit spatial encoding. This setting can be viewed as a unitary feature input obtained by fusing multi-scale feature maps.
4.1.4 Importance of Stereo Views
We verify the importance of using 3D information from stereo views to learn content-aware weights for aggregation on the 3D cost volume. As presented in Table 2, a worse result is obtained when the guidance is based only on 2D textural information from the left view.
4.1.5 Performance in Challenging Situation
In addition to the above experiments, we test the performance of our method in a more challenging situation. Instead of upsampling the cost volume from to and then to , we directly transform the cost volume from ratio to . In this challenging situation, our method achieves accuracy of 0.89 in EPE which is still better than the result of PSMNet. Such results not only demonstrate the power of our method but also indicate its potential applications under resource constrained situations like mobile phones.
4.2 Benchmark Performance
Table 3: End-point error (EPE) of state-of-the-art methods on the Scene Flow dataset.

| Method | EPE | Method | EPE |
|---|---|---|---|
| DispNetC [mayer2016large] | 1.68 | SSPCV-Net [wu2019semantic] | 0.87 |
| CRL [pang2017cascade] | 1.32 | DeepPruner-best [duggal2019deeppruner] | 0.86 |
| GCNet [kendall2017end] | 1.84 | GA-Net-15 [zhang2019ga] | 0.84 |
| PDS-Net [tulyakov2018practical] | 1.12 | CSPN [cheng2019learning] | 0.78 |
| EdgeStereo [song2018edgestereo] | 1.11 | GwcNet-gc [guo2019group] | 0.77 |
| StereoNet [khamis2018stereonet] | 1.10 | EdgeStereo V2 [song2019edgestereo] | 0.74 |
4.2.1 Scene Flow Dataset
Scene Flow is a large synthetic dataset containing 34,896 training images and 4,248 testing images with a size of $960 \times 540$. This dataset has three rendered sub-datasets: FlyingThings3D, Monkaa, and Driving. FlyingThings3D is rendered from the ShapeNet dataset and has 21,828 training samples and 4,248 testing samples. Monkaa is rendered from the animated film Monkaa and has 8,666 training samples. Driving is constructed from naturalistic, dynamic street scenes seen from the viewpoint of a driving car and has 4,402 training samples.
We compare our model with other state-of-the-art methods on this dataset. As shown in Table 3, our model achieves the best end-point error (EPE) of 0.57, nearly halving the error of the baseline, PSMNet. We also visualize the results of PSMNet and ours. In Fig. 6, the improvements of our method are clearly visible, not only in fine-grained areas but also in large textureless areas.
To better visualize the capability of our method in detail recovery, we additionally provide a set of multi-scale disparity maps obtained from our method and from PSMNet (see Figures 9-14). As PSMNet only outputs the full-resolution result, we add the same regression module onto the cost volume at each scale and retrain it on Scene Flow. From the multi-scale illustration (Figures 9-14), it is evident that our method avoids overlapping the foreground and background, and achieves reliable recovery of details, which is important for high-quality 3D reconstruction and for the detection of distant objects in autonomous driving.
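| Models | Noc-bg (%) | Noc-fg (%) | Noc-all (%) | All-bg (%) | All-fg (%) | All-all (%) | time (s) |
|---|---|---|---|---|---|---|---|
| EdgeStereo V2 [song2019edgestereo] | 1.69 | 2.94 | 1.89 | 1.84 | 3.30 | 2.08 | 0.32 |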
4.2.2 KITTI 2015 Dataset
Unlike the synthetic Scene Flow dataset, KITTI 2015 is a real-world dataset with street views captured from a driving car. It contains 200 training stereo image pairs with sparse ground-truth disparities obtained using LiDAR and another 200 testing image pairs without ground-truth disparities. We use 160 training image pairs for training and hold out the remaining 40 for evaluation.
To show the performance of our model in real outdoor scenes, we test it on KITTI 2015 and compare it with state-of-the-art methods. As shown in Table 4, we achieve competitive performance on the KITTI 2015 benchmark. Although CSPN [cheng2019learning] is better than ours in both Noc-all and All-all, our method is nearly 3 times faster. Furthermore, CSPN is benchmarked on an Nvidia P40, while our model is benchmarked on a GTX 1080Ti. Apart from the quantitative analysis, a qualitative comparison between our model and PSMNet is given in Figure 7. The differences between PSMNet and ours are easy to see in the figure, especially in the black and white boxes we marked.
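The percentages in the Noc and All columns are outlier rates. On the KITTI 2015 benchmark, a pixel is conventionally counted as an outlier when its disparity error exceeds both 3 pixels and 5% of the ground-truth disparity, and pixels without ground truth are ignored. The sketch below follows that convention; the function name and the encoding of missing ground truth as `0` are assumptions for illustration, and the official development kit should be used for reported numbers.

```python
import numpy as np

def kitti_outlier_rate(pred, gt):
    """Percentage of valid pixels whose error is > 3 px and > 5% of the true disparity.

    Pixels with gt == 0 are treated as having no ground truth (sparse LiDAR).
    """
    valid = gt > 0
    err = np.abs(pred - gt)[valid]
    bad = (err > 3.0) & (err > 0.05 * gt[valid])
    return 100.0 * float(bad.mean())

gt = np.array([10.0, 100.0, 0.0, 50.0])
pred = np.array([14.0, 104.0, 7.0, 50.0])
rate = kitti_outlier_rate(pred, gt)
```

In this toy case only the first pixel is an outlier: the second has a 4-pixel error that is below 5% of its true disparity, and the third has no ground truth.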
|Models||Res||bad 1||bad 4||avgerr||rms||A90||A95||time|
4.2.3 Middlebury-v3 Dataset
Middlebury-v3 [scharstein2014high] is collected in the real world from static indoor scenes with complicated and rich details. Its ground-truth disparities are acquired via structured light and are both dense and precise. There are 15 stereo pairs for training and 15 stereo pairs for testing. Each pair is provided in three resolutions (full, half and quarter), and we use the quarter resolution in our experiments.
To better demonstrate the ability of our model in sophisticated and fine-grained indoor scenes, we compare the proposed model with state-of-the-art methods on Middlebury-v3. As we implement our method on PSMNet and our GPU can only afford the computation cost of the quarter resolution when running PSMNet, we mainly focus on the quarter resolution for a fair comparison. As illustrated in Table 5, our method achieves the best performance. Compared to PSMNet, our method shows large accuracy improvements at a lower computation cost. Compared with the remaining methods, we achieve better results in most metrics with the same or even half of their input resolution. Visual results are also provided to better present the performance of our method. In Figure 8, there are significant differences between ours and the others, especially in the red boxes.
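The column names of Table 5 follow the Middlebury evaluation: bad N is the percentage of pixels whose error exceeds N pixels, avgerr and rms are the mean absolute and root-mean-square errors, and A90/A95 are the 90th and 95th percentiles of the error. A minimal sketch follows; the function name is an assumption, and details of the official evaluation such as occlusion masks and resolution scaling are omitted.

```python
import numpy as np

def middlebury_style_metrics(pred, gt):
    """Compute simplified versions of the Table 5 metrics on dense disparities."""
    err = np.abs(pred - gt).ravel()
    return {
        "bad 1": 100.0 * float(np.mean(err > 1.0)),
        "bad 4": 100.0 * float(np.mean(err > 4.0)),
        "avgerr": float(err.mean()),
        "rms": float(np.sqrt(np.mean(err ** 2))),
        "A90": float(np.percentile(err, 90)),
        "A95": float(np.percentile(err, 95)),
    }

gt = np.zeros(10)
pred = np.arange(10, dtype=np.float64)     # absolute errors 0, 1, ..., 9
metrics = middlebury_style_metrics(pred, gt)
```

The percentile metrics are sensitive to the tail of the error distribution, so they complement bad N and avgerr when judging how well fine details are recovered.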
In this paper, we have presented a content-aware inter-scale cost aggregation method that consists of a guidance generation module and an inter-scale cost aggregation module. Our method not only upsamples the cost volume for reliable detail recovery, but also aggregates cross-scale information to improve the quality of disparity estimation. The experiments on Scene Flow, KITTI 2015 and Middlebury-v3 demonstrate that our method improves the performance of stereo matching over many existing state-of-the-art methods.
This work was supported in part by Natural Science Foundation of China (NSFC) under Grants No. 61773062 and No. 61702037.