Official Pytorch implementation of "StereoGAN: Bridging Synthetic-to-Real Domain Gap by Joint Optimization of Domain Translation and Stereo Matching" (CVPR'20)
Large-scale synthetic datasets are beneficial to stereo matching but usually introduce known domain bias. Although unsupervised image-to-image translation networks represented by CycleGAN show great potential in dealing with domain gap, it is non-trivial to generalize this method to stereo matching due to the problem of pixel distortion and stereo mismatch after translation. In this paper, we propose an end-to-end training framework with domain translation and stereo matching networks to tackle this challenge. First, joint optimization between domain translation and stereo matching networks in our end-to-end framework makes the former facilitate the latter one to the maximum extent. Second, this framework introduces two novel losses, i.e., bidirectional multi-scale feature re-projection loss and correlation consistency loss, to help translate all synthetic stereo images into realistic ones as well as maintain epipolar constraints. The effective combination of above two contributions leads to impressive stereo-consistent translation and disparity estimation accuracy. In addition, a mode seeking regularization term is added to endow the synthetic-to-real translation results with higher fine-grained diversity. Extensive experiments demonstrate the effectiveness of the proposed framework on bridging the synthetic-to-real domain gap on stereo matching.
With the fast development of deep neural networks [23, 12] and large-scale benchmarks [31, 13, 7], deep learning-based stereo matching methods have made great progress in the past decade [29, 19]. These methods, however, rely on a large quantity of high-quality left-right-disparity training data. Although the input images to the stereo matching networks (i.e., left and right images) are relatively easy to collect with stereo rigs in the real world, their corresponding ground-truth disparities are very difficult to obtain. Instead, researchers tend to create synthetic training datasets [29, 31, 13] with perfect disparities, which alleviates the demand for large quantities of training data. However, the non-negligible domain gap between synthetic and real data must be considered when generalizing to real domains. To mitigate this gap, some previous works [1, 40] train their models in two stages: the model is first trained on a synthetic dataset and then fine-tuned on a particular real dataset in either a supervised [30, 1, 11] or unsupervised manner [38, 39]. In this paper, we focus on the latter, a more challenging task with no ground truth for the real target-domain data.
Moreover, such online adaptation methods introduce extra computation compared to a single feed-forward pass, even though they strive to reduce the complexity of updating network parameters.
Recently, unsupervised image-to-image translation models achieved great success [47, 25, 24] and thus were adopted in domain adaptation methods for many applications such as semantic segmentation, person re-identification and object detection [15, 41, 32, 3]. However, it is non-trivial to generalize this series of methods to stereo matching. The middle row of Figure 1 reveals two main challenges for translation in stereo matching. 1) General image-to-image translation does not take epipolar constraints into consideration, which leads to inconsistent textures and thus ambiguity of disparity, as emphasized by the red circles. 2) It only attempts to transfer domain styles while neglecting the fact that its purpose should be serving the stereo matching network. For instance, since most of the background of our synthetic images is brown mountains while that of the real training images is blue sky, the vanilla CycleGAN regards this as domain style and tries to translate brown mountains into blue sky, as shown in the first two rows of Figure 1. This confuses the stereo matching network, because the sky contains far fewer useful textures for stereo matching than the mountains. In this paper, we successfully address these two challenges with properly designed stereo constraints and a joint training scheme. The intermediate image translation results are shown in the bottom row of Figure 1.
In particular, we propose an end-to-end deep learning framework consisting of domain translation and stereo matching networks to estimate stereo disparity in the target domain, using only source-domain synthetic stereo image pairs with ground-truth disparity and target-domain real stereo image pairs without any annotation. The stereo image translation is constrained by a novel bidirectional multi-scale feature re-projection loss and a correlation consistency loss. The former is realized by a multi-scale feature re-projection module: for the feature maps at each layer of the domain translation networks, the inverse warping of the right feature map according to the given disparity should be as close as possible to its corresponding left feature map. Both the ground-truth disparity for synthetic data and the estimated disparity for real data contribute to joint training in a bidirectional manner. We also introduce a correlation consistency loss to ensure that reconstructed stereo images maintain correlation feature maps, extracted from the stereo matching network, consistent with those of the original images.
In addition, we observed that real stereo pairs usually do not exactly match each other due to different camera configurations and settings. To this end, inspired by successful applications of using noise to manipulate images [18, 28], we propose a mode seeking regularization term to ensure fine-grained diversity in the synthetic-to-real translation, as shown in Figure 2. As circled in red, the local intensity varies between the left and right images, which simulates real data. With such augmentation, the domain translation makes stereo matching in the real domain more robust and effective.
In summary, our contributions are listed as follows:
We for the first time combine unsupervised domain translation with disparity estimation in an end-to-end framework to tackle the challenging problem of stereo matching in the absence of real ground-truth disparities.
We propose novel stereo constraints, including the bidirectional multi-scale feature re-projection loss and the correlation consistency loss, which better regularize this joint framework to achieve stereo-consistent translation and accurate stereo matching. An additional mode seeking regularization endows the synthetic-to-real translation with higher fine-grained diversity.
Extensive experiments demonstrate that our proposed model outperforms the state-of-the-art unsupervised adaptation approaches for stereo matching.
Stereo matching conventionally follows a four-step pipeline: matching cost computation, cost aggregation, disparity optimization and post-processing. Local descriptors such as the absolute difference (AD) and the sum of absolute differences (SAD) are usually adopted for measuring left-right inconsistency, so as to calculate matching costs for all possible disparities. Cost aggregation and disparity optimization are usually treated as a 2D graph partitioning problem, which can be optimized by graph cut or belief propagation [37, 21]. Semi-global matching (SGM) approximates the global optimization with dynamic programming.
Deep learning-based stereo matching methods have achieved great progress in the last decade due to the rise of deep neural networks [23, 12] and large-scale benchmarks [8, 7]. Among them, Zbontar and LeCun for the first time presented the computation of stereo matching costs with a deep Siamese network. Luo et al. accelerated the computation of matching costs by correlating unary features. Recently, many end-to-end neural networks were developed to directly predict whole disparity maps from stereo image pairs [29, 30, 34, 43, 19, 1, 44, 11]. Among them, DispNet is a pioneering work that for the first time used an end-to-end deep learning framework to directly regress disparity maps. The follow-up work GCNet introduces 3D convolutional networks to aggregate contextual information for obtaining better cost volumes.
Domain adaptation methods have shown great potential in filling the gap between synthetic and real domains. Previous works attempted to solve this problem by either learning domain-invariant representations [4, 5] or pushing the two domain distributions to be close [9, 42, 35, 36]. For example, the gap between source and target domains can be filled by matching the distributions [10, 26] or statistics [35, 36] of deep features.
Recently, unsupervised image-to-image translation models achieved great success under unpaired setting [47, 25, 24] and thus were applied as domain adaptation methods in many applications including semantic segmentation, person re-identification and object detection [15, 41, 32, 3].
In the field of stereo matching, unsupervised online adaptation has made great progress. These methods first train a disparity estimation network on synthetic data and then fine-tune it online using an unsupervised loss, such as a re-projection loss, while continuously accessing new stereo pairs from other domains [38, 40]. This unsupervised adaptation strategy has also been incorporated into a meta-learning framework.
Given a set of synthetic left-right-disparity tuples in the source domain and a set of real stereo images in the target domain without any ground-truth disparity, our goal is to learn an accurate disparity estimation network for the target domain.
For the sake of clear formulation, we define a paired set, from which we sample a paired stereo image, i.e., a left image and its corresponding right image (see Eqs. (4-7)), and an unpaired set, from which we can only sample a single left or right image (see Eqs. (1-2)).
Different from previous works that directly train stereo matching network with synthetic data [29, 19, 40], we propose a joint domain translation and stereo matching framework, which aims to translate synthetic-style stereo images into realistic ones with novel stereo constraints and thus better cooperate with the stereo matching network in an end-to-end manner, as shown in Figure 3.
Cycle-consistency domain translation loss. To help the synthetic-to-real translation network capture the global domain style of the real dataset, we adopt a real-domain discriminator whose goal is to distinguish synthetic-to-real generated images from real-domain images. Conversely, the generator learns to produce images that look similar to real-domain images so as to fool the real-domain discriminator. These two sub-nets constitute a minimax game that is optimized in an adversarial manner and reaches the optimum when the discriminator cannot tell whether images are generated or not. The adversarial loss for synthetic-to-real generation is formulated as:
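The equation itself was lost in extraction; it is the standard CycleGAN-style adversarial objective. In illustrative notation (our symbols: $G_{s\to r}$ the synthetic-to-real generator, $D_r$ the real-domain discriminator, $x_s$ and $x_r$ single synthetic and real images):

```latex
\mathcal{L}_{adv}^{s\to r} \;=\;
  \mathbb{E}_{x_r \sim X_r}\big[\log D_r(x_r)\big]
  + \mathbb{E}_{x_s \sim X_s}\big[\log\big(1 - D_r(G_{s\to r}(x_s))\big)\big]
```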
where each real image is sampled from the unpaired real-domain set. We also introduce a similar adversarial loss to supervise the real-to-synthetic generation.
Adversarial losses only supervise the generators to produce images that the domain discriminators cannot distinguish; without further constraints, the generators could map inputs to any random permutation of outputs in the target domain. To regularize the two generators toward one-to-one mappings, the cycle consistency loss is also adopted,
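The cycle consistency term is likewise the standard one from CycleGAN; in the same illustrative notation:

```latex
\mathcal{L}_{cyc} \;=\;
  \mathbb{E}_{x_s}\big[\lVert G_{r\to s}(G_{s\to r}(x_s)) - x_s \rVert_1\big]
  + \mathbb{E}_{x_r}\big[\lVert G_{s\to r}(G_{r\to s}(x_r)) - x_r \rVert_1\big]
```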
To sum up, the cycle-consistency domain translation loss, following CycleGAN, is defined as the sum of the two adversarial losses and the cycle consistency loss.
Stereo matching loss. Since our goal is to learn a mapping from real-domain stereo images to disparity maps with only annotated synthetic stereo images and unlabeled real ones, it is straightforward to take advantage of the synthetic-to-real translation results. Given a paired synthetic tuple of left image, right image and ground-truth disparity, we argue that the translated stereo pair can be regarded as a real-domain pair and should still match its ground-truth disparity. Therefore, we formulate the stereo matching loss as:
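The lost equation can be reconstructed in illustrative notation (our symbols: $F$ the stereo matching network, $(x_l^{s}, x_r^{s}, d^{s})$ a synthetic left-right-disparity tuple), assuming a DispNet-style L1 disparity regression:

```latex
\mathcal{L}_{st} \;=\;
  \mathbb{E}_{(x_l^{s},\, x_r^{s},\, d^{s})}
  \Big[\, \big\lVert F\big(G_{s\to r}(x_l^{s}),\, G_{s\to r}(x_r^{s})\big) - d^{s} \big\rVert_1 \,\Big]
```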
Here the stereo matching network, which estimates disparities from real-domain stereo images, is applied to the translated pair.
These two losses construct a simple framework that optimizes stereo matching network with the assistance of domain translation networks. However, it may introduce the problem of pixel distortion and stereo mismatch during translation.
To tackle the above challenges, we must ensure that the domain translation networks transfer only the global domain style while maintaining epipolar consistency, which contributes to the improvement of stereo matching. To achieve this, we propose a joint optimization scheme between domain translation and stereo matching with novel constraints.
Before diving into these constraints, we first introduce our newly-proposed multi-scale feature re-projection module, which establishes a bidirectional connection between the domain translation component and the stereo matching component through a left-right consistency check, as illustrated in Figure 4. For each intermediate layer of the domain translation networks, the inversely warped right feature map should match its corresponding left feature map. This inverse warping is performed with a properly downsampled disparity map using the differentiable bilinear sampling technique. Note that the given disparity can be either the ground-truth one for synthetic stereo or the estimated one for real stereo, yielding a feature re-projection loss for synthetic and for real stereo images, respectively. The former endows the domain translation networks with strong epipolar constraints, while the latter provides extra supervision for training the stereo matching network.
Feature re-projection loss for synthetic images. We argue that the intermediate feature maps for generating the domain-translated left and right images should be the same at corresponding 3D physical locations. To model this constraint, we utilize the synthetic ground-truth disparity to warp the intermediate feature maps of both generators along the synthetic-real-synthetic cycle translation. If the stereo image pair is well translated, the inversely warped right feature map should match the left feature map exactly. The feature re-projection loss for synthetic images is formulated as
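The equation was lost; it can be reconstructed in illustrative notation (our symbols: $f_i^{l}$, $f_i^{r}$ the $i$-th layer left and right feature maps inside the translation networks, $\mathcal{W}$ the inverse warping operator, $d_i^{s}$ the ground-truth disparity downsampled and rescaled to the resolution of layer $i$, $N$ the number of layers):

```latex
\mathcal{L}_{fr}^{syn} \;=\; \sum_{i=1}^{N}
  \big\lVert\, f_i^{l} - \mathcal{W}\!\big(f_i^{r},\, d_i^{s}\big) \,\big\rVert_1
```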
where the sum runs over all layers of the translation networks: at each layer, the inverse warping function warps the right feature map with the ground-truth disparity, appropriately downsampled, and the result is compared against the left feature map.
Feature re-projection loss for real images. A general stereo matching network such as DispNet naturally outputs multi-scale disparities, which are formed from correlation features at different network layers. These multi-scale disparity maps can be used to warp the intermediate feature maps of both generators along the real-synthetic-real cycle translation. The distance between the left feature and the inversely warped right feature then provides extra supervision for updating the parameters of the disparity estimation network. This loss can be formulated as
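In the same illustrative notation as above, but with the network's own estimate $\hat{d}_i$ (the predicted disparity at the resolution of layer $i$) replacing the ground truth:

```latex
\mathcal{L}_{fr}^{real} \;=\; \sum_{i=1}^{N}
  \big\lVert\, f_i^{l} - \mathcal{W}\!\big(f_i^{r},\, \hat{d}_i\big) \,\big\rVert_1
```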
where the disparity used for warping is the estimate produced by the stereo matching network from the real stereo pair.
Different from previous works which directly warp images at the original scale [6, 46], our warping operation is based on multi-scale feature maps. Since features at different layers model image structures of different scales, this constraint helps supervise the training of the stereo matching network at multiple scales (from global to local regions), leading to an impressive improvement in disparity estimation accuracy. In addition, it leaves some room for fine-grained pixel-level noise modeling (see Figure 2), which is introduced in the mode seeking regularization term described later in this section.
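The inverse warping used by both re-projection losses can be sketched in PyTorch with `grid_sample`. This is an illustrative reimplementation, not the repository's actual code; the names `inverse_warp` and `reprojection_loss` are ours:

```python
import torch
import torch.nn.functional as F

def inverse_warp(feat_right, disparity):
    """Warp a right-view feature map into the left view via bilinear
    sampling. feat_right: (B, C, H, W); disparity: (B, H, W) in pixels."""
    b, c, h, w = feat_right.shape
    # Base pixel-coordinate grid; shift x-coordinates left by the disparity.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.float().unsqueeze(0).expand(b, -1, -1) - disparity
    ys = ys.float().unsqueeze(0).expand(b, -1, -1)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
    return F.grid_sample(feat_right, grid, align_corners=True)

def reprojection_loss(feats_left, feats_right, disparity):
    """Sum of L1 distances between left features and inversely warped
    right features over a list of multi-scale feature maps."""
    loss = 0.0
    for fl, fr in zip(feats_left, feats_right):
        # Downsample and rescale the disparity to this layer's resolution.
        scale = fl.shape[-1] / disparity.shape[-1]
        d = F.interpolate(disparity.unsqueeze(1), size=fl.shape[-2:],
                          mode="bilinear", align_corners=True).squeeze(1) * scale
        loss = loss + (fl - inverse_warp(fr, d)).abs().mean()
    return loss
```

With a zero disparity map the warp reduces to the identity, which is a convenient sanity check for the grid normalization.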
Correlation consistency loss. The feature re-projection losses may not fully resolve the stereo-mismatch issue. Since there is no ground-truth disparity for real-domain stereo images, warping features with estimated disparity may introduce bias into the joint framework: the feature re-projection loss for a certain left-right-disparity tuple can be low while still having a limited, or even negative, effect on stereo matching, because pixel distortion during domain translation and inaccurate estimation during stereo matching can occur simultaneously.
To reduce such impact, the stereo matching network is also used to supervise both generators along the real-synthetic-real cycle translation. For ease of presentation, we refer to the real images reconstructed by such cycle translation as the reconstructed pair. Given a pair of real stereo images, we obtain their reconstructed counterparts, and the correlation features of the original pair, extracted from each layer of the stereo matching network, should match those of the reconstructed pair. In addition, we construct a tighter cross-pair loss by also pushing the correlation features of mixed pairs, an original image paired with a reconstructed one, to be close to those of the original pair. Therefore, we formulate this constraint for real-domain images as the correlation consistency loss between multi-layer correlation features:
where the loss is computed over all correlation aggregation layers of the stereo matching network, i.e., the layers that follow the individual image feature encoding layers.
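The correlation features referenced here come from a DispNet-style 1D correlation layer. The following is a hedged sketch; the names `correlation` and `correlation_consistency` are ours, and the real network aggregates these features through further convolutions:

```python
import torch

def correlation(feat_left, feat_right, max_disp=4):
    """1D correlation volume along the horizontal epipolar line, in the
    spirit of DispNetC. Returns a tensor of shape (B, max_disp + 1, H, W)."""
    b, c, h, w = feat_left.shape
    vols = []
    for d in range(max_disp + 1):
        corr = torch.zeros(b, h, w)
        if d == 0:
            corr = (feat_left * feat_right).mean(dim=1)
        else:
            # Correlate left features with right features shifted by d pixels.
            corr[:, :, d:] = (feat_left[:, :, :, d:] *
                              feat_right[:, :, :, :-d]).mean(dim=1)
        vols.append(corr)
    return torch.stack(vols, dim=1)

def correlation_consistency(corrs_orig, corrs_rec):
    """L1 distance between correlation features of the original and
    cycle-reconstructed stereo pairs, summed over layers."""
    loss = 0.0
    for co, cr in zip(corrs_orig, corrs_rec):
        loss = loss + (co - cr).abs().mean()
    return loss
```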
[Table 1 header: Dataset, Method, D1-all (%), EPE, >2px (%), >4px (%), >5px (%), Time]
Mode seeking loss. The above losses can well maintain the stereo consistency of the domain-translated images. In practice, however, real stereo images also show slight variations between the left and right views because of sensor noise, different camera configurations, etc. To model such left-right variations, we adopt a mode seeking regularization term, following prior work, to make the generators create small but realistic variations between the generated left and right images, as demonstrated in Figure 2. A Gaussian random map is fed into the synthetic-to-real translation network to model the variations of the generated images. When training the domain translation networks, we attempt to maximize the distance between two outputs generated from the same original image with two different random maps.
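The lost equation follows the standard mode seeking regularization; in illustrative notation (our symbols: $d_I$ and $d_z$ distance metrics in image space and noise space, $z_1$, $z_2$ the two random maps), the generator is trained to maximize the ratio

```latex
\mathcal{L}_{ms} \;=\;
  \frac{d_I\big(G_{s\to r}(x,\, z_1),\, G_{s\to r}(x,\, z_2)\big)}
       {d_z(z_1,\, z_2)}
```

so that distinct noise inputs produce measurably distinct translations; in practice this is often implemented by minimizing the reciprocal of the ratio.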
Putting all the losses introduced above into an overall objective function, we obtain
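The overall objective was lost in extraction; a reconstruction in our illustrative notation (the exact grouping of trade-off weights is an assumption, and the mode seeking term enters with the sign that maximizes diversity) is:

```latex
\mathcal{L} \;=\; \mathcal{L}_{adv}^{s\to r} + \mathcal{L}_{adv}^{r\to s}
  + \lambda_1 \mathcal{L}_{cyc} + \lambda_2 \mathcal{L}_{st}
  + \lambda_3 \mathcal{L}_{fr}^{syn} + \lambda_4 \mathcal{L}_{fr}^{real}
  + \lambda_5 \mathcal{L}_{cc} + \lambda_6 \mathcal{L}_{ms},
\qquad
\min_{G,\, F} \; \max_{D} \; \mathcal{L}
```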
where the trade-off weights balance the relative importance of the different objectives; we discuss the effectiveness of each objective in Section 4 via an ablation study. Our final goal is to solve the resulting minimax optimization problem: the generators and the stereo matching network minimize the full objective while the discriminators maximize the adversarial terms.
Network and training. We adopt the generator and discriminator architectures from CycleGAN with a patch discriminator, and take DispNet as our stereo matching network. We implement this method in PyTorch. Training is partitioned into two stages. In the warm-up stage, we first train the domain translation networks with only the domain translation losses, and then train the stereo matching network with only the stereo matching loss, using the Adam optimizer in both cases. In the second stage, we train the two components together in an end-to-end manner with the hyper-parameters unchanged, alternately optimizing the domain translation networks and the stereo matching network with the full objective. The trade-off factors are set empirically.
Datasets. We use three datasets to verify the effectiveness of the proposed method: two synthetic and one real. The first is Driving, a subset of the large synthetic SceneFlow dataset, which depicts a virtual-world car-driving scene; it contains fast and slow sequences with both forward-driving and backward-driving scenes. The second is Synthia-SF, which contains sequences featuring different scenarios and traffic conditions, with associated ground-truth disparity maps; its disparity range is similar to that of Driving. The last, real dataset is KITTI2015, containing 200 training images collected in real scenarios. Due to the inconsistency of object size between Synthia-SF and KITTI2015, we resize all Synthia-SF images to half resolution and halve the corresponding disparity values accordingly.
Evaluation metrics. We verify the effectiveness of our proposed method with the following metrics. End-point error (EPE) is the mean average disparity error in pixels. D1-all is the percentage of pixels whose absolute disparity error is larger than 3 pixels and 5% of the ground-truth disparity value. Percentages of erroneous pixels with errors larger than 2, 4 and 5 pixels are also reported. All metrics are calculated for both non-occluded (Noc) and all (All) pixels. The inference time on a single TITAN-X GPU is also recorded.
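These metrics are simple to compute; the sketch below assumes dense ground truth with no occlusion mask and uses the KITTI convention for D1-all (error exceeding both 3 px and 5% of the ground-truth value). Function names are ours:

```python
import numpy as np

def epe(d_est, d_gt):
    """End-point error: mean absolute disparity error in pixels."""
    return float(np.abs(d_est - d_gt).mean())

def d1_all(d_est, d_gt):
    """D1-all outlier rate (%): a pixel is erroneous if its absolute
    error exceeds 3 px AND 5% of the ground-truth disparity."""
    err = np.abs(d_est - d_gt)
    outliers = (err > 3.0) & (err > 0.05 * d_gt)
    return float(100.0 * outliers.mean())

def bad_px(d_est, d_gt, tau):
    """Percentage of pixels with absolute error larger than tau pixels
    (the >2px / >4px / >5px columns in the tables)."""
    return float(100.0 * (np.abs(d_est - d_gt) > tau).mean())
```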
[Table 2 header: Dataset, Ablation, D1-all (%), EPE, >2px (%), >4px (%), >5px (%)]
We first investigate whether the proposed method is superior to related methods; the results are summarized in Table 1. We take two synthetic datasets, Synthia and Driving, as source domains and one real dataset, KITTI2015, as the target domain. A baseline without domain translation, dubbed Inference, trains the stereo matching network on synthetic data and then directly predicts disparity maps on real data. Two state-of-the-art unsupervised adaptation methods for stereo matching are compared: SL+Ad denotes the unsupervised online adaptation method, and L2A+Wad denotes unsupervised adaptation via a meta-learning framework. Moreover, since no stereo matching-specific domain adaptation technique has been developed, we choose CycleGAN as our baseline for comparison. For fair comparison, the stereo matching network of all methods is set to DispNet.
As can be seen from Table 1, all methods perform better on Synthia-to-KITTI2015 than on Driving-to-KITTI2015 because there is a larger gap between Driving and KITTI2015. Among these methods, Inference performs worst due to the inherent gap between the synthetic and real domains. SL+Ad updates the stereo matching network by computing the error between the inversely-warped left image and the real left image when accessing new stereo images. L2A+Wad proposes a weight confidence-guided adaptation technique and updates the network in a meta-learning manner. These two methods mitigate the domain gap to some extent but bring extra computation burden to the inference process, increasing inference time over the original feed-forward network. The translation results of CycleGAN suffer from pixel distortion, as discussed in Section 1, so it does not perform well enough. The proposed joint domain translation and stereo matching framework, with its novel stereo constraints, beats all the above methods by reducing the number of erroneous pixels considerably. The significant improvements on all evaluation metrics demonstrate the superiority of our method. In addition, the inference time of our method is the same as that of the original DispNet because all extra domain translation and auxiliary training is completed offline.
We then quantitatively investigate how each objective term influences the performance of unsupervised stereo matching via an ablation study. Besides the cycle domain translation loss and the stereo matching loss, we propose four novel objectives for regularizing the basic formulation: the correlation consistency loss, the mode seeking loss, and the feature re-projection losses for real and synthetic stereo. We train our joint framework with one of them removed and record the corresponding D1-all, EPE, and bad-pixel percentages at thresholds of 2, 4 and 5 pixels, as summarized in Table 2. The ablation results on both the Synthia and Driving source datasets show a similar trend: in general, the feature re-projection losses for synthetic and real stereo are more effective than the correlation consistency loss and the mode seeking loss. We analyze the reasons in the following.
First of all, among the four proposed objectives, the feature re-projection loss for synthetic stereo is the most effective in our joint framework, for two reasons: 1) it ensures that translated outputs are stereo-consistent with their inputs, which is vital to the stereo matching loss in the presence of a large amount of accurate disparities; 2) through well-learned translation networks, it in turn benefits the training of the stereo matching network with the feature re-projection loss for real stereo.
The feature re-projection loss for real stereo is the runner-up, because it provides extra training signals for the stereo matching network. However, these supervision signals are obtained by warping features inside the domain translation networks, so its performance depends heavily on how well the domain translation networks are trained.
Thirdly, the correlation consistency loss contributes to this framework marginally in the presence of the feature re-projection losses, serving as their complement. As analyzed above, the feature re-projection loss for real stereo images usually benefits from well-trained translation networks. However, this loss may occasionally be low even when pixel distortion in translation and inaccurate estimation in stereo matching occur simultaneously; the correlation consistency loss helps precisely in such cases.
Finally, D1-all results drop slightly without the mode seeking loss, because this loss provides fine-grained diversity in the translated results and essentially helps the stereo matching network learn more robust disparity estimation. In other words, the stereo matching network learns to reduce the influence of various noise and lighting conditions during training.
Thanks to the integration of all four objectives described in Equation 9, we obtain great improvement in bridging the synthetic-to-real gap in stereo matching.
In this part, we show how the structure of the stereo matching network influences the performance of our joint domain translation and stereo matching framework. We compare DispNet with a recently-proposed state-of-the-art stereo matching model, GwcNet. Their D1-all and EPE scores and inference times are reported in Table 3. As can be seen, GwcNet performs far better than DispNet on both datasets and all evaluation metrics. When using Synthia as the synthetic training data, our proposed model considerably reduces D1-all and EPE for both DispNet and GwcNet. For the Driving training data, whose domain gap to KITTI2015 is larger, our method likewise reduces D1-all and EPE for both networks, yielding very competitive performance.
To demonstrate the generalization capability of a stereo matching network trained in our joint optimization framework, we test its performance on two other real datasets, KITTI2012 and Cityscapes, with results summarized in Table 4. Images in KITTI2012 have a very similar domain style to those in KITTI2015 due to their similar camera settings; therefore, the performance gain from domain translation on KITTI2012 is similar to that on KITTI2015. On the Cityscapes real dataset, both D1-all and EPE are almost halved. These significant improvements demonstrate the strong generalization capability of our proposed joint framework.
In this paper we propose a novel end-to-end framework that trains domain translation networks and stereo matching network jointly. The newly-introduced stereo constraints including correlation consistency loss, bi-directional multi-scale feature re-projection loss and mode seeking loss regularize this joint framework to achieve better performance on stereo matching without ground-truth. The experimental results testify the effectiveness of our proposed framework in bridging the synthetic-to-real domain gap.
Our proposed framework successfully mitigates the gap between the synthetic and real domains; however, other gaps, e.g., in camera intrinsics and disparity distribution, may still exist between real-domain stereo images and translated-real stereo images, although they are not pronounced in our experimental datasets. Further study is required to strengthen the generalization capability of our framework on such datasets.
Acknowledgement. This work is supported in part by SenseTime Group Limited, and in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616, CUHK14207319, CUHK14208619, and in part by Research Impact Fund R5001-18.
The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, pages 1180-1189, 2015.
Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.