GANs have demonstrated a great ability to learn image-to-image (i2i) translation from paired or unpaired images [39, 12] in different domains (e.g. summer/winter, clear/rainy, etc.). The latter relies on cycle-consistency or style/content disentanglement to learn complex mappings in an unsupervised manner, producing respectively a single translation of the source image or a multimodal translation. Unsupervised i2i translation has opened a wide range of applications, especially for autonomous driving, where it may be virtually impossible to acquire the same scene in different domains (e.g. clear/rainy) due to the dynamic nature of the scenes.
Instead, image-to-image translation can be used to generate realistic synthetic images exploitable for both domain adaptation and performance evaluation [9, 16], without additional human-labeling effort. However, state-of-the-art i2i translation fails in some situations. For example, when translating clear to rain, networks tend to change only the global appearance of the scene (wetness, puddles, etc.), ignoring essential traits such as drops on the windshield or reflections. Ultimately, this greatly impacts the realism of the generated images.
In this paper, we present a simple domain-bridging technique (Fig. 1(b)) which, as opposed to standard i2i translation (Fig. 1(a)), benefits from additional sub-domains retrieved automatically from web-crawled data. Our method produces qualitatively better results, especially when the source and target domains are known to be far apart, since the bridge eases the learning of the mapping. We apply our i2i methodology to the clear→rainy case, showing that domain bridging leads the translation to preserve drops in the synthetic images. We then extend our work to Unsupervised Domain Adaptation (UDA), for which we also make novel contributions (Fig. 2), and demonstrate that, all together, we perform on par with the most recent UDA methods while being much simpler.
We make three main contributions in our paper:
2 Related work
Early work on image-to-image translation proposed an adversarial method whose training required paired samples of the same scene in two different domains. Cycle consistency was later exploited to perform image-to-image translation on unpaired images. Another approach supposes the existence of a shared latent space between images in two domains, and exploits it to perform translations across both domains using a single GAN. Recently, much effort has been dedicated to multimodal translation [40, 12, 18]. Other works instead make use of additional information, such as bounding boxes, instance segmentation maps, or semantic segmentation and depth maps, to increase translation quality and diversity.
Synthetic rain modeling.
Synthetic rain models can be used as an additional tool to generate rainy images. Earlier work conducted an extensive study on drop oscillation modeling. Most modern approaches are based on rain streak or raindrop modeling. For raindrops, a GAN-based approach has been recently presented; however, its training requires images collected by specialized hardware. For rain streaks, one of the most recent approaches provides realistic rendering using a model composed of rain and fog layers. A different method introduces physics-based rendering and uses it to perform data augmentation for different tasks. Nonetheless, all of these focus only on rain appearance and do not address other visual characteristics of rainy images, such as reflections. Other rendering-based methods are available [5, 32, 28], but they are strictly dependent on a complete 3D representation of the environment, limiting their applicability to real road scenes.
Domain adaptation for semantic segmentation.
Most methods for domain adaptation are based on adversarial training, as it regularizes the feature extraction process, making it robust to the domain shift [10, 2, 20, 30, 36, 37, 33, 19, 26]. Complementary approaches instead connect the two domains with pixel-level regularization, making use of image-to-image translation GANs [22, 25]. Some recent works combine the two to obtain better results [9, 16]. Others do not use adversarial training at all: for example, Zou et al. exploit self-supervised learning and pseudo-labels only, while others make use of mutual learning. Finally, some approaches specialized in adverse weather and night adaptation have been recently introduced [27, 6].
Our methodology aims to translate clear images to rainy images, producing images that are both of high quality for qualitative evaluation and usable to train semantic segmentation networks in rainy weather. Thus, our innovations are spread between image-to-image translation (Sec. 3.1) and Unsupervised Domain Adaptation for semantic segmentation (Sec. 3.2).
3.1 Image-to-image translation (i2i)
Image-to-image translation GANs learn to approximate the mapping between two domains using adversarial training, from two sets of representative images, one per domain. Each image in either set can be interpreted as a sample from a probability distribution associated with its image domain. GANs are well known for their training instability. For training to succeed, the network needs to be fed with representative sets of images, so that it can extract the common domain characteristics. Even so, some domain gaps may be difficult for the network to model, resulting in a loss of characteristic features of the target domain. This may be caused by a significant domain shift or by a lack of data.
Some minor image details still have a significant perceptual impact. This is the case for rain images, where even a few drops on the lens play an important role in conveying rain, as can be seen in Fig. 3. State-of-the-art networks may ignore some fundamental elements such as drops, lens artifacts or reflections, and this ultimately impacts the realism of generated images. We argue that some characteristic changes (drops, artifacts, etc.) are ignored because they are relatively minor compared to others (e.g. wetness, puddles, etc.), and demonstrate that training may benefit from bridging to ease domain mapping.
Studying the specific case of adverse weather conditions, it is possible to formalize a generic domain as the union of finer-grained sub-domains. Some of these sub-domains are typical of weather, e.g. the presence of precipitation, road wetness, and many more. The others are unrelated to weather; examples are the scenario, the city and the illumination. Both the source and the target domain can be represented as such a union.
Generally, only the joint probability distribution over these sub-domains is estimable, as we have no knowledge of the marginal probability distributions.
On the one hand, we hypothesize that it is possible to obtain a more stable image-to-image translation if the differences between the two datasets are minimized. On the other hand, we need a GAN that produces an effective transformation, so it is necessary to correctly model all relevant features of adverse weather.
To reach both objectives simultaneously, images collected from web-crawled videos are added to the source and target datasets, obtaining two new training sets whose respective domains aim at bridging the gap between the initial source and target domains.
This is illustrated in Fig. 1.
Our intuition is that adding samples selected with reasonable criteria will reduce the Kullback-Leibler divergence between the probability distributions of the two bridged domains, w.r.t. that between the original domains. As a consequence, the translation model will be more focused on weather-related characteristics and more stable during training.
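A toy numerical illustration of this intuition, assuming each domain is summarized by a discrete distribution over a few weather-unrelated "setups". All numbers and names (`kl_divergence`, `mix`, the three distributions, the 0.8 bridge fraction) are hypothetical and chosen only for illustration; they are not measurements from the datasets.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def mix(p, bridge, bridge_fraction):
    """Distribution of a dataset in which `bridge_fraction` of the samples come from `bridge`."""
    return (1.0 - bridge_fraction) * np.asarray(p) + bridge_fraction * np.asarray(bridge)

# Hypothetical distributions over three setups: source city, target city, web-video setup.
p_source = np.array([0.90, 0.05, 0.05])
p_target = np.array([0.05, 0.90, 0.05])
p_bridge = np.array([0.05, 0.05, 0.90])  # shared by both bridge sets

kl_original = kl_divergence(p_source, p_target)
kl_bridged = kl_divergence(mix(p_source, p_bridge, 0.8), mix(p_target, p_bridge, 0.8))
print(f"KL original: {kl_original:.3f}, KL bridged: {kl_bridged:.3f}")
```

Because both datasets receive samples from the same bridge distribution, the mixtures are closer to each other than the original distributions are, which is the effect the bridge is meant to have on the weather-unrelated part of the domains.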
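Once the main hypotheses are formalized, a procedure for selecting new images is needed. Consider two additional image sets, one for each weather condition. As before, each of them decomposes into weather-related and weather-unrelated sub-domains.
We choose the two sets so that their weather-related sub-domains match those of the corresponding original dataset while their weather-unrelated sub-domains are shared, subject to a condition on the set cardinalities.
It is now possible to train the image-to-image translation network on two bridged training sets, each defined as the union of an original dataset and its additional image set.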
By adding new images, the difference in the global appearance of the two domains is minimized, while the weather-related domain shift remains constant.
In other words, our approach consists in selecting new image samples, with weather conditions corresponding to those in the original dataset, and joining them to the existing data. The newly added images are required to share some domains unrelated to weather. Retrieving images from the same location and with the same setup ensures this.
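One compact way to write this selection, using notation that is assumed here purely for illustration: $X$ and $Y$ are the source and target domains, subscripts $W$ and $N$ denote weather-related and weather-unrelated sub-domains, and $X_B$, $Y_B$ are the domains of the two web-crawled bridge sets.

```latex
% Illustrative notation only; every symbol is an assumption of this sketch.
\begin{align}
X &= \{X_W, X_N\}, & Y &= \{Y_W, Y_N\}, \\
X_B &= \{X_W, B_N\}, & Y_B &= \{Y_W, B_N\}, \\
X' &= X \cup X_B, & Y' &= Y \cup Y_B .
\end{align}
```

Both bridge sets share the same weather-unrelated sub-domains $B_N$ (same location and setup), so the bridged sets $X'$ and $Y'$ differ mostly in their weather-related part.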
In practice, we retrieved these additional samples from web-crawled videos using domain-related keyword search (details in Sec. 4.1.1).
The same bridging can be applied automatically to other domain shifts, though as the domain differences become less semantically evident, human expertise may be required to properly select the additional datasets.
We use MUNIT as the backbone of our image-to-image translation network, as it allows disentanglement of style and content, which is of high interest for us.
3.2 Unsupervised Domain Adaptation (UDA)
Similar to previous works [9, 16], we use our i2i network with domain bridge to translate images from pre-labeled clear-weather datasets and learn semantic segmentation in rain in an unsupervised fashion. We follow the standard UDA practice, which is to alternately train in a supervised manner on source (clear) images with labels and in a self-supervised manner on target (rain) images without labels. Our entire UDA methodology (depicted in Fig. 2) brings two novel contributions: a) we use multimodal clear→rain translations as additional supervised learning, b) we introduce Weighted Pseudo Label, a differentiable extension of Pseudo Label, to align source and target without any offline process.
Online Multimodal Style-sampling (OMS).
We propose to use the multimodal capacity of the MUNIT i2i network to generate multiple target styles (i.e. rain appearances) for each input image. Styles are sampled at training time. In this way, even if the source image content remains unaltered, the segmentation network can learn different representations of the same scene in the target domain, ultimately leading to wider diversity and thus more robust detection. Different styles for the same image modify, among other factors, the position and size of drops on the windshield and the intensity of reflections. This is visible in Fig. 4, which shows three arbitrarily sampled styles, where Style 3 consistently produces images that resemble heavy rain.
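A minimal sketch of online style sampling, assuming a standard normal prior over style codes and a hypothetical `translate(image, style)` callable standing in for the trained i2i generator. The names `STYLE_DIM`, `sample_style`, `oms_batch` and `dummy_translate`, as well as the style size of 8, are introduced here for illustration only.

```python
import numpy as np

STYLE_DIM = 8  # assumed style-code size (hypothetical choice for this sketch)

def sample_style(rng, style_dim=STYLE_DIM):
    """Draw one style code from a standard normal prior."""
    return rng.standard_normal(style_dim)

def oms_batch(images, labels, translate, rng):
    """Translate each source image with a freshly sampled style.

    `translate(image, style)` is a hypothetical stand-in for the i2i generator.
    Labels are reused unchanged because translation preserves scene content.
    """
    translated = [translate(img, sample_style(rng)) for img in images]
    return translated, labels

# Dummy generator: shifts brightness by the first style value.
def dummy_translate(image, style):
    return image + style[0]

rng = np.random.default_rng(0)
images = [np.zeros((4, 4, 3)) for _ in range(2)]
labels = [np.zeros((4, 4), dtype=int) for _ in range(2)]
pass_a, labels_a = oms_batch(images, labels, dummy_translate, rng)
pass_b, labels_b = oms_batch(images, labels, dummy_translate, rng)
```

Calling `oms_batch` twice on the same images yields different translations with identical labels, which is what exposes the segmentation network to several rain appearances of one scene.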
Weighted Pseudo-Labels (WPL).
Pseudo Labels were proposed as a self-supervised loss to further align source and target distributions. The principle is to self-train a network on the target domain (here, rain) whenever the prediction confidence is above some threshold, thus reinforcing the network's beliefs.
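The thresholding principle can be sketched as follows, assuming the network outputs per-pixel softmax probabilities. The function name `hard_pseudo_labels` and the ignore value 255 are conventions assumed for this sketch.

```python
import numpy as np

IGNORE = 255  # assumed "ignore" label for pixels excluded from self-training

def hard_pseudo_labels(probs, threshold):
    """Return (H, W) pseudo-labels from a (C, H, W) softmax output.

    Pixels whose best-class probability does not exceed `threshold` get IGNORE.
    """
    confidence = probs.max(axis=0)
    labels = probs.argmax(axis=0)
    return np.where(confidence > threshold, labels, IGNORE)

# Two classes, a 1x2 image: the first pixel is confident, the second is not.
probs = np.array([[[0.9, 0.6]], [[0.1, 0.4]]])
pseudo = hard_pseudo_labels(probs, threshold=0.8)
```

Only the first pixel is kept for self-training here; the second falls below the threshold and is ignored by the loss.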
Most often for UDA, thresholds are calculated offline as the median per-class confidence over the whole dataset [16, 41]. This requires storing all predictions for the whole dataset, which is cumbersome. To overcome this, thresholds may be estimated online, image-wise or batch-wise. Thresholds are critical, since pseudo-labeling can harm global performance if thresholds are underestimated (in that case, the ratio of wrong pixels over pseudo-labeled pixels is too high and leads to incorrect self-supervision), or have limited impact if they are overestimated.
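A sketch of the offline variant, assuming the threshold of a class is the median confidence of all pixels predicted as that class. The helper name `per_class_median_thresholds` is hypothetical. Note that every confidence value is kept in memory until the end, which is the storage cost mentioned above.

```python
import numpy as np

def per_class_median_thresholds(prob_maps, num_classes):
    """Median confidence of the pixels predicted as each class, over a whole dataset.

    prob_maps: iterable of (C, H, W) softmax outputs. Classes never predicted get NaN.
    """
    confidences = [[] for _ in range(num_classes)]
    for probs in prob_maps:
        labels = probs.argmax(axis=0)
        conf = probs.max(axis=0)
        for c in range(num_classes):
            confidences[c].append(conf[labels == c])
    return np.array([
        np.median(np.concatenate(c)) if sum(len(a) for a in c) else np.nan
        for c in confidences
    ])

# One 1x3 prediction map with two classes: predicted labels are [0, 0, 1].
dataset = [np.array([[[0.9, 0.6, 0.2]], [[0.1, 0.4, 0.8]]])]
thresholds = per_class_median_thresholds(dataset, num_classes=2)
```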
We instead propose Weighted Pseudo-Labels (WPL), which estimates a global threshold within the network optimization process. The general principle of WPL is to weight the self-supervised cross-entropy using a learned threshold, thus acting as continuous pseudo-labeling. Not only does WPL require no offline processing, but it is also aware of the global network confidence, thus leading to better results. In detail, the pseudo-label of each pixel of an input image is the best class prediction, taken from the class probabilities predicted for that pixel by the segmentation network.
In its original implementation, this pseudo-label is directly used to weight the cross-entropy self-supervision. Instead, we weight it with a weight matrix of the same size as the input image.
The complete loss for WPL is thus defined as the weighted sum of a cross-entropy loss and a balancing loss, each with its own loss weight.
The cross-entropy term uses the one-hot encoding of the pseudo-label (Eq. 5) over the set of classes. In this way, predictions where the network is uncertain are weighted less in the pseudo-label based training. To prevent the optimizer from keeping the self-supervised contribution at zero, the balancing loss is required.
The optimization of the threshold leads to a pseudo-label expansion within the training process. Fig. 5 illustrates this growing process during training. In the first iterations (Fig. 5, left), the cross-entropy term prevails over the balancing term, pushing the threshold towards 1 and thus including in the pseudo-label only pixels with high confidence. As the cross-entropy term is minimized (Fig. 5, right), the balancing term becomes gradually more important, leading the network to simultaneously include lower-confidence pixels in the pseudo-label and increase the informative potential of higher-confidence labels. Note that for numerical stability, we assume and estimate .
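A minimal forward-pass sketch of this idea, under two assumptions that are made here for illustration and are not taken from the text: the weight of a pixel is max(confidence - alpha, 0), where alpha is the learned threshold, and the balancing loss is alpha squared. The names `wpl_loss`, `alpha`, `gamma` and `delta` are likewise hypothetical.

```python
import numpy as np

def wpl_loss(probs, alpha, gamma=1.0, delta=1.0, eps=1e-8):
    """Weighted pseudo-label loss (forward pass only) under the stated assumptions.

    probs: (C, H, W) softmax output; alpha: scalar threshold in [0, 1).
    Returns the scalar loss and the per-pixel weight matrix.
    """
    confidence = probs.max(axis=0)                 # best-class probability per pixel
    weights = np.maximum(confidence - alpha, 0.0)  # assumed weight matrix, one value per pixel
    ce = -np.log(confidence + eps)                 # cross-entropy against the arg-max pseudo-label
    weighted_ce = float((weights * ce).mean())
    balance = float(alpha ** 2)                    # assumed balancing term, grows with alpha
    return gamma * weighted_ce + delta * balance, weights

probs = np.array([[[0.9, 0.6]], [[0.1, 0.4]]])     # 2 classes, 1x2 image
loss_high, weights_high = wpl_loss(probs, alpha=0.8)
loss_low, weights_low = wpl_loss(probs, alpha=0.5)
```

With a high threshold only the confident pixel contributes; lowering it lets the second pixel enter the self-supervision, which mirrors the expansion behavior described above. In a real training loop the threshold would be a trainable parameter updated by the optimizer together with the network weights.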
To balance the self-supervised WPL contribution with the supervised learning of segmentation, we employ a probability-based approach where the pseudo-label loss is applied only if a uniformly sampled variable is above a predefined threshold. Hence, the complete UDA loss function takes one form
if we train on source data + target, and another
if we train on translated images + target. Eq. 10 and 11 involve the source image with its label, the target image, the cross-entropy loss, our segmentation network, our bridged GAN, and the Iverson brackets.
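The stochastic gating can be sketched as follows, assuming the supervised and self-supervised losses have already been computed as scalars. The function name `uda_loss` and its arguments are hypothetical.

```python
import random

def uda_loss(supervised_loss, wpl_loss_value, p, rng=random):
    """Supervised loss plus the self-supervised term, applied only when u > p."""
    u = rng.random()                  # u ~ Uniform[0, 1)
    use_wpl = 1.0 if u > p else 0.0   # Iverson bracket [u > p]
    return supervised_loss + use_wpl * wpl_loss_value

# The supervised term comes from either source or translated images, depending on the step.
never_applied = uda_loss(1.0, 5.0, p=1.0)    # u is always below 1, so WPL is skipped
always_applied = uda_loss(1.0, 5.0, p=-0.1)  # u is always above -0.1, so WPL is added
```

A larger `p` makes the self-supervised term rarer, which is how its contribution is balanced against the supervised one.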
4.1 Experimental settings
For i2i and UDA, we use the German dataset Cityscapes as source (clear), and a subset of the American Berkeley DeepDrive (BDD) as target (rain). The bridge dataset, used only for i2i, is a collection of YouTube videos. We now detail each dataset.
We train on the Cityscapes training set, with 2975 pixel-wise annotated images, and evaluate on its validation set, with 500 images. While we train on crops, we evaluate on full-size images. The trainExtra set, with 19997 images, is also included in the domain bridge to further reduce the domain shift.
We use the coarse weather annotations of BDD together with the daylight annotation to obtain a subset we call BDD-rainy (i.e. rainy+daylight). For training, the rainy+daylight subset is extracted from the 100k split, while for validation only the 10k split is used. Duplicates present in both splits are removed. Note that, while the daylight annotation is accurate, the weather annotation is approximate, and "rainy" images may be taken either during or after a rain event, thus with or without drops on the windshield. This further increases complexity.
Five sequences (1280×720) were extracted from a single YouTube channel with the keyword "driving" (2 videos) for clear weather and "driving rain"/"driving heavy rain" (2/1 videos) for rainy scenarios. Using videos from a single channel further reduces domain gaps, ensuring the same acquisition setup. Some samples are shown in Fig. 6. To maximize image diversity, videos are uniformly sub-sampled into 2×6026 clear-weather images and 3×9294 rainy images, for a total of 39934 frames.
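The frame counts and one possible uniform sub-sampling scheme can be checked with a short sketch. It assumes each video contributes the same number of evenly spaced frames; the helper name `uniform_frame_indices` and the example video length are hypothetical.

```python
import numpy as np

def uniform_frame_indices(num_frames, num_samples):
    """Indices of `num_samples` frames spread evenly over a video of `num_frames` frames."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

clear_frames = 2 * 6026   # two clear-weather videos
rainy_frames = 3 * 9294   # three rainy videos
total_frames = clear_frames + rainy_frames

# Hypothetical 100-frame video reduced to 5 evenly spaced frames.
indices = uniform_frame_indices(100, 5)
```

The two products add up to the 39934 frames reported for the bridge dataset.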
4.1.2 Networks details
During training, the images are downsampled to 720 pixels in height and cropped to 480×480 resolution. The network is trained for 200k iterations, with batch size 1. Adam is used as the optimizer, with learning rate 1e-4.
We use Light-weight RefineNet with ResNet-101 as backbone, pretrained on the full-size Cityscapes dataset. The refinement is achieved by training for 100 epochs on 512×512 crops, after downscaling images for GPU memory constraints. We employ data augmentation for the training process, with random rescaling by a factor between 0.5 and 2, and random horizontal flipping. The batch size is 6. We use the SGD optimizer with different learning rates for the encoder (1e-4) and the decoder (1e-3). The momentum is set to 0.9, and the learning rate is divided by 2 every 33 epochs. When pseudo-labels are added to the training, we further refine the network for 70 additional epochs, with a constant learning rate divided by 10 w.r.t. the initial values. The WPL threshold parameter is initialized to 0.8 and is also estimated by SGD, with learning rate 0.01 and momentum 0.9.
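The step schedule described above can be written out explicitly. The function name `step_lr` and the dictionary layout are choices made for this sketch.

```python
def step_lr(base_lr, epoch, step=33, factor=0.5):
    """Learning rate after halving every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# (encoder, decoder) learning rates at the epochs where the schedule changes.
schedule = {e: (step_lr(1e-4, e), step_lr(1e-3, e)) for e in (0, 33, 66, 99)}

# Constant rates for the 70 additional pseudo-label epochs: initial values divided by 10.
finetune_lrs = (1e-4 / 10, 1e-3 / 10)
```

Over the 100 training epochs the rates are halved three times, so the encoder ends at 1.25e-5 and the decoder at 1.25e-4.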
| Method | mIoU | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorbike | bike |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours + OMS | 39.72 | 82.53 | 44.51 | 69.97 | 20.29 | 22.91 | 28.93 | 14.02 | 29.17 | 74.32 | 28.98 | 83.53 | 36.75 | 32.80 | 71.29 | 43.03 | 46.34 | 0.00 | 21.80 | 3.54 |
| Ours + OMS + WPL | 40.04 | 84.03 | 44.09 | 70.51 | 24.10 | 23.02 | 28.31 | 14.08 | 30.07 | 75.31 | 27.89 | 83.49 | 39.10 | 33.63 | 74.70 | 48.60 | 49.34 | 0.00 | 6.77 | 3.80 |
4.2 Bridged image-to-image translation
We evaluate our bridged i2i (Sec. 3.1) on the Cityscapes to BDD-rainy translation task, and compare results against the recent CycleGAN and MUNIT.
As stated, our i2i network is MUNIT-based and is referred to as MUNIT-bridged.
It is trained on the bridged versions of the two datasets.
Training follows the details from Sec. 4.1.2, except for CycleGAN, which follows the original implementation (it reports that the best performance is obtained by keeping the learning rate constant at 2e-4 for half the training process, 100k iterations in this case, and then linearly decreasing it to 0).
We argue, like others, that GAN metrics are not appropriate for such an evaluation. Thus, we report qualitative evaluation and segmentation task evaluation, together with the usual GAN metrics.
Fig. 7 shows randomly selected samples from the Cityscapes validation set (for MUNIT and our method, MUNIT-bridged, the style is also randomized). Both CycleGAN and the original MUNIT fail to model the rain appearance, probably due to the large domain gap. In detail, CycleGAN brings no realistic changes to the scene appearance, only adjusting color levels in the image. The original MUNIT, instead, seems to have collapsed and fails to produce meaningful outputs, probably due to instability related to the domain gap. Conversely, our MUNIT-bridged model is the only one able to add realistic traits of rain to the synthetic images, thanks to the domain bridge.
We compute GAN metrics following usual practices and report results in Tab. 6(a). The LPIPS distance [35, 31] measures image diversity, while the Inception Score evaluates both quality and diversity. In detail, LPIPS is the average over 19 paired translations of 100 images, and we report the diversity of real data in the target dataset as an upper bound. The Inception Score uses the InceptionV3 network previously trained to classify source and target images.
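For reference, the Inception Score is commonly computed as the exponential of the average KL divergence between each image's class posterior and the marginal class distribution. The sketch below assumes the classifier outputs are already available as an array of probabilities; the function name `inception_score` and the toy inputs are hypothetical.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp(mean_x KL(p(y|x) || p(y))) from an (N, K) array of class probabilities."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse predictions over two classes give the maximum score of 2.
score_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
# Identical, uninformative predictions give the minimum score of 1.
score_flat = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

The score rewards predictions that are individually confident and collectively varied, and it says nothing about whether an image looks realistic.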
Overall, we improve over CycleGAN in both metrics, but the original MUNIT scores significantly higher. However, the images generated by MUNIT are evidently unrealistic (cf. Fig. 7), and thus we argue that GAN metrics are unreliable, in line with prior work arguing that the Inception Score is uncorrelated with image quality.
For a more comprehensive evaluation, we train a segmentation network on GAN-translated clear→rain images, and evaluate the standard mIoU metric on real rain images, reporting results in Tab. 6(b).
Note that, for fair comparison, we only sample a single style for MUNIT-based models, and report results when training only on clear images as the baseline.
If the domain gap were reduced by the GAN translations, an improvement should be visible.
Instead, the table shows that training on the original MUNIT-translated dataset leads to a significant performance decrease, contradicting its high GAN metrics. Finally, our method outperforms CycleGAN by a small margin, although CycleGAN fails to produce good-quality images. Conversely, our method simultaneously reduces the domain shift and increases realism.
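The mIoU metric used here can be sketched from a confusion matrix. The function name `mean_iou` and the ignore value 255 are assumptions of this sketch, and predictions are assumed to lie in the range [0, num_classes).

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes present in prediction or ground truth."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    # Rows index the ground-truth class, columns the predicted class.
    confusion = np.bincount(num_classes * target + pred, minlength=num_classes ** 2)
    confusion = confusion.reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0
    return float((intersection[present] / union[present]).mean())

pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 255])   # the last pixel is unlabeled and ignored
miou = mean_iou(pred, target, num_classes=2)
```

In practice the confusion matrix is usually accumulated over the whole validation set before the ratios are taken, rather than averaged per image.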
4.3 Unsupervised Domain Adaptation
We compare against AdaptSegNet and BDL, the best recent works we found. For fair comparison, and given architectural similarities, BDL was adapted to work with the same segmentation network, data augmentation policy and hyperparameters detailed in Sec. 4.1.2.
Quantitative results are shown in Tab. 2, where Ours refers to UDA with only our domain-bridge i2i translation, Ours+OMS also uses our Online Multimodal Style-sampling, and Ours+OMS+WPL also uses our Weighted Pseudo Label. Baseline refers to training without any UDA. Overall, our methodology performs on par with BDL, the best state-of-the-art method, while using a much simpler domain adaptation method, and significantly better than AdaptSegNet. Studying the contributions of OMS and WPL, all components are necessary to reach the best performance. Qualitative evaluation on the target dataset is shown in Fig. 8, and is in line with the quantitative metrics.
Weighted Pseudo Labels.
We evaluate the effectiveness of our WPL proposal and report results in Tab. 3, comparing similar training with either WPL (Ours), batch-wise Pseudo-Label (for which we compute the optimal threshold per class and per batch), or None. For all, the training is performed using as target the whole BDD100k train set (removing duplicates from the 10k split) together with the rainy sequences from the domain-bridge dataset, resulting in over 90k images. Performance is reported on the target BDD-rainy. We do not compare against offline Pseudo Label, as this would be impractical with such a big dataset, and this evaluation is partly encompassed in the BDL comparison (cf. Tab. 2). For WPL, we empirically set the loss weights (Eq. 7) to balance contributions, and the probability threshold (Eq. 10 & 11).
From the results in Tab. 3, WPL performs best and batch-wise Pseudo Label third. The performance decrease for batch-wise Pseudo Label (compared to no self-supervision) may be explained by the fact that the best pixels of each batch are used as pseudo-labels, possibly implying incorrect self-supervision when batch accuracy is low. Instead, our WPL boosts the mIoU on the target dataset, which is expected given its expansion behavior.
In this work, we introduced a novel approach to generate realistic rainy images with an i2i network, while preserving traits of adverse weather that are typically ignored by state-of-the-art architectures. We then extended our system and demonstrated its performance in UDA for semantic segmentation, obtaining, with a simple pipeline, performance on par with the state of the art. Finally, we introduced a novel pseudo-labeling strategy that works with an unlimited number of images and has an optimizable weight parameter used to guide region growing. For future work, we plan to extend our pseudo-labeling approach with class-wise thresholds.
[1] S. T. Barratt and R. Sharma. A note on the inception score. ArXiv, abs/1801.01973, 2018.
[2] M. Biasetton, U. Michieli, G. Agresti, and P. Zanuttigh. Unsupervised domain adaptation for semantic segmentation of urban scenes. 2019.
[3] Y. Chen, W. Li, X. Chen, and L. V. Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1841–1850, 2019.
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[5] C. Creus and G. A. Patow. R4: Realistic rain rendering in realtime. Computers & Graphics, 37(1-2):33–40, 2013.
[6] D. Dai, C. Sakaridis, S. Hecker, and L. Van Gool. Curriculum model adaptation with synthetic and real data for semantic foggy scene understanding. International Journal of Computer Vision, pages 1–23, 2019.
[7] K. Garg and S. K. Nayar. Photorealistic rendering of rain streaks. In ACM Transactions on Graphics (TOG), volume 25, pages 996–1002. ACM, 2006.
[8] S. S. Halder, J.-F. Lalonde, and R. de Charette. Physics-based rendering for improving robustness to rain. arXiv preprint arXiv:1908.10335, 2019.
[9] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
[10] J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
[11] X. Hu, C.-W. Fu, L. Zhu, and P.-A. Heng. Depth-attentional features for single-image rain removal. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[12] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.
[13] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5070–5079, 2019.
[14] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[15] Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks.
[16] Y. Li, L. Yuan, and N. Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6936–6945, 2019.
[17] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[18] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz. Few-shot unsupervised image-to-image translation. arXiv preprint arXiv:1905.01723, 2019.
[19] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2507–2516, 2019.
[20] U. Michieli, M. Biasetton, G. Agresti, and P. Zanuttigh. Adversarial learning and self-teaching techniques for domain adaptation in semantic segmentation, 2019.
[21] S. Mo, M. Cho, and J. Shin. Instagan: Instance-aware image-to-image translation. arXiv preprint arXiv:1812.10889, 2018.
[22] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4500–4509, 2018.
[23] V. Nekrasov, C. Shen, and I. D. Reid. Light-weight refinenet for real-time semantic segmentation. In BMVC, 2018.
[24] H. Porav, T. Bruls, and P. Newman. I can see clearly now: Image restoration via de-raining. arXiv preprint arXiv:1901.00893, 2019.
[25] P. Z. Ramirez, A. Tonioni, and L. Di Stefano. Exploiting semantics in adversarial training for image-level domain adaptation. In 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), pages 49–54. IEEE, 2018.
[26] P. Z. Ramirez, A. Tonioni, S. Salti, and L. Di Stefano. Learning across tasks and domains. arXiv preprint arXiv:1904.04744, 2019.
[27] E. Romera, L. M. Bergasa, K. Yang, J. M. Alvarez, and R. Barea. Bridging the day and night domain gap for semantic segmentation. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 1312–1318. IEEE, 2019.
[28] P. Rousseau, V. Jolivet, and D. Ghazanfarpour. Realistic real-time rain rendering. Computers & Graphics, 30(4):507–518, 2006.
[29] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[30] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Nam Lim, and R. Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3752–3761, 2018.
[31] Z. Shen, M. Huang, J. Shi, X. Xue, and T. Huang. Towards instance-level image-to-image translation. arXiv preprint arXiv:1905.01744, 2019.
[32] N. Tatarchuk. Artist-directable real-time rain rendering in city environments. In ACM SIGGRAPH 2006 Courses, pages 23–64. ACM, 2006.
[33] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7472–7481, 2018.
[34] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2174–2182, 2017.
[35] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[36] Y. Zhang, P. David, and B. Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 2020–2030, 2017.
[37] Y. Zhang, Z. Qiu, T. Yao, D. Liu, and T. Mei. Fully convolutional adaptation networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6810–6818, 2018.
[38] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.
[39] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
[40] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
[41] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pages 289–305, 2018.