Domain Bridge for Unpaired Image-to-Image Translation and Unsupervised Domain Adaptation

10/23/2019 ∙ by Fabio Pizzati, et al.

Image-to-image translation architectures may have limited effectiveness in some circumstances. For example, when generating rainy scenes, they may fail to model typical traits of rain such as water drops, which ultimately reduces the realism of the synthetic images. With our method, called domain bridge, web-crawled data are exploited to reduce the domain gap, leading to the inclusion of previously ignored elements in the generated images. We use a clear-to-rain translation network trained with the domain bridge to extend our work to Unsupervised Domain Adaptation (UDA). In that context, we introduce an online multimodal style-sampling strategy, where image translation multimodality is exploited at training time to improve performance. Finally, a novel approach for self-supervised learning is presented and used to further align the domains. With our contributions, we increase the realism of the generated images while performing on par with the UDA state of the art, using a simpler approach.

1 Introduction

(a) Naive image-to-image translation
(b) Domain-bridged image-to-image translation
Figure 1: Naive image-to-image translation (Fig. 1(a)) learns the domain mapping. Conversely, our domain bridge (Fig. 1(b)) completes the source and target domains with automatically retrieved web-crawled data that share common characteristics, thus easing the image-to-image translation task.

GANs have demonstrated great ability to learn image-to-image (i2i) translation from paired [14] or unpaired images [39, 12] in different domains (e.g. summer/winter, clear/rainy, etc.). The latter relies on cycle-consistency or style/content disentanglement to learn complex mappings in an unsupervised manner, producing respectively a single translation of the source image or a multimodal translation [12]. Unsupervised i2i translation has opened a wide range of applications, especially for autonomous driving, where it may be virtually impossible to acquire the same scene in different domains (e.g. clear/rainy) due to the dynamic nature of the scenes.

Instead, image-to-image translation can be used to synthesize realistic images that are exploitable for both domain adaptation and performance evaluation [9, 16], without additional human-labeling effort. However, state-of-the-art i2i translation networks fail in some situations. For example, when translating clear to rain, the networks tend to change only the global appearance of the scene (wetness, puddles, etc.), ignoring essential traits such as drops on the windshield or reflections. Ultimately, this greatly impacts the realism of the generated images.

In this paper, we present a simple domain-bridging technique (Fig. 1(b)) which, as opposed to standard i2i translation (Fig. 1(a)), benefits from additional sub-domains retrieved automatically from web-crawled data. Our method produces qualitatively significantly better results, especially when the source and target domains are known to be far apart, since the bridge eases the learning of the mapping. We apply our i2i methodology to the case of clear → rainy images, showing that domain bridging leads the translation to preserve drops on the synthetic images. We then extend our work to Unsupervised Domain Adaptation (UDA), for which we also make novel contributions (Fig. 2), and demonstrate that, all together, we perform on par with the most recent UDA methods while being much simpler.

We make three main contributions in our paper:

  • i2i: we propose a novel domain-bridge (Sec. 3.1) to augment automatically the source and target domains and ease i2i mapping,

  • i2i with UDA: online multimodal style-sampling (Sec. 3.2) is applied for UDA, thus increasing the translation diversity,

  • UDA: we propose a novel Weighted Pseudo Label (Sec. 3.2) to benefit from self-supervision without the offline processing required by the original Pseudo Label [15].

2 Related work

Image-to-image translation.

Early work on image-to-image translation was done in [14], where an adversarial method was proposed. Its training process required paired samples of the same scene in two different domains. In [39], instead, cycle consistency is exploited to perform image-to-image translation on unpaired images. [17] assumes the existence of a shared latent space between images in two domains, and exploits it to perform translations across both domains using a single GAN. Recently, much effort has been dedicated to reaching multimodal translation [40, 12, 18]. Other works, instead, make use of additional information, such as bounding boxes [31], instance segmentation maps [21], or semantic segmentation and depth maps [3], to increase the translation quality and diversity.

Synthetic rain modeling.

Synthetic rain models can be used as an additional tool to generate rainy images. Earlier work was done by [7], where an extensive study on drop oscillation modeling was conducted. Most modern approaches are based on rain streak or rain drop modeling. As regards rain drops, a GAN-based approach has been recently presented [24]. However, its training requires images collected by specialized hardware. For rain streaks, one of the most recent approaches [11] provides realistic rendering of rain streaks using a model composed of rain and fog layers. A different method is presented by [8], where physics-based rendering is introduced and used to perform data augmentation for different tasks. Nonetheless, all of those focus only on the rain appearance and do not address other visual characteristics of rainy images, such as reflections. Other rendering-based methods are available [5, 32, 28], but they are strictly dependent on a complete 3D representation of the environment, limiting their applicability to real road scenes.

Domain adaptation for semantic segmentation.

Most methods for domain adaptation are based on adversarial training, as it regularizes the feature extraction process, making it robust to the domain shift [10, 2, 20, 30, 36, 37, 33, 19, 26]. Complementary approaches, instead, connect the two domains with pixel-level regularization, making use of image-to-image translation GANs [22, 25]. Some recent works combine the two to obtain better results [9, 16]. Others do not use adversarial training at all: for example, Zou et al. [41] exploit self-supervised learning and pseudo-labels only, while [38] makes use of mutual learning. Finally, some approaches specialized in adverse weather and night adaptation have been recently introduced [27, 6].

Figure 2: Overview of our pipeline for unsupervised domain adaptation. The blue dashed square means that the GAN parameters are frozen. The image-to-image translation network is trained offline with our domain bridge strategy. Different line colors refer to different probabilities for one path to be executed. Loss functions are denoted with dotted lines.

Figure 3: Rainy images from the Berkeley Deepdrive dataset [34]. Drops on the windshield or reflections help us perceive that it is raining.

3 Method

Our methodology aims to translate clear images to rainy images that are both of high quality under qualitative evaluation and usable for training a semantic segmentation network in rainy weather. Thus, our innovations are spread between image-to-image translation (Sec. 3.1) and Unsupervised Domain Adaptation for semantic segmentation (Sec. 3.2).

3.1 Image-to-image translation (i2i)

Image-to-image translation GAN networks learn to approximate the mapping between two domains using adversarial training, from two sets of representative images, one per domain. Each image in both sets can be interpreted as a sample from a probability distribution associated with its image domain [17]. GANs are well known for their training instability. For training to succeed, the network needs to be fed with representative sets of images, so that it can extract the common domain characteristics. Even so, some domain gaps may be difficult for the network to model, resulting in a loss of characteristic features of the target domain. This may be caused by a significant domain shift or by a lack of data.

Some minor image details still have a significant perceptual impact. This is the case for rain images, where even a few drops on the lens play an important role in sensing the rain, as can be seen in Fig. 3. State-of-the-art networks may ignore fundamental elements such as drops, lens artifacts or reflections, and this ultimately impacts the realism of the generated images. We argue that some characteristic changes (drops, artifacts, etc.) are ignored because they are relatively minor compared to other characteristic changes (e.g. wetness, puddles, etc.), and demonstrate that the training may benefit from bridging to ease domain mapping.

Domain bridging.

Studying the specific case of adverse weather conditions, it is possible to formalize a generic domain as the union of finer-grained domains, such as . In it, represents the sub-domains typical of weather, e.g. the presence of precipitation, road wetness, and many more. , instead, is composed of sub-domains unrelated to weather. Examples are the scenario, the city and the illumination. Thus, it is possible to represent and as

(1)

Generally, only the joint probability distribution is estimable, as we have no knowledge about the marginal probability distributions.

On one hand, we hypothesize that it is possible to obtain a more stable image-to-image translation if the differences between the two datasets are minimized. On the other, we have to obtain a GAN that produces an effective transformation, so it is necessary to correctly model all relevant features of adverse weather. To reach both objectives simultaneously, images collected from web-crawled videos are added to the source and target datasets, obtaining two new training sets whose domains aim at bridging the gap between the initial source and target domains. This is illustrated in Fig. 1.

Our intuition is that adding samples selected with reasonable criteria will reduce the Kullback-Leibler divergence between the probability distributions of the two new training sets, with respect to that between the original ones. As a consequence, the translation model will be more focused on weather-related characteristics and more stable during training.
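The intuition above can be checked on a toy example. The sketch below is not the authors' procedure: it assumes each dataset is summarized by a hypothetical 4-bin histogram, and that the bridge data contribute the same weather-unrelated component to both sides with an assumed mixing fraction `mix`. Under those assumptions, mixing a shared component into both distributions lowers their Kullback-Leibler divergence.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Toy histograms over 4 hypothetical appearance bins (illustrative values only).
p_source = np.array([0.70, 0.20, 0.05, 0.05])  # original source set
p_target = np.array([0.05, 0.05, 0.20, 0.70])  # original target set
p_shared = np.array([0.25, 0.25, 0.25, 0.25])  # component shared by the bridge data

mix = 0.5  # assumed fraction of original images in each bridged set
p_source_bridged = mix * p_source + (1 - mix) * p_shared
p_target_bridged = mix * p_target + (1 - mix) * p_shared

kl_before = kl_divergence(p_source, p_target)
kl_after = kl_divergence(p_source_bridged, p_target_bridged)
print(f"KL before bridging: {kl_before:.3f}")
print(f"KL after bridging:  {kl_after:.3f}")
```

In this toy setting the reduction follows from the joint convexity of the KL divergence; real image distributions are of course not observed as histograms, so this only illustrates the direction of the effect.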

Once the main hypotheses are formalized, an approach to select new images is needed. Let and be two image sets. As before, we have

(2)

We choose and in order to have

(3)

where is the set cardinality. Hence, it is possible to identify two image sets and such as .

It is now possible to train the image-to-image translation network on and defined as:

(4)

By adding new images, the difference in the global appearance of the two domains is minimized, while the weather-related domain shift remains constant.

In other words, our approach consists in selecting new image samples, with weather conditions corresponding to those in the original dataset, and joining them to the existing data. The newly added images are required to share some domains unrelated to weather. Retrieving images from the same location and with the same setup ensures this.

In practice, we retrieved these additional samples from web-crawled videos using domain-related keyword searches (details in Sec. 4.1.1). The same bridging can be applied automatically to other domain shifts, though as the domain differences become less semantically evident, human expertise may be required to properly select the additional datasets.

We use MUNIT [12] as backbone for our image-to-image translation network, as it allows disentanglement of style and content, which will be of high interest for us.

Figure 4: Examples of multimodal style-sampling from our Domain-bridged i2i (Sec. 3.1). Columns: Source, Style 1, Style 2, Style 3. Note the consistency of style regardless of the source image.

3.2 Unsupervised Domain Adaptation (UDA)

Similar to previous works [9, 16], we use our i2i network with domain bridge to translate images from pre-labeled clear-weather datasets and learn semantic segmentation in rain in an unsupervised fashion. We follow the standard UDA practice, which is to alternately train in a supervised manner from source (clear) images with labels and in a self-supervised manner from target (rain) images without labels. Our entire UDA methodology (depicted in Fig. 2) brings two novel contributions: a) we use multimodal clear → rain translations as additional supervised learning, b) we introduce Weighted Pseudo Label, a differentiable extension of Pseudo Label [15], to align source and target without any offline process.

Online Multimodal Style-sampling (OMS).

The standard strategy for UDA with i2i networks is to learn from the offline translation of the whole dataset [9, 16].

We instead propose to use the multimodal capacity of the MUNIT i2i network to generate multiple target styles (i.e. rain appearances) for each input image. Styles are sampled at training time. In this way, even if the source image content remains unaltered, the segmentation network can learn different representations of the same scene in the target domain, ultimately leading to wider diversity and thus more robust detection. Different styles for the same image modify, among other factors, the position and size of drops on the windshield and the intensity of reflections. This is visible in Fig. 4, which shows three arbitrarily sampled styles, where Style 3 consistently produces images that resemble heavy rain.
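The online sampling idea can be sketched as a data loop that draws a fresh style code at every iteration, instead of translating the dataset once offline. This is a minimal illustration only: `translate` is a hypothetical stand-in for the frozen generator (it is not MUNIT), and `STYLE_DIM` and the Gaussian prior are assumptions made for the example.

```python
import numpy as np

STYLE_DIM = 8  # assumed style-code size, chosen for illustration

def translate(image, style):
    """Hypothetical stand-in for the frozen i2i generator.

    The content (pixel layout) is kept; the style code only modulates appearance.
    """
    gain = 1.0 + 0.1 * np.tanh(style.mean())
    bias = 0.05 * np.tanh(style.std())
    return np.clip(gain * image + bias, 0.0, 1.0)

def oms_batches(images, labels, rng, n_iters):
    """Yield (translated image, label) pairs with a new random style per iteration."""
    for it in range(n_iters):
        idx = it % len(images)
        style = rng.standard_normal(STYLE_DIM)  # style sampled online
        # The source label is reused as-is, since translation keeps the content.
        yield translate(images[idx], style), labels[idx]

rng = np.random.default_rng(0)
images = [rng.random((4, 4, 3))]
labels = [np.zeros((4, 4), dtype=int)]
first, second = [x for x, _ in oms_batches(images, labels, rng, n_iters=2)]
```

Here `first` and `second` are two different renderings of the same source image that share one label map, which is the property the segmentation network exploits.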

Weighted Pseudo-Labels (WPL).

Pseudo Labels [15] were proposed as a self-supervised loss to further align source and target distributions. The principle is to self-train a network on target (here, rain) whenever the prediction confidence is above some threshold, thus reinforcing the network beliefs.

Most often for UDA, thresholds are calculated offline as the median per-class confidence dataset-wise [16, 41]. This requires storing all predictions for the whole dataset, which is cumbersome. To overcome this, thresholds may be estimated online image-wise or batch-wise [13]. Thresholds are critical, since pseudo-labeling can harm global performance if thresholds are underestimated (in that case, the ratio of wrong pixels over pseudo-labeled pixels will be too high and lead to incorrect self-supervision) or have limited impact if they are overestimated [15].
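The thresholded pseudo-labeling described above can be sketched as follows. This is a generic illustration, not code from [15], [16] or [41]: the function names, the `ignore_index` value of 255, and the 2-class toy input are assumptions. `median_class_thresholds` computes the per-class median confidence over whatever predictions it receives, so passing a single image gives image-wise thresholds, while the dataset-wise variant would require all predictions at once.

```python
import numpy as np

def pseudo_labels(probs, thresholds, ignore_index=255):
    """Keep the arg-max class where its confidence reaches the class threshold.

    probs: (C, H, W) softmax output; thresholds: (C,) per-class thresholds.
    Pixels below threshold get ignore_index and are excluded from self-training.
    """
    labels = probs.argmax(axis=0)
    confidence = probs.max(axis=0)
    keep = confidence >= thresholds[labels]
    return np.where(keep, labels, ignore_index)

def median_class_thresholds(probs, default=1.0):
    """Per-class median confidence over the pixels predicted as that class."""
    labels = probs.argmax(axis=0)
    confidence = probs.max(axis=0)
    thresholds = np.full(probs.shape[0], default)
    for c in np.unique(labels):
        thresholds[c] = np.median(confidence[labels == c])
    return thresholds

# Toy prediction: 2 classes on a 2x2 image.
probs = np.array([[[0.9, 0.6], [0.2, 0.4]],
                  [[0.1, 0.4], [0.8, 0.6]]])
fixed = pseudo_labels(probs, np.array([0.7, 0.7]))
image_wise = pseudo_labels(probs, median_class_thresholds(probs))
```

With a low threshold more pixels are kept but more of them are wrong; with a high threshold few pixels survive, which is the trade-off discussed in the text.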

We instead propose Weighted Pseudo-Labels (WPL), which estimates a global threshold within the network optimization process. The general principle of WPL is to weight the self-supervised cross-entropy using a learned threshold , thus acting as continuous pseudo-labeling. Not only does WPL not require offline processing, but it is also aware of the global network confidence, thus leading to better results. In detail, let be an input image and the pseudo label of pixel , such that

(5)

where refers to the class probabilities of predicted by the network , and is the best class prediction. In its original implementation [15], is directly used to weight the cross-entropy self-supervision. Instead, we weight this with a weight matrix of the same size as :

(6)

The complete loss for WPL is thus defined as the weighted sum of cross-entropy loss and a balancing loss :

(7)

where and are loss weights and cross-entropy loss is:

(8)

is the one-hot encoding of pseudo-label as in Eq. 5 and is the set of classes. In this way, predictions where the network is uncertain are weighted less in the network pseudo-label based training. To avoid that the self-supervised contribution remains set to zero by the optimizer, is required as a balancing loss:

(9)
Figure 5: Analysis of the effect of optimization. During training the Weighted Pseudo Label expands from high confidence pixels only (left) to lower ones (right).

The optimization on leads to a pseudo-label expansion within the training process. Fig. 5 is an illustration of the growing process during training. For the first iteration (Fig. 5, left), the term prevails over , pushing towards 1 thus including in the pseudo-label only pixels with high confidence. With the minimization of (Fig. 5, right), becomes gradually more important, leading the network to simultaneously include lower confidence pixels inside the pseudo-label, and increasing the informative potential of higher-confidence labels. Note that for numerical stability, we assume and estimate .

Losses.

To balance the self-supervised WPL contribution with the supervised learning in segmentation, we employ a probability-based approach where pseudo-label is applied only if a uniformly sampled variable is above a predefined threshold . Hence, the complete UDA loss function is:

(10)

if we train on source data + target, and

(11)

if we train on translated images + target. In Eq. 10 and 11, is source image with label , target image, the cross-entropy loss, our segmentation network, our bridged-GAN, and are the Iverson brackets.
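The probability-based gating can be sketched with scalar losses. This is an illustration under assumptions: the supervised and self-supervised terms are taken to be simply added, and `uda_loss`, `u`, and `p` are names chosen here for the uniformly sampled variable and the predefined threshold.

```python
import random

def uda_loss(supervised_loss, wpl_loss, u, p):
    """Supervised term plus the self-supervised term gated by the Iverson bracket [u > p]."""
    gate = 1.0 if u > p else 0.0
    return supervised_loss + gate * wpl_loss

def fraction_of_steps_with_wpl(steps, p, rng):
    """Draw u ~ U(0, 1) at every step and count how often the pseudo-label term is active."""
    active = 0
    for _ in range(steps):
        u = rng.random()
        # A zero supervised loss isolates the gated contribution.
        active += uda_loss(0.0, 1.0, u, p) > 0.0
    return active / steps

rng = random.Random(0)
fraction = fraction_of_steps_with_wpl(steps=1000, p=0.5, rng=rng)
```

Raising `p` makes the self-supervised term rarer, so the threshold acts as a knob on how much the pseudo-label loss contributes relative to supervised training.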

4 Experiments

We now evaluate the performance of both our i2i proposal (Sec. 4.2) and our UDA proposal (Sec. 4.3) on the clear → rain problem using clear/rain datasets recorded with different setups.

4.1 Experimental settings

4.1.1 Datasets

For i2i and UDA, we use the German dataset Cityscapes [4] as source (clear), and a subset of the American Berkeley DeepDrive [34] (BDD) as target (rain). The bridge dataset, only used for the i2i, is a collection of YouTube videos. We now detail each dataset.

Cityscapes.

We train on the Cityscapes training set with 2975 pixel-wise annotated images, and evaluate on its validation set with 500 images. While we train on crops, we evaluate on full-size images, i.e. 2048×1024. The trainExtra set, with 19997 images, is also included in the domain bridge to further reduce the domain shift.

BDD-rainy.

We use the coarse weather annotations of BDD together with the daylight annotation to obtain a subset we call BDD-rainy (i.e. rainy+daylight). For training, the rainy+daylight subset is extracted from the 100k split, while for validation only the 10k split is used. Duplicates present in both splits are removed. It has to be noted that, while the daylight annotation is accurate, the weather annotation is approximate and “rainy” images may be taken either during or after a rain event, thus with or without drops on the windshield. This further increases complexity.

Bridge dataset.

Figure 6: Samples from the bridge dataset in different weather conditions (panels: Clear weather, Rain). Note that the acquisition setup (camera position, optics, etc.) remains unaltered.

5 sequences (1280×720) were extracted from a single YouTube channel with keywords “driving” (2 videos) for clear weather and “driving rain”/“driving heavy rain” (2/1 videos) for rainy scenarios. Using videos from a single channel further reduces domain gaps, ensuring the same acquisition setup. Some samples are shown in Fig. 6. To maximize image diversity, videos are uniformly sub-sampled into 2×6026 clear-weather images and 3×9294 rainy images, leading to a total of 39934 frames.
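Uniform sub-sampling of a video can be sketched as picking evenly spaced frame indices. The helper below is a hypothetical illustration (the authors' extraction tooling is not described), and the 30-minute, 30 fps video length is an assumed example value. The frame totals, in contrast, are the ones reported in the text.

```python
import numpy as np

def uniform_frame_indices(n_frames_in_video, n_frames_to_keep):
    """Evenly spaced frame indices covering the whole video."""
    idx = np.linspace(0, n_frames_in_video - 1, num=n_frames_to_keep)
    return np.unique(idx.round().astype(int))

# Frame counts reported in the text: 2 clear videos and 3 rainy videos.
clear_total = 2 * 6026
rain_total = 3 * 9294
bridge_total = clear_total + rain_total  # matches the reported 39934 frames

# Hypothetical 30-minute video at 30 fps, sub-sampled to 6026 frames.
indices = uniform_frame_indices(n_frames_in_video=30 * 60 * 30, n_frames_to_keep=6026)
```

Spreading the kept frames over the full video avoids near-duplicate consecutive frames, which is the stated motivation for sub-sampling.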

Network          LPIPS    IS
Real images      0.7137   -
CycleGAN [39]    0.1146   1.15
MUNIT [12]       0.3534   1.92
MUNIT-Bridged    0.2055   1.69
(a) GAN metrics

Network          mIoU
Baseline         31.67
CycleGAN [39]    35.09
MUNIT [12]       20.78
MUNIT-Bridged    35.18
(b) Semantic Segmentation

Table 1: Quantitative evaluation on translated image realism, diversity, and semantic segmentation effectiveness.

4.1.2 Networks details

Image-to-image translation

During training, the images are downsampled to be 720 pixels in height, and cropped to 480×480 resolution. The network is trained for 200k iterations, with batch size 1. Adam is used as optimizer, with learning rate 1e-4.
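The resize-then-crop preprocessing can be sketched at the level of image geometry. The sketch assumes that downsampling preserves the aspect ratio and that the crop position is uniformly random; neither detail is stated in the text, and the function names are illustrative.

```python
import numpy as np

def resized_size(width, height, target_height=720):
    """Size after downsampling to target_height, keeping the aspect ratio (assumed)."""
    scale = target_height / height
    return round(width * scale), target_height

def random_crop_box(width, height, crop=480, rng=None):
    """Corners (left, top, right, bottom) of a random crop x crop window inside the image."""
    if rng is None:
        rng = np.random.default_rng()
    left = int(rng.integers(0, width - crop + 1))
    top = int(rng.integers(0, height - crop + 1))
    return left, top, left + crop, top + crop

w, h = resized_size(2048, 1024)  # a full-size Cityscapes frame
box = random_crop_box(w, h, rng=np.random.default_rng(0))
```

For a 2048×1024 input this gives a 1440×720 image, from which the 480×480 training window is drawn.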

Segmentation.

We use Light-weight RefineNet [23] with ResNet-101 as backbone, pretrained on the full-size Cityscapes dataset. The refining is achieved by training for 100 epochs on 512×512 crops, after downscaling images for GPU memory constraints. We employ data augmentation for the training process, with random rescaling by a factor between 0.5 and 2, and random horizontal flipping. The batch size is 6. We use the SGD optimizer with different learning rates for the encoder (1e-4) and the decoder (1e-3). The momentum is set to 0.9, and the learning rate is divided by 2 every 33 epochs. When pseudo-labels are added to the training, we further refine the network for 70 additional epochs, with a constant learning rate divided by 10 w.r.t. the initial values. The WPL threshold parameter is initialized to 0.8 and estimated by SGD as well, with learning rate 0.01 and momentum 0.9.
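The learning-rate schedule described above can be written down directly. The sketch only encodes the numbers given in the text (halving every 33 epochs, encoder and decoder base rates, and the tenfold reduction for the pseudo-label refinement stage); the function and constant names are illustrative.

```python
def step_lr(base_lr, epoch, step=33, factor=0.5):
    """Learning rate after dividing by 2 every `step` epochs."""
    return base_lr * factor ** (epoch // step)

ENCODER_LR, DECODER_LR = 1e-4, 1e-3

for epoch in (0, 32, 33, 66, 99):
    print(epoch, step_lr(ENCODER_LR, epoch), step_lr(DECODER_LR, epoch))

# Pseudo-label refinement stage: constant rates, 10x lower than the initial values.
refine_encoder_lr, refine_decoder_lr = ENCODER_LR / 10, DECODER_LR / 10
```

Over the 100 training epochs the rate is halved three times, so the last epoch runs at one eighth of the initial value.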

Figure 7: Qualitative comparison between state-of-the-art architectures for i2i in the clear → rain transformation. Panels: Image, CycleGAN [39], MUNIT [12], and three samples from ours (MUNIT-Bridged).

Figure 8: Comparison of our method with the state-of-the-art for semantic segmentation UDA. Panels: Image, Ground truth, Baseline, AdaptSegNet [33], BDL [16], Ours, Ours + OMS, Ours + OMS + WPL.

Method            mIoU   road   sidewalk  building  wall   fence  pole   t light  t sign  veg    terrain  sky    person  rider  car    truck  bus    train  m. bike  bike
Baseline          31.67  77.40  39.95     61.20     12.01  24.76  23.68  13.21    24.11   58.33  27.18    78.86  24.73   12.78  63.34  24.01  28.43  0.00   4.76     2.90
AdaptSegNet [33]  33.44  82.23  39.85     62.06     9.84   17.73  20.39  10.91    22.47   66.30  22.81    76.54  32.24   38.49  68.95  13.08  30.31  0.00   17.97    3.26
BDL [16]          39.60  83.18  48.78     73.93     30.87  27.33  26.03  15.10    26.05   72.63  26.08    88.01  28.59   26.59  76.37  43.31  50.11  0.00   7.38     2.10
Ours              35.18  79.00  37.23     62.36     8.60   14.78  20.98  11.94    22.92   68.02  13.11    82.55  38.96   44.61  72.34  29.10  39.40  0.00   19.32    3.16
Ours + OMS        39.72  82.53  44.51     69.97     20.29  22.91  28.93  14.02    29.17   74.32  28.98    83.53  36.75   32.80  71.29  43.03  46.34  0.00   21.80    3.54
Ours + OMS + WPL  40.04  84.03  44.09     70.51     24.10  23.02  28.31  14.08    30.07   75.31  27.89    83.49  39.10   33.63  74.70  48.60  49.34  0.00   6.77     3.80
Table 2: State-of-the-art comparison. OMS refers to Online Multimodal Style-sampling. WPL is the Weighted Pseudo Labels strategy.

Pseudo-labels    mIoU (target)
None             39.77
Batch-wise       38.23
WPL (Ours)       40.04
Table 3: Comparison of various Pseudo Labeling strategies: Batch-wise, with our WPL, or with None.

4.2 Bridged image-to-image translation

We evaluate our bridged i2i (Sec. 3.1) on the Cityscapes to BDD-rainy translation task, and compare results against the recent CycleGAN [39] and MUNIT [12]. As stated, our i2i uses a MUNIT backbone and is referred to as MUNIT-bridged. It is trained on the bridged versions of the two datasets. Training follows the details from Sec. 4.1.2, except for CycleGAN, which follows the original implementation ([39] claims that the best performance is obtained by keeping the learning rate constant at 2e-4 for half the training process, 100k iterations in this case, and then linearly decreasing it to 0).
We argue, like others, that GAN metrics are not appropriate for such an evaluation. Thus, we report qualitative evaluation and segmentation task evaluation, together with the usual GAN metrics.

Qualitative evaluation.

Fig. 7 shows randomly selected samples from the Cityscapes validation set (for MUNIT and our method, MUNIT-bridged, we also randomize the style). It is visible that both CycleGAN and the original MUNIT method fail at modeling the rain appearance, probably due to the large domain gap. In detail, CycleGAN brings no realistic changes to the scene appearance, only adjusting color levels in the image. Original MUNIT, instead, seems to have collapsed and fails to produce meaningful outputs, probably due to instability related to the domain gap. Conversely, our MUNIT-bridged model is the only one able to add realistic traits of rain to the synthetic images, thanks to the domain bridge.

Quantitative evaluation.

We compute GAN metrics following the usual practices from [12] and report results in Tab. 1(a). The LPIPS distance [35, 31] measures the image diversity [12], while the Inception Score evaluates both quality and diversity [29]. In detail, LPIPS is the average over 19 paired translations of 100 images, and we report the diversity of real data in the target dataset as an upper bound. Inception Score uses the InceptionV3 network previously trained to classify source and target images.
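The diversity measurement can be sketched as an average distance over random pairs of translations of the same input. This follows one reading of the protocol (19 pairs per input image, averaged over all inputs), which is an assumption. The `l1_distance` function is a placeholder and is not LPIPS, which requires a pretrained network.

```python
import numpy as np

def mean_pairwise_diversity(translations_per_image, distance, n_pairs=19, rng=None):
    """Average distance over n_pairs random pairs of translations per input image.

    translations_per_image: list (one entry per input) of lists of translated images.
    distance: any image distance; a perceptual one in practice, a stand-in here.
    """
    if rng is None:
        rng = np.random.default_rng()
    scores = []
    for outputs in translations_per_image:
        for _ in range(n_pairs):
            i, j = rng.choice(len(outputs), size=2, replace=False)
            scores.append(distance(outputs[i], outputs[j]))
    return float(np.mean(scores))

def l1_distance(a, b):
    """Placeholder metric (NOT LPIPS): mean absolute pixel difference."""
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(0)
diverse = [[rng.random((8, 8, 3)) for _ in range(5)] for _ in range(3)]
collapsed = [[np.ones((8, 8, 3))] * 5 for _ in range(3)]
diverse_score = mean_pairwise_diversity(diverse, l1_distance, rng=rng)
collapsed_score = mean_pairwise_diversity(collapsed, l1_distance, rng=rng)
```

A generator that always returns the same output scores zero, whereas a generator producing varied but unrealistic images can still score high, which is why the metric alone says little about realism.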


Overall, we successfully improve performance over CycleGAN in both metrics, but the original MUNIT scores significantly higher. However, the images generated by MUNIT are evidently unrealistic (cf. Fig. 7), and thus we argue that GAN metrics are unreliable, which is in fair alignment with [1], advocating that Inception Score is uncorrelated with image quality.

For a more comprehensive evaluation, we train a segmentation network on GAN-translated clear → rain images, and evaluate the standard mIoU metric on real rain images, reporting results in Tab. 1(b). Note that for fair comparison, we only sample a single style for MUNIT-based models, and report results when only trained on clear images as baseline. If the domain gap were reduced by the GAN translations, an improvement should be visible.
Instead, from the table, training on the original MUNIT-translated dataset leads to a significant performance decrease (20.78 vs. 31.67 mIoU for the baseline), contradicting its high GAN metrics. Finally, our method outperforms CycleGAN by a small margin (35.18 vs. 35.09), although CycleGAN fails to produce good-quality images. Conversely, our method simultaneously reduces the domain shift and increases realism.
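For reference, the mIoU metric can be sketched from a pixel-level confusion matrix. This is a generic implementation, not the authors' evaluation code: the `ignore_index` of 255 and the choice to skip classes absent from both prediction and ground truth are assumptions. The last lines check that the Baseline mIoU reported in Table 2 equals the plain average of its 19 per-class values.

```python
import numpy as np

def confusion_matrix(pred, target, n_classes, ignore_index=255):
    """Pixel-level confusion matrix; rows are ground-truth classes, columns predictions."""
    valid = target != ignore_index
    idx = n_classes * target[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def mean_iou(conf):
    """Mean of per-class IoU = TP / (TP + FP + FN), skipping classes absent from both."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    return float(np.mean(tp[present] / union[present]))

# Toy 2x2 example with one ignored pixel.
target = np.array([[0, 0], [1, 255]])
pred = np.array([[0, 1], [1, 1]])
miou = mean_iou(confusion_matrix(pred, target, n_classes=2))

# Per-class IoU values of the Baseline row in Table 2 (19 classes).
baseline = [77.40, 39.95, 61.20, 12.01, 24.76, 23.68, 13.21, 24.11, 58.33, 27.18,
            78.86, 24.73, 12.78, 63.34, 24.01, 28.43, 0.00, 4.76, 2.90]
baseline_miou = round(float(np.mean(baseline)), 2)
```

The check shows that the zero-scoring train class is counted in the reported average, so a class that is never predicted correctly lowers mIoU by roughly one nineteenth of the mean.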

4.3 Unsupervised Domain Adaptation

We now evaluate our UDA contributions encompassing our i2i translation methodology and compare with AdaptSegNet [33] and BDL [16], the best recent works we found. For fair comparison and given architectural similarities, BDL was adapted to work with the same segmentation network, data augmentation policy and hyperparameters detailed in Sec. 4.1.2.

Quantitative results are shown in Tab. 2, where Ours refers to UDA with only our domain-bridge i2i translation, Ours+OMS also uses our Online Multimodal Style-sampling, and Ours+OMS+WPL also uses our Weighted Pseudo Label. Baseline refers to training without any UDA. Overall, our methodology performs on par (40.04 vs. 39.60 mIoU) with BDL, the best state-of-the-art method, using a much simpler domain adaptation method, and significantly better (40.04 vs. 33.44 mIoU) than AdaptSegNet. Studying the contributions of OMS and WPL, all components are necessary to reach the best performance. Qualitative evaluation on the target dataset is shown in Fig. 8, and is in fair alignment with the quantitative metrics.

Weighted Pseudo Labels.

We evaluate the effectiveness of our WPL proposal and report results in Tab. 3, comparing similar training with either WPL (Ours), batch-wise Pseudo-Label (for which we compute the optimal threshold per class and per batch), or None. For all, the training is performed using as target the whole BDD100k train set (removing duplicates from the 10k split) together with the rainy sequences from the Domain-bridge dataset, resulting in over 90k images. Performance is reported on target BDD-rainy. We do not compare against offline Pseudo Label, as this would be impractical with such a big dataset, and this evaluation is partly encompassed in the BDL comparison (cf. Tab. 2). For WPL, we empirically set , (Eq. 7) to balance contributions and (Eq. 10 & 11).

From results in Tab. 3, WPL performs the best and batch-wise Pseudo Label the worst. In fact, the performance decrease for batch-wise (38.23 vs. 39.77 with no self-supervision) may be explained by the fact that the best pixels of each batch are used as pseudo-labels, thus possibly implying some incorrect self-supervision in case of low batch accuracy. Instead, our WPL boosts the mIoU on the target dataset (40.04), which is expected due to its expansion behavior.

5 Conclusions

In this work, we introduced a novel approach to generate realistic rainy images with an i2i network, while preserving traits of adverse weather that are typically ignored by state-of-the-art architectures. Then, we extended our system and demonstrated its performance in UDA for semantic segmentation, obtaining with a simple pipeline performance on par with the state of the art. Finally, we introduced a novel pseudo labeling strategy that works with an unlimited number of images, and has an optimizable weight parameter used to guide region growing. For future work, we plan to extend our pseudo labeling approach with class-wise thresholds.

References

  • [1] S. T. Barratt and R. Sharma. A note on the inception score. ArXiv, abs/1801.01973, 2018.
  • [2] M. Biasetton, U. Michieli, G. Agresti, and P. Zanuttigh. Unsupervised domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [3] Y. Chen, W. Li, X. Chen, and L. V. Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1841–1850, 2019.
  • [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [5] C. Creus and G. A. Patow. R4: Realistic rain rendering in realtime. Computers & Graphics, 37(1-2):33–40, 2013.
  • [6] D. Dai, C. Sakaridis, S. Hecker, and L. Van Gool. Curriculum model adaptation with synthetic and real data for semantic foggy scene understanding. International Journal of Computer Vision, pages 1–23, 2019.
  • [7] K. Garg and S. K. Nayar. Photorealistic rendering of rain streaks. In ACM Transactions on Graphics (TOG), volume 25, pages 996–1002. ACM, 2006.
  • [8] S. S. Halder, J.-F. Lalonde, and R. de Charette. Physics-based rendering for improving robustness to rain. arXiv preprint arXiv:1908.10335, 2019.
  • [9] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
  • [10] J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
  • [11] X. Hu, C.-W. Fu, L. Zhu, and P.-A. Heng. Depth-attentional features for single-image rain removal. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [12] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.
  • [13] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5070–5079, 2019.
  • [14] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [15] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks.
  • [16] Y. Li, L. Yuan, and N. Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6936–6945, 2019.
  • [17] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pages 700–708, 2017.
  • [18] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz. Few-shot unsupervised image-to-image translation. arXiv preprint arXiv:1905.01723, 2019.
  • [19] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2507–2516, 2019.
  • [20] U. Michieli, M. Biasetton, G. Agresti, and P. Zanuttigh. Adversarial learning and self-teaching techniques for domain adaptation in semantic segmentation, 2019.
  • [21] S. Mo, M. Cho, and J. Shin. InstaGAN: Instance-aware image-to-image translation. arXiv preprint arXiv:1812.10889, 2018.
  • [22] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4500–4509, 2018.
  • [23] V. Nekrasov, C. Shen, and I. D. Reid. Light-weight RefineNet for real-time semantic segmentation. In BMVC, 2018.
  • [24] H. Porav, T. Bruls, and P. Newman. I can see clearly now: Image restoration via de-raining. arXiv preprint arXiv:1901.00893, 2019.
  • [25] P. Z. Ramirez, A. Tonioni, and L. Di Stefano. Exploiting semantics in adversarial training for image-level domain adaptation. In 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), pages 49–54. IEEE, 2018.
  • [26] P. Z. Ramirez, A. Tonioni, S. Salti, and L. Di Stefano. Learning across tasks and domains. arXiv preprint arXiv:1904.04744, 2019.
  • [27] E. Romera, L. M. Bergasa, K. Yang, J. M. Alvarez, and R. Barea. Bridging the day and night domain gap for semantic segmentation. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 1312–1318. IEEE, 2019.
  • [28] P. Rousseau, V. Jolivet, and D. Ghazanfarpour. Realistic real-time rain rendering. Computers & Graphics, 30(4):507–518, 2006.
  • [29] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
  • [30] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Nam Lim, and R. Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3752–3761, 2018.
  • [31] Z. Shen, M. Huang, J. Shi, X. Xue, and T. Huang. Towards instance-level image-to-image translation. arXiv preprint arXiv:1905.01744, 2019.
  • [32] N. Tatarchuk. Artist-directable real-time rain rendering in city environments. In ACM SIGGRAPH 2006 Courses, pages 23–64. ACM, 2006.
  • [33] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7472–7481, 2018.
  • [34] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2174–2182, 2017.
  • [35] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  • [36] Y. Zhang, P. David, and B. Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 2020–2030, 2017.
  • [37] Y. Zhang, Z. Qiu, T. Yao, D. Liu, and T. Mei. Fully convolutional adaptation networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6810–6818, 2018.
  • [38] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.
  • [39] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
  • [40] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
  • [41] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pages 289–305, 2018.