DLOW: Domain Flow for Adaptation and Generalization

12/13/2018 ∙ by Rui Gong, et al. ∙ ETH Zurich

In this work, we propose a domain flow generation (DLOW) approach to model the domain shift between two domains by generating a continuous sequence of intermediate domains flowing from one domain to the other. The benefits of our DLOW model are two-fold. First, it is able to transfer source images into different styles in the intermediate domains. The transferred images smoothly bridge the gap between the source and target domains, thus easing the domain adaptation task. Second, when multiple target domains are provided in the training phase, our DLOW model can be learnt to generate new styles of images that are unseen in the training data. We implement our DLOW model based on the state-of-the-art CycleGAN. A domainness variable is introduced to guide the model to generate the desired intermediate domain images. In the inference phase, a flow of images in various styles can be obtained by varying the domainness variable. We demonstrate the effectiveness of our approach on both the cross-domain semantic segmentation and style generalization tasks on benchmark datasets.


1 Introduction

The domain shift problem has drawn more and more attention in recent years [19, 58, 50, 48, 13, 6]. In particular, two tasks are of interest to the computer vision community. One is the domain adaptation problem, where the goal is to learn a model for a given task from a label-rich data domain (i.e., the source domain) that performs well in a label-scarce data domain (i.e., the target domain). The other is the image translation problem, where the goal is to transfer images in the source domain to mimic the image style of a target domain.

Figure 1: Illustration of data flow generation. Traditional image translation methods directly map the image from the source domain to the target domain, while our DLOW model is able to produce a sequence of intermediate domains shifting from the source domain to the target domain.

Generally, most existing works focus on the target domain only. They aim to learn models that well fit the target data distribution, e.g., achieving good classification accuracy in the target domain, or transferring source images into the target style. In this work, we instead are interested in the intermediate domains between the source and target domains. We propose a new domain flow generation (DLOW) approach, which is able to translate images from the source domain into an arbitrary intermediate domain between source and target domains. As shown in Fig 1, by translating a source image along the domain flow from the source domain to the target domain, we obtain a sequence of images that naturally characterize the distribution shift from the source domain to the target domain.

The benefits of our DLOW approach are two-fold. First, the intermediate domains help to bridge the distribution gap between the two domains. By translating images into intermediate domains, the translated images can be employed to ease the domain adaptation task. We show that traditional domain adaptation methods can be boosted to achieve better performance in the target domain with intermediate domain images. Moreover, the obtained models also exhibit good generalization ability on new datasets that are unseen in the training phase, benefiting from the diverse intermediate domain images.

Second, our DLOW model can be used for style generalization. Traditional image-to-image translation works [58, 25, 27, 35] focus on learning a deterministic one-to-one mapping that transfers a source image into the target style. In contrast, our DLOW model allows translating a source image into an intermediate domain that is related to multiple target domains. For example, when performing photo-to-painting transfer, instead of obtaining a Monet or Van Gogh style, our DLOW model can produce a painting with a mixture of the Van Gogh, Monet, and other styles. Such a mixture can be customized arbitrarily in the inference phase by simply adjusting an input vector that encodes the relatedness to the different domains.

We implement our DLOW model based on CycleGAN [58], which is one of the state-of-the-art unpaired image-to-image translation methods. We augment CycleGAN to take a domainness variable as an additional input. On one hand, the domainness variable is injected into the translation network through a conditional instance normalization layer to affect the style of the output images. On the other hand, it is also used to weight the discriminators so as to balance the relatedness of the output images to the source and target domains. For multiple target domains, the domainness variable is extended to a vector containing the relatedness to all target domains.

We evaluate our DLOW model on two tasks, mixed-style image translation and domain adaptation. For the first task, we show that our learnt model is able to translate a source image into an arbitrary mixture of multiple styles. For the second task, we further improve state-of-the-art cross-domain semantic segmentation methods by using the translated images in intermediate domains as training data. Extensive results on benchmark datasets demonstrate the effectiveness of our proposed model.

2 Related Work

Image to Image Translation Our work is related to image-to-image translation, which aims to translate images from one domain into another. Inspired by the success of Generative Adversarial Networks (GANs) [15], many works have addressed image-to-image translation based on GANs [25, 52, 58, 35, 36, 18, 59, 24, 1, 6, 30, 53, 34]. The early works [25, 52] assumed that paired images between the two domains are available, while recent works such as CycleGAN [58], DiscoGAN [27] and UNIT [35] employ a cycle consistency loss to learn the mapping without paired images. However, those works focus on learning deterministic image-to-image mappings: once the model is learnt, a source image can only be transferred to a fixed target style, and vice versa. A few recent works [36, 18, 59, 24, 1, 6, 30, 53, 34] concentrate on learning a unified model to translate images into different styles. For instance, the Augmented CycleGAN [1] injects a noise input into CycleGAN via Conditional Normalization (CN) [23, 8], leading to style variance in the translated images. The StarGAN model [6] adds a mask vector to conditional GANs to train a single model that produces styles mimicking multiple datasets. However, these methods either exploit the style variance within a single domain, or train a unified model to switch among different target domains, and it is unclear how they could generalize to a new unseen domain. In contrast, when multiple target domains are available, our work is able to transfer images into an arbitrary intermediate domain related to those multiple domains. This allows us to translate an image into a new style that is unseen in the training data.

Domain Adaptation and Generalization Our work is also related to domain adaptation and generalization. Domain adaptation aims to utilize a labeled source domain to learn a model that performs well on an unlabeled target domain [11, 16, 10, 51, 26, 3, 28, 14, 29]. Domain generalization is a similar problem, which aims to learn a model that generalizes to an unseen target domain by using multiple labeled source domains [39, 13, 42, 38, 41, 31, 33, 32].

Our work is partially inspired by SGF [16] and GFK [14], which have shown that the intermediate domains between source and target domains are useful for addressing the domain adaptation problem. They represented each domain as a subspace, and then connected them on the Grassmannian manifold to model intermediate domains. Different from them, we model the intermediate domains by directly translating images at the pixel level. This allows us to easily improve existing deep domain adaptation models by using the translated images as training data. Moreover, our model can also be applied to image-level domain generalization by generating mixed-style images.

Recently, there has been increasing interest in applying domain adaptation techniques to semantic segmentation from synthetic data to real scenarios [20, 19, 5, 61, 37, 22, 9, 43, 47, 49, 21, 44, 56, 50, 40, 46, 48, 60]. Most of those works conduct domain adaptation by adversarial training at the feature level with different priors. The recent CyCADA [19] also shows that it is beneficial to first perform pixel-level domain adaptation by transferring source images into the target style with image-to-image translation methods such as CycleGAN [58]. However, those methods address the domain shift by focusing on adapting to only the target domain. In contrast, we perform pixel-level adaptation by transferring source images to a flow of intermediate domains, which our experiments show to be more effective than focusing on the target domain only. Moreover, our model can also be used to further improve existing feature-level adaptation methods.

3 Domain Flow Generation

In this section, we introduce the domain flow generation (DLOW) model for translating source images into intermediate domains that bridge the source and target domains.

3.1 Problem Statement

In the domain shift problem, we are given a source domain $\mathcal{S}$ and a target domain $\mathcal{T}$ containing samples from two different distributions $P_S(x)$ and $P_T(x)$, respectively. Denoting $x^s \in \mathcal{S}$ as a source domain sample and $x^t \in \mathcal{T}$ as a target domain sample, we have $x^s \sim P_S(x)$, $x^t \sim P_T(x)$, and $P_S(x) \neq P_T(x)$.

Such distribution mismatch usually leads to a significant performance drop when applying a model trained on $\mathcal{S}$ to the new target domain $\mathcal{T}$. Many works have been proposed to address the domain shift for different vision applications. A group of recent works aims to reduce the distribution difference at the feature level by learning domain-invariant features [11, 16, 28, 14], while others work at the image level to transfer source images to mimic the target domain style [58, 35, 59, 24, 1, 6].

In this work, we also propose to address the domain shift problem at the image level. However, different from existing works that focus on transferring source images into only the target domain, we instead transfer them into all intermediate domains that connect the source and target domains. This is partially motivated by the previous works [16, 14], which have shown that the intermediate domains between source and target domains are useful for addressing the domain adaptation problem.

In the following, we first briefly review the conventional image-to-image translation model CycleGAN. Then, we formulate the intermediate domain adaptation problem based on the data distribution distance. Next, we develop our DLOW model based on the CycleGAN model. We then show the benefits of our DLOW model for two applications: 1) how to improve existing domain adaptation models with the images generated by the DLOW model, and 2) how to transfer images into arbitrarily mixed styles when there are multiple target domains.

3.2 The CycleGAN Model

We build our model upon the state-of-the-art CycleGAN model [58], which was proposed for unpaired image-to-image translation. Formally, the CycleGAN model learns two mappings between $\mathcal{S}$ and $\mathcal{T}$, i.e., $G_{ST}: \mathcal{S} \to \mathcal{T}$, which transfers the images in $\mathcal{S}$ into the style of $\mathcal{T}$, and $G_{TS}: \mathcal{T} \to \mathcal{S}$, which acts in the inverse direction. We take the $\mathcal{S} \to \mathcal{T}$ direction as an example to explain CycleGAN.

To transfer source images into the target style while preserving the semantics, CycleGAN employs an adversarial training module and a reconstruction module, respectively. In particular, the adversarial training module aligns the image distributions of the two domains, such that the style of the mapped images matches the target domain. Let us denote $D_T$ as the discriminator, which attempts to distinguish the translated images from the target images. Then the objective function of the adversarial training module can be written as,

$\mathcal{L}_{adv}(G_{ST}, D_T) = \mathbb{E}_{x^t \sim P_T}\left[\log D_T(x^t)\right] + \mathbb{E}_{x^s \sim P_S}\left[\log\left(1 - D_T(G_{ST}(x^s))\right)\right].$ (1)

Moreover, the reconstruction module ensures that the mapped image $G_{ST}(x^s)$ preserves the semantic content of the original image $x^s$. This is realized by enforcing a cycle consistency loss such that $x^s$ can be recovered when $G_{ST}(x^s)$ is mapped back to the source style, i.e.,

$\mathcal{L}_{rec}(G_{ST}, G_{TS}) = \mathbb{E}_{x^s \sim P_S}\left[\left\| G_{TS}(G_{ST}(x^s)) - x^s \right\|_1\right].$ (2)

Similar modules are also applied to the $\mathcal{T} \to \mathcal{S}$ direction. By jointly optimizing all modules, the CycleGAN model is able to transfer source images into the target style and vice versa.
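
As a concrete illustration, the following PyTorch-style sketch computes the two CycleGAN losses described above. The generators G_st, G_ts and the discriminator D_t are placeholders for the actual CycleGAN networks, and the binary cross-entropy form of the adversarial loss is an assumption made for readability.

```python
import torch
import torch.nn.functional as F

def cyclegan_losses(G_st, G_ts, D_t, x_s, x_t):
    """Sketch of the S->T adversarial loss (Eq. 1) and cycle consistency loss (Eq. 2).
    G_st, G_ts, D_t are assumed callables mapping image batches to images / logits."""
    fake_t = G_st(x_s)                       # translate source images to the target style

    # Adversarial loss: D_t separates real target images from translated ones
    logits_real, logits_fake = D_t(x_t), D_t(fake_t)
    loss_adv = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real)) +
                F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))

    # Cycle consistency (Eq. 2): mapping back to the source style must recover the input
    loss_cyc = F.l1_loss(G_ts(fake_t), x_s)
    return loss_adv, loss_cyc
```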

Figure 2: Illustration of domain flow. Many possible paths (the green dashed lines) connect the source and target domains, while the domain flow is the shortest one (the red line). An intermediate domain (the blue dot) is the point on the domain flow that keeps the proper relative distances to the two domains.
Figure 3: Our DLOW model consists of three modules: (a) the generator $G_{ST}$ takes the domainness $z$ as an additional input to control the image translation; (b) the reverse generator $G_{TS}$ reconstructs the source image from the translation; (c) the domainness $z$ is used as the target value for learning the regressor. The three modules are trained jointly.

3.3 Modeling Intermediate Domains

In our task, we aim to translate the source images not only into the target domain, but also into all intermediate domains that connect the source and target domains. In particular, let us denote an intermediate domain as $\mathcal{M}^{(z)}$, where $z \in [0, 1]$ is a continuous variable that models the relatedness to the source and target domains. We refer to $z$ as the domainness of the intermediate domain. When $z = 0$, the intermediate domain is identical to the source domain $\mathcal{S}$; and when $z = 1$, it is identical to the target domain $\mathcal{T}$. By varying $z$ in the range of $[0, 1]$, we thus obtain a sequence of intermediate domains that flows from $\mathcal{S}$ to $\mathcal{T}$.

There are many possible paths to connect the source and target domains. As shown in Fig 2, assume there is a manifold of domains, where a domain with a given data distribution can be seen as a point residing on the manifold. We expect the domain flow to be the shortest geodesic path connecting $\mathcal{S}$ and $\mathcal{T}$. Moreover, for any given $z$, the distance from $\mathcal{S}$ to $\mathcal{M}^{(z)}$ should be proportional to the distance between $\mathcal{S}$ and $\mathcal{T}$ by the value of $z$. Denoting the data distribution of $\mathcal{M}^{(z)}$ as $P_M^{(z)}$, we expect that

$\frac{dist(P_S, P_M^{(z)})}{dist(P_M^{(z)}, P_T)} = \frac{z}{1 - z},$ (3)

where $dist(\cdot, \cdot)$ is a valid distance measure between two distributions. Thus, generating an intermediate domain for a given $z$ becomes finding the point $\mathcal{M}^{(z)}$ satisfying Eq. (3) that is closest to $\mathcal{S}$ and $\mathcal{T}$, which leads to minimizing the following loss,

$\mathcal{L}(z) = (1 - z)\, dist(P_S, P_M^{(z)}) + z\, dist(P_M^{(z)}, P_T).$ (4)

As shown in [2], many types of distances have been exploited for image generation and image translation. The adversarial loss in Eq. (1) can be seen as a lower bound of the Jensen-Shannon divergence, and we also use it to measure the distribution distance in this work.
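
To make the two conditions concrete, the following purely illustrative instance plugs a domainness of $z = 1/4$ into Eqs. (3)-(4):

```latex
% Illustrative instance of Eqs. (3)-(4) for z = 1/4:
% the intermediate domain should lie three times closer to the source than to the target,
\frac{dist(P_S, P_M^{(z)})}{dist(P_M^{(z)}, P_T)} = \frac{z}{1-z} = \frac{1/4}{3/4} = \frac{1}{3},
% and the weighted objective accordingly emphasizes the source-side distance:
\mathcal{L}(z) = \tfrac{3}{4}\, dist(P_S, P_M^{(z)}) + \tfrac{1}{4}\, dist(P_M^{(z)}, P_T).
```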

3.4 The DLOW Model

We now develop our DLOW model to generate intermediate domains. Given a source image $x^s$ and a domainness parameter $z$, our task is to transfer $x^s$ into the intermediate domain $\mathcal{M}^{(z)}$ with the distribution $P_M^{(z)}$ that minimizes the objective in Eq. (4). We take the $\mathcal{S} \to \mathcal{T}$ direction as an example; the other direction can be handled similarly.

In our DLOW model, the generator no longer directly transfers $x^s$ to the target domain $\mathcal{T}$, but instead moves it towards $\mathcal{T}$. The length of such a move is controlled by the domainness variable $z$. Let us denote $\mathcal{Z} = [0, 1]$ as the domain of $z$; then the generator in our DLOW model can be represented as $G_{ST}: \mathcal{S} \times \mathcal{Z} \to \mathcal{M}^{(z)}$, where the input is the joint space of $\mathcal{S}$ and $\mathcal{Z}$.

Adversarial Loss: As discussed in Section 3.3, we employ the adversarial loss as the distribution distance measure to control the relatedness of an intermediate domain to the source and target domains. Specifically, we introduce two discriminators, $D_S$ to distinguish $\mathcal{M}^{(z)}$ from $\mathcal{S}$, and $D_T$ to distinguish $\mathcal{M}^{(z)}$ from $\mathcal{T}$, respectively. Then, the adversarial losses between $\mathcal{M}^{(z)}$ and $\mathcal{S}$, and between $\mathcal{M}^{(z)}$ and $\mathcal{T}$, can be written respectively as,

$\mathcal{L}_{adv}^S = \mathbb{E}_{x^s \sim P_S}\left[\log D_S(x^s)\right] + \mathbb{E}_{x^s \sim P_S}\left[\log\left(1 - D_S(G_{ST}(x^s, z))\right)\right],$ (5)

$\mathcal{L}_{adv}^T = \mathbb{E}_{x^t \sim P_T}\left[\log D_T(x^t)\right] + \mathbb{E}_{x^s \sim P_S}\left[\log\left(1 - D_T(G_{ST}(x^s, z))\right)\right].$ (6)

By using the above losses to model $dist(P_S, P_M^{(z)})$ and $dist(P_M^{(z)}, P_T)$ in Eq. (4), we arrive at the following loss,

$\mathcal{L}_{adv} = (1 - z)\, \mathcal{L}_{adv}^S + z\, \mathcal{L}_{adv}^T.$ (7)

Image Cycle Consistency Loss: As in CycleGAN, we also apply a cycle consistency loss to ensure that the semantic content is well preserved in the translated images. Let us denote $G_{TS}$ as the generator in the other direction, which transfers a sample from the target domain towards the source domain by an interval of $z$. Since $G_{TS}$ acts in an inverse way to $G_{ST}$, we can use it to recover $x^s$ from the translated version $G_{ST}(x^s, z)$, which gives the following loss,

$\mathcal{L}_{cyc} = \mathbb{E}_{x^s \sim P_S}\left[\left\| G_{TS}(G_{ST}(x^s, z), z) - x^s \right\|_1\right].$ (8)

Domainness Cycle Consistency Loss: To guarantee that the translated image correctly encodes the information of the domainness parameter $z$, we introduce a regressor $R$ to reconstruct the domainness parameter from the image. In particular, $R$ is expected to output $0$ for source images, $1$ for target images, and $z$ for images in $\mathcal{M}^{(z)}$. We use the cross-entropy loss for source and target images, and the squared loss for $\mathcal{M}^{(z)}$, and arrive at the domainness cycle consistency loss as follows,

$\mathcal{L}_{dom} = -\mathbb{E}_{x^s \sim P_S}\left[\log\left(1 - R(x^s)\right)\right] - \mathbb{E}_{x^t \sim P_T}\left[\log R(x^t)\right] + \mathbb{E}_{x^s \sim P_S}\left[\left\| R(G_{ST}(x^s, z)) - z \right\|_2^2\right].$ (9)

Full Objective: Integrating the losses defined above, the full objective can be defined as:

$\mathcal{L}_{ST} = \mathcal{L}_{adv} + \lambda_1 \mathcal{L}_{cyc} + \lambda_2 \mathcal{L}_{dom},$ (10)

where $\lambda_1$ and $\lambda_2$ are hyper-parameters used to balance the adversarial loss, the image cycle consistency loss, and the domainness cycle consistency loss during training.

A similar loss can be defined for the other direction $\mathcal{T} \to \mathcal{S}$. Due to the usage of the adversarial loss $\mathcal{L}_{adv}$, training is performed in an alternating manner: we first minimize the full objective with respect to the generators and regressors, and then maximize it with respect to the discriminators.
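
A minimal PyTorch-style sketch of the generator-side objective in Eq. (10) is given below, under the notation assumed above. G_st, G_ts, D_s, D_t and R are placeholders for the domainness-conditioned generators, the two discriminators, and the domainness regressor; their interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def dlow_generator_loss(G_st, G_ts, D_s, D_t, R, x_s, z, lambda_img=10.0, lambda_dom=1.0):
    """Generator-side DLOW objective for the S->T direction (Eqs. 7-10), as a sketch.
    x_s: batch of source images, z: scalar domainness in [0, 1] for this batch."""
    fake = G_st(x_s, z)                              # image moved towards T by an interval z

    # Weighted adversarial terms (Eq. 7): the generator tries to fool both discriminators
    logit_s, logit_t = D_s(fake), D_t(fake)
    adv_s = F.binary_cross_entropy_with_logits(logit_s, torch.ones_like(logit_s))
    adv_t = F.binary_cross_entropy_with_logits(logit_t, torch.ones_like(logit_t))
    loss_adv = (1.0 - z) * adv_s + z * adv_t

    # Image cycle consistency (Eq. 8): map back towards S to recover the input
    loss_cyc = F.l1_loss(G_ts(fake, z), x_s)

    # Domainness cycle consistency (Eq. 9, intermediate-domain term): R should recover z
    pred_z = R(fake)
    loss_dom = F.mse_loss(pred_z, torch.full_like(pred_z, z))

    return loss_adv + lambda_img * loss_cyc + lambda_dom * loss_dom
```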

Implementation: We illustrate the network structure of the $\mathcal{S} \to \mathcal{T}$ direction of our DLOW model in Fig 3. A figure of the complete model is provided in the Appendix. First, the domainness parameter $z$ is taken as an input of the generator $G_{ST}$. This is implemented with the Conditional Instance Normalization (CN) layer [1, 23]: we first use one deconvolution layer to map the domainness parameter $z$ to a vector, and then use this vector as the input to the CN layer. Moreover, the domainness parameter also plays the role of weighting the discriminators to balance the relatedness of the generated images to the different domains. It is also used as an input in the image cycle consistency module, as well as the label for the domainness cycle consistency module. During the training phase, we randomly generate the domainness parameter $z$ for each input image. Inspired by [x], we force the domainness parameter $z$ to obey a beta distribution $B(\alpha, \beta)$, where one shape parameter is fixed and the other is a function of the training step $t/N$, with $t$ being the current iteration and $N$ being the total number of iterations. In this way, $z$ tends to be sampled with small values at the beginning of training and gradually shifts to larger values towards the end, which gives slightly more stable training than uniform sampling.
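
The following sketch shows one plausible way to realize the two implementation details above, namely a conditional instance normalization layer driven by the domainness and a progressively shifting beta sampling schedule. The layer sizes and the exact schedule are illustrative assumptions, not the paper's settings.

```python
import numpy as np
import torch
import torch.nn as nn

class DomainnessCIN(nn.Module):
    """Conditional instance normalization conditioned on the domainness z (sketch):
    z is embedded into a vector, which predicts a per-channel scale and shift."""
    def __init__(self, num_channels, embed_dim=16):
        super().__init__()
        self.embed = nn.Linear(1, embed_dim)      # stands in for the layer mapping z to a vector
        self.to_scale = nn.Linear(embed_dim, num_channels)
        self.to_shift = nn.Linear(embed_dim, num_channels)
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)

    def forward(self, feat, z):
        # feat: (B, C, H, W) feature map, z: (B,) domainness values
        e = torch.relu(self.embed(z.view(-1, 1)))
        scale = self.to_scale(e).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(e).unsqueeze(-1).unsqueeze(-1)
        return self.norm(feat) * (1.0 + scale) + shift

def sample_domainness(step, total_steps, alpha=1.0):
    """Beta-distribution schedule (assumed form): the second shape parameter decays with
    training progress, so z concentrates near 0 early on and spreads towards 1 later."""
    beta = 1.0 + 3.0 * (1.0 - step / float(total_steps))
    return float(np.random.beta(alpha, beta))
```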

3.5 Boosting Domain Adaptation Models

With the DLOW model, we are able to translate each source image into an arbitrary intermediate domain $\mathcal{M}^{(z)}$. Let us denote the source dataset as $\{(x^s_i, y^s_i)\}_{i=1}^{N}$, where $y^s_i$ is the label of $x^s_i$. By feeding in each image $x^s_i$ combined with a domainness $z_i$ randomly sampled from the uniform distribution $U(0, 1)$, we obtain a translated dataset $\tilde{\mathcal{S}} = \{(\tilde{x}^s_i, y^s_i)\}_{i=1}^{N}$, where $\tilde{x}^s_i$ is the translated version of $x^s_i$. The images in $\tilde{\mathcal{S}}$ spread along the domain flow from the source to the target domain and therefore become much more diverse. Using $\tilde{\mathcal{S}}$ as training data helps to learn domain-invariant models for computer vision tasks. In Section 4.1, we demonstrate that a model trained on $\tilde{\mathcal{S}}$ achieves good performance on the cross-domain semantic segmentation problem.

Moreover, the translated dataset $\tilde{\mathcal{S}}$ can also be used to boost existing adversarial-training-based domain adaptation approaches. Images in $\tilde{\mathcal{S}}$ fill the gap between the source and target domains, and thus ease the domain adaptation task. Taking semantic segmentation as an example, a typical approach is to append a discriminator to the segmentation model, which is used to distinguish the source and target samples. By using the adversarial training strategy to optimize the discriminator and the segmentation model, the segmentation model is trained to be more domain-invariant.

As shown in Fig 4, we replace the source dataset with the translated version $\tilde{\mathcal{S}}$, and apply a domainness-dependent weight to the adversarial loss. The motivation is as follows: for each sample $\tilde{x}^s_i$, if the domainness $z_i$ is higher, the sample is closer to the target domain, so the weight of its adversarial loss can be reduced; otherwise, the loss weight should be increased.
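
A sketch of this domainness-weighted adversarial term is given below. The per-sample weight (1 - z) is our reading of the rule above (higher domainness, lower weight), and the patch-shaped discriminator output is an assumption.

```python
import torch
import torch.nn.functional as F

def weighted_adv_loss(d_logits, z, target_is_real=True):
    """Adversarial loss for translated source samples, weighted by their domainness.
    d_logits: (B, 1, H, W) discriminator outputs, z: (B,) domainness of each sample."""
    labels = torch.ones_like(d_logits) if target_is_real else torch.zeros_like(d_logits)
    per_pixel = F.binary_cross_entropy_with_logits(d_logits, labels, reduction='none')
    per_sample = per_pixel.flatten(1).mean(dim=1)     # one scalar loss per sample
    return ((1.0 - z) * per_sample).mean()            # samples closer to the target count less
```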

Figure 4: Intermediate domain images are used as source dataset, and the adversarial loss is weighted by domainness.

3.6 Style Generalization

Most existing image-to-image translation works learn a deterministic mapping between two domains. After learning the model, source images can only be translated into a fixed style. In contrast, our DLOW model takes a random domainness $z$ to translate images into various styles. When multiple target domains are provided, it is also able to transfer the source image into a mixture of different target styles. In other words, we are able to generalize to an unseen intermediate domain that is related to the existing domains.

In particular, suppose we have $K$ target domains, denoted as $\mathcal{T}_1, \ldots, \mathcal{T}_K$. Accordingly, the domainness variable is expanded into a $K$-dim vector $\mathbf{z} = [z_1, \ldots, z_K]^T$ with $\sum_{k=1}^{K} z_k = 1$. Each element $z_k$ represents the relatedness to the $k$-th target domain. To map an image from the source domain to the intermediate domain defined by $\mathbf{z}$, we need to optimize the following objective,

$\mathcal{L}(\mathbf{z}) = \sum_{k=1}^{K} z_k\, dist(P_M^{(\mathbf{z})}, P_{T_k}),$ (11)

where $P_M^{(\mathbf{z})}$ is the distribution of the intermediate domain and $P_{T_k}$ is the distribution of $\mathcal{T}_k$. The network structure of our DLOW model can be easily adjusted to optimize the above objective. We leave the details to the Appendix due to the space limitation.
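
The sketch below shows how the objective in Eq. (11) can be assembled from K target-domain discriminators, and how a valid domainness vector on the simplex can be drawn. The Dirichlet sampling is one convenient choice for satisfying the sum-to-one constraint, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def multi_target_adv_loss(discriminators, fake, z_vec):
    """Weighted sum of adversarial losses against K target domains (Eq. 11, sketch).
    discriminators: list of K discriminator networks, z_vec: K-dim domainness vector."""
    loss = fake.new_zeros(())
    for d_k, z_k in zip(discriminators, z_vec):
        logit = d_k(fake)
        loss = loss + z_k * F.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
    return loss

def sample_domainness_vector(num_targets=4):
    """Random domainness vector with non-negative entries summing to one."""
    return torch.distributions.Dirichlet(torch.ones(num_targets)).sample()
```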

Figure 5: Examples of intermediate domain images from GTA5 to Cityscapes.
GTA5 → Cityscapes

Method | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorbike | bicycle | mIoU
NonAdapt [50] | 75.8 | 16.8 | 77.2 | 12.5 | 21.0 | 25.5 | 30.1 | 20.1 | 81.3 | 24.6 | 70.3 | 53.8 | 26.4 | 49.9 | 17.2 | 25.9 | 6.5 | 25.3 | 36.0 | 36.6
CycleGAN [19] | 81.7 | 27.0 | 81.7 | 30.3 | 12.2 | 28.2 | 25.5 | 27.4 | 82.2 | 27.0 | 77.0 | 55.9 | 20.5 | 82.8 | 30.8 | 38.4 | 0.0 | 18.8 | 32.3 | 41.0
DLOW (z=1) | 88.5 | 33.7 | 80.7 | 26.9 | 15.7 | 27.3 | 27.7 | 28.3 | 80.9 | 26.6 | 74.1 | 52.6 | 25.1 | 76.8 | 30.5 | 27.2 | 0.0 | 15.7 | 36.0 | 40.7
DLOW | 87.1 | 33.5 | 80.5 | 24.5 | 13.2 | 29.8 | 29.5 | 26.6 | 82.6 | 26.7 | 81.8 | 55.9 | 25.3 | 78.0 | 33.5 | 38.7 | 0.0 | 22.9 | 34.5 | 42.3

Table 1: Results of semantic segmentation (per-class IoU and mIoU, %) on the Cityscapes dataset, based on the DeepLab-v2 model with ResNet-101 backbone and trained using the images translated with different models.
Method | Cityscapes | KITTI | WildDash | BDD100K
Original [50] | 42.4 | 30.7 | 18.9 | 37.0
DLOW | 44.8 | 36.6 | 24.9 | 39.1

Table 2: Comparison of the performance (mIoU, %) of AdaptSegNet [50] when using the original source images and the intermediate domain images translated with our DLOW model for semantic segmentation, under the domain adaptation (1st column) and domain generalization (2nd to 4th columns) scenarios.

4 Experiments

In this section, we demonstrate the benefits of our DLOW model on two tasks. In the first task, we address the domain adaptation problem and train our DLOW model to generate intermediate domain samples to boost the domain adaptation performance. In the second task, we consider the style generalization problem and train our DLOW model to transfer images into new styles that are unseen in the training data.

4.1 Domain Adaptation and Generalization

4.1.1 Experiments Setup

For the domain adaptation problem, we follow [20, 19, 5, 61] and conduct experiments on urban scene semantic segmentation, learning from synthetic data for the real scenario. The GTA5 dataset [45] is used as the source domain and the Cityscapes dataset [7] as the target domain. Moreover, we also evaluate the generalization ability of the learnt segmentation models to unseen domains, for which we take the KITTI [12], WildDash [55] and BDD100K [54] datasets as additional unseen datasets for evaluation.

Cityscapes is a dataset of urban scene images taken in European cities. We use the training images without annotation as unlabeled target samples in the training phase, and the 500 validation images, which are densely labelled with 19 classes, for evaluation.

GTA5 is a dataset consisting of densely labelled synthetic frames generated from a computer game whose scenes are based on the city of Los Angeles. The annotations of the images are compatible with Cityscapes.

KITTI is a dataset consisting of images taken in the mid-size city of Karlsruhe. We use the 200 densely labelled validation images, whose annotations are compatible with Cityscapes.

WildDash is a dataset covering images from different sources, different environments (place, weather, time, etc.) and different camera characteristics. We use the 70 labelled validation images, whose annotations are compatible with Cityscapes.

BDD100K is a driving dataset covering diverse images taken in the US, whose label maps use the training indices specified in Cityscapes. We use the densely labelled images for validation in our experiment.

In this task, we first train our proposed DLOW model using the GTA5 dataset as the source domain and Cityscapes as the target domain. Then, we generate a translated GTA5 dataset with the learnt DLOW model: each source image is fed into DLOW with a random domainness variable $z \sim U(0, 1)$. The new translated GTA5 dataset contains exactly the same number of images as the original one, but the styles of the images randomly drift from the synthetic style to the real style. We then use the translated GTA5 dataset as the new source domain for training segmentation models.
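
The translated dataset can be produced with a simple offline loop like the sketch below; `generator` stands for the trained domainness-conditioned generator $G_{ST}$, and the file handling and preprocessing are illustrative assumptions.

```python
import os
import random
import torch
from PIL import Image
from torchvision import transforms

def translate_dataset(generator, image_paths, out_dir, device='cuda'):
    """Translate each source image into a random intermediate domain (z ~ U(0, 1))."""
    to_tensor, to_image = transforms.ToTensor(), transforms.ToPILImage()
    os.makedirs(out_dir, exist_ok=True)
    generator.eval()
    with torch.no_grad():
        for path in image_paths:
            x = to_tensor(Image.open(path).convert('RGB')).unsqueeze(0).to(device)
            z = torch.tensor([random.random()], device=device)   # one random domainness per image
            fake = generator(x, z).clamp(0, 1).squeeze(0).cpu()
            to_image(fake).save(os.path.join(out_dir, os.path.basename(path)))
```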

We implement our DLOW model based on Augmented CycleGAN [1] and CyCADA [19]. Following their setup, the images are resized and cropped to the same resolutions used in those works. When training the DLOW model, the image cycle loss weight is 10 and the domainness cycle loss weight is 1. The learning rate is fixed at 0.0002. For the segmentation network, we use the AdaptSegNet model [50], which is based on DeepLab-v2 [4] with a ResNet-101 [17] backbone. The training images are resized following the AdaptSegNet setup, and we follow exactly the same training policy as in AdaptSegNet.

4.1.2 Experimental Results

Intermediate Domain Images: To verify the ability of our DLOW model to generate intermediate domain images, after learning the model, we fix the input source image and vary the domainness parameter $z$ from 0 to 1. A few examples are shown in Fig 5. It can be observed that the styles of the translated images gradually shift from the synthetic style of GTA5 to the real style of Cityscapes, which demonstrates that the DLOW model is capable of modeling the domain flow that bridges the source and target domains as expected.
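
This qualitative sweep can be reproduced with a loop like the one below, which keeps the input image fixed and varies z on a grid from 0 to 1; `generator` again denotes the trained domainness-conditioned generator, with its interface assumed as above.

```python
import torch

def domain_flow(generator, x, num_steps=5):
    """Translate one image x into a flow of intermediate domains by sweeping z in [0, 1]."""
    generator.eval()
    outputs = []
    with torch.no_grad():
        for i in range(num_steps):
            z = torch.tensor([i / (num_steps - 1)], device=x.device)
            outputs.append(generator(x, z))
    return outputs   # styles range from synthetic (z = 0) to real (z = 1)
```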

Figure 6: Examples of style generalization results. The vector above each image is its domainness vector.

Cross-Domain Semantic Segmentation: We further evaluate the usefulness of intermediate domain images in two settings. In the first setting, we compare with the CycleGAN model [58], which was used in the CyCADA approach [19] for pixel-level domain adaptation. The difference between CycleGAN and our DLOW model is that CycleGAN transfers source images to mimic only the target style, while our DLOW model transfers source images into random styles flowing from the source domain to the target domain. We first obtain a translated version of the GTA5 dataset with each model. Then, we respectively use the two translated GTA5 datasets to train DeepLab-v2 models, which are evaluated on the Cityscapes dataset for semantic segmentation. We also include the "NonAdapt" baseline, which uses the original GTA5 images as training data, as well as a special case of our approach, "DLOW(z=1)", where we set $z = 1$ for all source images when performing the image translation with the learnt DLOW model.

The results are shown in Table 1. We observe that all pixel-level adaptation methods outperform the "NonAdapt" baseline, which verifies that image translation is helpful for training models for cross-domain semantic segmentation. Moreover, "DLOW(z=1)" is a special case of our model that directly translates source images into the target domain, which unsurprisingly gives results comparable to the CyCADA-pixel method (40.7 vs. 41.0 mIoU). By further using intermediate domain images, our full DLOW model improves the segmentation result to 42.3 mIoU, which demonstrates that intermediate domain images are helpful for learning a more robust, domain-invariant segmentation model.

In the second setting, we further use intermediate domain images to improve a feature-level domain adaptation model. We conduct experiments based on the AdaptSegNet method [50], which is open source and has reported state-of-the-art results for GTA5 → Cityscapes. It consists of multiple levels of adversarial training, and we augment each level with the loss weight discussed in Section 3.5. The results are reported in Table 2. The "Original" method denotes the AdaptSegNet model trained using GTA5 as the source domain, for which the results are obtained using their released pretrained model. The "DLOW" method is AdaptSegNet trained using the dataset translated with our DLOW model. From the first column, we observe that the intermediate domain images improve the AdaptSegNet model from 42.4 to 44.8 mIoU. More interestingly, the AdaptSegNet model trained with DLOW-translated images also exhibits excellent domain generalization ability when applied to unseen domains: it achieves significantly better results than the original AdaptSegNet model on the KITTI, WildDash and BDD100K datasets, as reported in the second to fourth columns, respectively. This shows that intermediate domain images are useful for improving the model's cross-domain generalization ability.

4.2 Style Generalization

We conduct the style generalization experiment on the Photo to Artworks dataset [58], which consists of real photographs and artworks from Monet, Cezanne, Van Gogh and Ukiyo-e. We use the real photographs as the source domain, and the remaining four as target domains. As discussed in Section 3.6, the domainness variable in this experiment is expanded into a vector $\mathbf{z} = [z_1, z_2, z_3, z_4]^T$ meeting the condition $\sum_{k=1}^{4} z_k = 1$, where $z_1$, $z_2$, $z_3$ and $z_4$ correspond to Monet, Van Gogh, Ukiyo-e and Cezanne, respectively. Each element can be seen as how much the corresponding style contributes to the final mixed style. For example, $\mathbf{z} = [1, 0, 0, 0]^T$, $[0.5, 0.5, 0, 0]^T$ and $[0.25, 0.25, 0.25, 0.25]^T$ represent the pure Monet style, an equal mixture of the Monet and Van Gogh styles, and an equal mixture of all four styles, respectively. In every 5 steps of training, we set the domainness vector to each of the four pure-style vectors and to a uniformly distributed random vector, in turn. The weight of the cycle loss in training is 10 and the learning rate is set to 0.002. We train the model for 54 epochs. The qualitative results of the style generalization are shown in Fig 6. They show that our DLOW model can translate a photo into artworks of the corresponding styles. When varying the values of the domainness vector, we can also successfully produce new styles related to the different painting styles, which demonstrates the good generalization ability of our model to unseen domains. Note that, different from the works [57, 23], we do not need any reference image in the test phase, and the domainness vector can be changed instantly to generate images in different new styles. We provide more examples in the Appendix.

5 Conclusion

In this paper, we have presented the DLOW model to generate intermediate domains for bridging different domains. The model takes a domainness variable $z$ (or a domainness vector $\mathbf{z}$) as a conditional input, and transfers images into the intermediate domain controlled by $z$ (or $\mathbf{z}$). We demonstrate the benefits of our DLOW model in two scenarios. First, for the cross-domain semantic segmentation task, our DLOW model improves the performance of pixel-level domain adaptation by taking the translated images in intermediate domains as training data. Second, our DLOW model also exhibits excellent style generalization ability for image translation, and we are able to transfer images into new styles that are unseen in the training data. Extensive experiments on benchmark datasets have verified the effectiveness of our proposed model.

References

  • [1] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151, 2018.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • [3] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pages 769–776, 2013.
  • [4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
  • [5] Y. Chen, W. Li, and L. Van Gool. Road: Reality oriented adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7892–7901, 2018.
  • [6] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [8] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. Proc. of ICLR, 2017.
  • [9] A. Dundar, M.-Y. Liu, T.-C. Wang, J. Zedlewski, and J. Kautz. Domain stylization: A strong, simple baseline for synthetic to real image domain adaptation. arXiv preprint arXiv:1807.09384, 2018.
  • [10] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE international conference on computer vision, pages 2960–2967, 2013.
  • [11] Y. Ganin and V. S. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
  • [12] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [13] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision, pages 2551–2559, 2015.
  • [14] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
  • [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [16] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 999–1006. IEEE, 2011.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [18] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen. Arbitrary facial attribute editing: Only change what you want. arXiv preprint arXiv:1711.10678, 2017.
  • [19] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1989–1998. PMLR, 2018.
  • [20] J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
  • [21] W. Hong, Z. Wang, M. Yang, and J. Yuan. Conditional generative adversarial network for structured domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2018.
  • [22] H. Huang, Q. Huang, and P. Krähenbühl. Domain transfer through deep activation matching. In European Conference on Computer Vision, pages 611–626. Springer, 2018.
  • [23] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
  • [24] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
  • [25] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [26] I.-H. Jhuo, D. Liu, D. Lee, and S.-F. Chang. Robust visual domain adaptation with low-rank reconstruction. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2168–2175. IEEE, 2012.
  • [27] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1857–1865, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [28] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2452–2460, 2015.
  • [29] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1785–1792. IEEE, 2011.
  • [30] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, 2018.
  • [31] H. Li, S. J. Pan, S. Wang, and A. C. Kot. Domain generalization with adversarial feature learning. In CVPR, 2018.
  • [32] W. Li, Z. Xu, D. Xu, D. Dai, and L. Van Gool. Domain generalization and adaptation using low rank exemplar svms. IEEE transactions on pattern analysis and machine intelligence, 40(5):1114–1127, 2018.
  • [33] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 624–639, 2018.
  • [34] J. Lin, Y. Xia, T. Qin, Z. Chen, and T.-Y. Liu. Conditional image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)(July 2018), 2018.
  • [35] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 700–708. Curran Associates, Inc., 2017.
  • [36] Y. Lu, Y.-W. Tai, and C.-K. Tang. Conditional cyclegan for attribute guided face image generation. arXiv preprint arXiv:1705.09966, 2017.
  • [37] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. arXiv preprint arXiv:1809.09478, 2018.
  • [38] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Unified deep supervised domain adaptation and generalization. In The IEEE International Conference on Computer Vision (ICCV), volume 2, page 3, 2017.
  • [39] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 10–18, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
  • [40] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. arXiv preprint arXiv:1712.00479, 13, 2017.
  • [41] L. Niu, W. Li, and D. Xu. Multi-view domain generalization for visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 4193–4201, 2015.
  • [42] L. Niu, W. Li, and D. Xu. Visual recognition by learning from web data: A weakly supervised domain generalization approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2783, 2015.
  • [43] X. Pan, P. Luo, J. Shi, and X. Tang. Two at once: Enhancing learning and generalization capacities via ibn-net. In ECCV, 2018.
  • [44] X. Peng, B. Usman, N. Kaushik, D. Wang, J. Hoffman, K. Saenko, X. Roynard, J.-E. Deschaud, F. Goulette, T. L. Hayes, et al. Visda: A synthetic-to-real benchmark for visual domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2021–2026, 2018.
  • [45] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, European Conference on Computer Vision (ECCV), volume 9906 of LNCS, pages 102–118. Springer International Publishing, 2016.
  • [46] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. arXiv preprint arXiv:1712.02560, 3, 2017.
  • [47] F. S. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, and J. M. Alvarez. Effective use of synthetic data for urban scene semantic segmentation. In European Conference on Computer Vision, pages 86–103. Springer, Cham, 2018.
  • [48] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa. Unsupervised domain adaptation for semantic segmentation with gans. arXiv preprint arXiv:1711.06969, 2017.
  • [49] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [50] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [51] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
  • [52] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [53] Z. Yi, H. R. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, pages 2868–2876, 2017.
  • [54] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
  • [55] O. Zendel, K. Honauer, M. Murschitz, D. Steininger, and G. Fernandez Dominguez. Wilddash - creating hazard-aware benchmarks. In The European Conference on Computer Vision (ECCV), September 2018.
  • [56] Y. Zhang, Z. Qiu, T. Yao, D. Liu, and T. Mei. Fully convolutional adaptation networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6810–6818, 2018.
  • [57] Y. Zhang, Y. Zhang, and W. Cai. A unified framework for generalizable style transfer: Style and content separation. arXiv preprint arXiv:1806.05173, 2018.
  • [58] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.
  • [59] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
  • [60] X. Zhu, H. Zhou, C. Yang, J. Shi, and D. Lin. Penalizing top performers: Conservative loss for semantic segmentation adaptation. arXiv preprint arXiv:1809.00903, 2018.
  • [61] Y. Zou, Z. Yu, B. V. Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pages 289–305, 2018.

6 Appendix

In this Appendix, we provide additional information on:

  • the complete pipeline of our proposed DLOW model,

  • the detailed network structure of our DLOW model for style generalization with four target domains,

  • more examples for style generalization.

6.1 Pipeline of the DLOW model

In Section 3.4 of the main paper, we introduced the three modules of the DLOW model by taking the $\mathcal{S} \to \mathcal{T}$ direction as an example. Those three modules were illustrated separately for clarity. In this Appendix, we further present the complete pipeline for a better illustration of our DLOW model. By combining the three modules of the $\mathcal{S} \to \mathcal{T}$ direction and supplementing the structure of the $\mathcal{T} \to \mathcal{S}$ direction, the complete model is shown in Fig 7. Taking the $\mathcal{S} \to \mathcal{T}$ direction as an example (c.f. Fig 7(a)), the orange dotted box shows the adversarial loss module which was illustrated in Fig 3a of the main paper. Correspondingly, the green dotted box and the purple dotted box are the image reconstruction module and the domainness reconstruction module, which were illustrated in Fig 3b and Fig 3c of the main paper, respectively. The structure of the other direction $\mathcal{T} \to \mathcal{S}$ is presented in Fig 7(b), and is symmetric to the $\mathcal{S} \to \mathcal{T}$ direction.

6.2 Network Structure for Style Generalization

In Section 3.6 of the main paper, we explained that our DLOW model can be adapted for style generalization when multiple target domains are available. We present the details in this section. The network structure of our DLOW model for style generalization is shown in Fig 8, where we have four target domains, each of which represents an image style. For the $\mathcal{S} \to \mathcal{T}$ direction, shown in Fig 8(a), the style generalization model consists of three modules: the adversarial module, the image reconstruction module and the domainness reconstruction module. For each target domain $\mathcal{T}_k$, there is one corresponding discriminator measuring the distribution distance between the source domain and the target domain $\mathcal{T}_k$. Accordingly, the domainness variable is expanded into a 4-dim vector $\mathbf{z}$, and the output of the regressor is expanded to multiple dimensions to reconstruct the domainness vector. For the other direction $\mathcal{T} \to \mathcal{S}$, shown in Fig 8(b), the adversarial module is similar to that of the $\mathcal{S} \to \mathcal{T}$ direction. However, the image reconstruction module is slightly different, since the image reconstruction loss should be weighted by the domainness vector $\mathbf{z}$.

6.3 Additional Results for Style Generalization

We provided two examples of style generalization in Fig 6 of the main paper. Here we provide more experimental results in Fig 9, Fig 10 and Fig 11. The images with red bounding boxes are translated images in the four target domains, i.e., Monet, Van Gogh, Cezanne, and Ukiyo-e; these can be considered the "seen" styles. Our model gives translation results similar to the CycleGAN model for each target domain, but the difference is that we only need one unified model for the four target domains, whereas CycleGAN has to train four separate models. Moreover, the images with green bounding boxes are mixed-style images of their neighboring target styles, and the image in the center is the mixed-style image of all four target styles; these are new styles that are never seen in the training data. We can observe that our DLOW model generalizes well across different styles, which demonstrates the good domain generalization ability of our model.

Figure 7: Complete DLOW model structure: (a) the $\mathcal{S} \to \mathcal{T}$ direction; (b) the $\mathcal{T} \to \mathcal{S}$ direction. Orange, green, and purple dotted boxes denote the adversarial loss module, the image reconstruction module, and the domainness reconstruction module, respectively.
Figure 8: Network structure of the DLOW model for style generalization with four target domains: (a) the $\mathcal{S} \to \mathcal{T}$ direction; (b) the $\mathcal{T} \to \mathcal{S}$ direction.
Figure 9: Examples of style generalization I. Results with red rectangles at the four corners are images translated into the four target domains, and those with green rectangles in between are images translated into intermediate domains. The results show that our DLOW model generalizes well across styles and produces new image styles smoothly.
Figure 10: Examples of style generalization II. Results with red rectangles at the four corners are images translated into the four target domains, and those with green rectangles in between are images translated into intermediate domains. The results show that our DLOW model generalizes well across styles and produces new image styles smoothly.
Figure 11: Examples of style generalization III. Results with red rectangles at the four corners are images translated into the four target domains, and those with green rectangles in between are images translated into intermediate domains. The results show that our DLOW model generalizes well across styles and produces new image styles smoothly.