Unpaired Multi-Domain Image Generation via Regularized Conditional GANs

05/07/2018 · Xudong Mao et al.

In this paper, we study the problem of multi-domain image generation, whose goal is to generate pairs of corresponding images from different domains. With recent developments in generative models, image generation has achieved great progress and has been applied to various computer vision tasks. However, multi-domain image generation may not achieve the desired performance, owing to the difficulty of learning the correspondence between images of different domains, especially when paired samples are not given. To tackle this problem, we propose the Regularized Conditional GAN (RegCGAN), which is capable of learning to generate corresponding images in the absence of paired training data. RegCGAN is based on the conditional GAN, and we introduce two regularizers to guide the model in learning the corresponding semantics of different domains. We evaluate the proposed model on several tasks for which paired training data is not given, including the generation of edges and photos and the generation of faces with different attributes. The experimental results show that our model can successfully generate corresponding images for all these tasks, while outperforming the baseline methods. We also introduce an approach for applying RegCGAN to unsupervised domain adaptation.


1 Introduction

Figure 1:

The framework of RegCGAN. The domain variables (in purple) are fed to all layers of the generator and to the input image layer of the discriminator. (a): Input pairs consist of identical latent variables but different domain variables. (b): One regularizer is applied to the first layer of the generator. It penalizes the distance between the first layer's outputs for the input pairs in (a), which guides the generator to decode similar high-level semantics for corresponding images. (c): The generator generates pairs of corresponding images. (d): Another regularizer is applied to the last hidden layer of the discriminator. It enforces the discriminator to output similar losses for the corresponding images; these similar losses are used to update the generator. This regularizer also makes the model output invariant feature representations for corresponding images from different domains, which can be used for domain adaptation by attaching a classifier.

Multi-domain image generation is an important extension of image generation in computer vision. It has many promising applications, such as improving generated image quality [Dosovitskiy et al.2015, Wang and Gupta2016], image-to-image translation [Perarnau et al.2016, Wang et al.2017], and unsupervised domain adaptation [Liu and Tuzel2016]. As shown in Figures 2 and 3, a successful model for multi-domain image generation should generate pairs of corresponding images that share common semantics while differing in domain-specific semantics. Several early approaches [Dosovitskiy et al.2015, Wang and Gupta2016] have been proposed, but they all operate in the supervised setting, i.e., they require paired samples to be available. In practice, however, building paired training datasets can be very expensive and is not always feasible.

Recently, CoGAN [Liu and Tuzel2016] has been proposed and has achieved great success in multi-domain image generation. In particular, CoGAN models the problem as learning a joint distribution over multi-domain images by coupling multiple GANs. Unlike previous methods that require paired training data, CoGAN learns the joint distribution without any paired samples. However, it falls short on some difficult tasks, such as the generation of edges and photos, as our experiments demonstrate.

In this paper, we propose a new framework called Regularized Conditional GAN (RegCGAN). Like CoGAN, RegCGAN is capable of performing multi-domain image generation in the absence of paired samples. RegCGAN is based on the conditional GAN [Mirza and Osindero2014] and learns a conditional distribution over multi-domain images, where the domain-specific semantics are encoded in the conditioned domain variables and the common semantics are encoded in the shared latent variables.

As pointed out in [Liu and Tuzel2016], directly using the conditional GAN fails to learn the corresponding semantics. To overcome this problem, we propose two regularizers that guide the model to encode the common semantics in the shared latent variables, which in turn makes the model generate corresponding images. As shown in Figure 1(a)(b), one regularizer is applied to the first layer of the generator. This regularizer penalizes the distance between the first layer's outputs for a paired input, where the pair consists of identical latent variables but different domain variables. It thus enforces the generator to decode similar high-level semantics for the paired input, since the first layer decodes the highest-level semantics. This strategy is based on the observation that corresponding images from different domains always share some high-level semantics (see Figures 2, 3, and 5). As shown in Figure 1(c)(d), the second regularizer is applied to the last hidden layer of the discriminator, which is responsible for encoding the highest-level semantics. This regularizer enforces the discriminator to output similar losses for pairs of corresponding images. These losses are then used to update the generator, guiding it to generate similar (corresponding) images.

One intuitive application of RegCGAN is unsupervised domain adaptation, since the second regularizer (Figure 1(d)) makes the last hidden layer output invariant feature representations for corresponding images (Figure 1(c)). We attach a classifier to the last hidden layer and train it jointly with the discriminator using labeled images from the source domain. As a result, the classifier can classify images from the target domain thanks to the learned invariant feature representations.

2 Related Work

2.1 Multi-Domain Image Generation

Image generation is one of the most fundamental problems in computer vision. Classic approaches include the Restricted Boltzmann Machine [Tieleman2008] and the Autoencoder [Bengio et al.2013]. Recently, two successful approaches, the Variational Autoencoder (VAE) [Kingma and Welling2014] and the Generative Adversarial Network (GAN) [Goodfellow et al.2014], have been proposed. Our model is based on GAN. The idea of GAN is to find a Nash equilibrium between the generator network and the discriminator network. GAN has achieved great success in image generation, and numerous variants [Radford et al.2015, Nowozin et al.2016, Arjovsky et al.2017, Mao et al.2017] have been proposed to improve image quality and training stability.

Multi-domain image generation extends image generation to settings involving images from two or more domains. A successful model should generate pairs of corresponding images, i.e., image pairs that share common semantics but differ in domain-specific semantics. It has many promising applications, such as improving generated image quality [Dosovitskiy et al.2015, Wang and Gupta2016] and image-to-image translation [Perarnau et al.2016, Wang et al.2017]. Early approaches [Dosovitskiy et al.2015, Wang and Gupta2016] work under the supervised setting, where pairing information is provided. However, building training datasets with pairing information is not always feasible and can be very expensive. The recently proposed CoGAN [Liu and Tuzel2016] performs multi-domain image generation in the absence of any paired images. CoGAN consists of multiple GANs, one per image domain, and the weights of some layers are tied to learn the shared semantics.

2.2 Regularization Methods

Regularization methods have proven effective in GAN training [Che et al.2016, Gulrajani et al.2017, Roth et al.2017]. Che et al. [Che et al.2016] introduced several regularizers that penalize missing modes, which alleviates the mode-missing problem. Gulrajani et al. [Gulrajani et al.2017] proposed an effective way of regularizing the gradients at points sampled between the data distribution and the generator distribution. Roth et al. [Roth et al.2017] proposed a weighted gradient-based regularizer applicable to various GANs. In this paper, we adopt regularization to enforce the model to generate corresponding images.

(a) Digits and edge digits. (b) Digits and negative digits. (c) MNIST and USPS digits.
Figure 2: Generated image pairs on digits.

3 Framework and Approach

3.1 Generative Adversarial Network

The GAN framework consists of two players, the discriminator $D$ and the generator $G$. Given a data distribution $p_{\text{data}}(x)$, $G$ tries to learn a distribution $p_g$ over data $x$. $G$ starts by sampling a noise input $z$ from a simple distribution $p_z(z)$, and then maps $z$ to data space as $G(z)$. On the other hand, $D$ aims to distinguish whether a sample comes from $p_{\text{data}}$ or from $p_g$. The objective for GAN can be formulated as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]. \tag{1}$$

3.2 Regularized Conditional GAN

In our approach, the problem of multi-domain image generation is modeled as learning a conditional distribution $p(x \mid d)$ over data $x$, where $d$ denotes the domain variable. We propose the Regularized Conditional GAN (RegCGAN) for learning $p(x \mid d)$. Our idea is to encode the domain-specific semantics in the domain variable $d$ and the common semantics in the shared latent variables $z$. To achieve this, the conditional GAN is adopted and two regularizers are proposed: one added to the first layer of the generator, the other to the last hidden layer of the discriminator.

Specifically, as Figure 1 shows, for an input pair $(z, d_s)$ and $(z, d_t)$ with identical $z$ but different domain variables, the first regularizer penalizes the distance between the first layer's outputs of $G$ on the two inputs, which enforces $G$ to decode similar high-level semantics, since the first layer decodes the highest-level semantics. On the other hand, for a pair of corresponding images $G(z \mid d_s)$ and $G(z \mid d_t)$, the second regularizer penalizes the distance between the last hidden layer's outputs of $D$ on the two images. As a result, $D$ outputs similar losses for pairs of corresponding images. When updating $G$, these similar losses guide $G$ to generate similar (corresponding) images. Note that using the above two regularizers requires constructing input pairs with identical $z$ but different $d$.

Formally, during training we construct mini-batches with input pairs $(z, d_s)$ and $(z, d_t)$, where the noise input $z$ is the same. $G$ maps the noise input to a conditional data space $G(z \mid d)$. An L2-norm regularizer is used to enforce $h_1(z \mid d)$, the output of $G$'s first layer, to be similar for each input pair. Another L2-norm regularizer is used to enforce $f_L(\cdot)$, the output of $D$'s last hidden layer, to be similar for each input pair. The objective function for RegCGAN can then be formulated as follows:

$$\begin{aligned} \min_G \max_D \;\; & \mathbb{E}_{x \sim p_{\text{data}}(x \mid d)}\big[\log D(x \mid d)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid d))\big)\big] \\ & + \lambda_1 \, \mathbb{E}_{z \sim p_z(z)}\big[\lVert h_1(z \mid d_s) - h_1(z \mid d_t) \rVert\big] + \lambda_2 \, \mathbb{E}_{z \sim p_z(z)}\big[\lVert f_L(G(z \mid d_s)) - f_L(G(z \mid d_t)) \rVert\big], \end{aligned} \tag{2}$$

where the scalars $\lambda_1$ and $\lambda_2$ adjust the weights of the regularization terms, $\lVert\cdot\rVert$ denotes the L2-norm, $d_s$ and $d_t$ denote the source domain and target domain, respectively, $h_1(\cdot)$ denotes the output of $G$'s first layer, and $f_L(\cdot)$ denotes the output of $D$'s last hidden layer.
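To make the training procedure concrete, below is a minimal PyTorch sketch of one RegCGAN update under Equation 2. It is not the authors' implementation (that is linked in Section 4.1); it assumes hypothetical `G` and `D` modules whose forward passes also return the first-layer output $h_1$ and the last-hidden-layer output $f_L$, respectively, and all names and weights are illustrative.

```python
# Sketch of one RegCGAN training step (Equation 2); module interfaces are
# assumptions, not the official implementation.
import torch
import torch.nn.functional as F

def regcgan_step(G, D, x_s, x_t, z, d_s, d_t, opt_G, opt_D,
                 lambda1=1.0, lambda2=1.0):
    """x_s, x_t: real source/target images; z: shared noise;
    d_s, d_t: one-hot domain variables.
    Assumed interfaces: G(z, d) -> (image, first_layer_out);
    D(x, d) -> (logit, last_hidden_out)."""
    bce = F.binary_cross_entropy_with_logits

    # Discriminator update: standard conditional GAN loss on both domains.
    fake_s, _ = G(z, d_s)
    fake_t, _ = G(z, d_t)
    real_logit_s, _ = D(x_s, d_s)
    real_logit_t, _ = D(x_t, d_t)
    fake_logit_s, _ = D(fake_s.detach(), d_s)
    fake_logit_t, _ = D(fake_t.detach(), d_t)
    d_loss = (bce(real_logit_s, torch.ones_like(real_logit_s))
              + bce(real_logit_t, torch.ones_like(real_logit_t))
              + bce(fake_logit_s, torch.zeros_like(fake_logit_s))
              + bce(fake_logit_t, torch.zeros_like(fake_logit_t)))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator update: adversarial loss plus the two regularizers.
    fake_s, h1_s = G(z, d_s)            # h1_*: G's first-layer outputs
    fake_t, h1_t = G(z, d_t)
    logit_s, feat_s = D(fake_s, d_s)    # feat_*: D's last hidden layer
    logit_t, feat_t = D(fake_t, d_t)
    adv = (bce(logit_s, torch.ones_like(logit_s))
           + bce(logit_t, torch.ones_like(logit_t)))
    reg1 = (h1_s - h1_t).flatten(1).norm(p=2, dim=1).mean()      # first regularizer
    reg2 = (feat_s - feat_t).flatten(1).norm(p=2, dim=1).mean()  # second regularizer
    g_loss = adv + lambda1 * reg1 + lambda2 * reg2
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```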

As stated before, RegCGAN can be applied to unsupervised domain adaptation, since the last hidden layer of $D$ outputs invariant feature representations for corresponding images from different domains. Based on these invariant feature representations, we attach a classifier $C$ to the last hidden layer of $D$. The classifier is jointly trained with $D$, and the joint objective function is:

$$\min_{G, C} \max_D \; V_{\text{Reg}}(G, D) + \lambda_3 \mathcal{L}_c, \tag{3}$$

where $V_{\text{Reg}}(G, D)$ denotes the objective in Equation 2, the scalars $\lambda_1$, $\lambda_2$, and $\lambda_3$ adjust the weights of the regularization terms and the classifier, and $\mathcal{L}_c$ is a standard cross-entropy loss computed with $C$ on labeled source-domain images.
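As a hedged sketch of this extension, the snippet below attaches a classifier head to the discriminator's last hidden layer and computes the $\mathcal{L}_c$ term of Equation 3 on labeled source images. The module names, feature dimension, and class count are assumptions, reusing the hypothetical `D` interface from the sketch above.

```python
# Sketch of the domain-adaptation head (the L_c term of Equation 3);
# feature dimension and class count are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    def __init__(self, feat_dim=1024, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feat):
        return self.fc(feat)

def classification_loss(D, C, x_src, y_src, d_s, lambda3=1.0):
    # Only labeled source-domain images contribute to L_c.
    _, feat = D(x_src, d_s)   # D's last hidden layer
    return lambda3 * F.cross_entropy(C(feat), y_src)

# At test time, target-domain images go through the same path:
#   _, feat = D(x_tgt, d_t); pred = C(feat).argmax(dim=1)
# which works because the second regularizer makes `feat` domain-invariant.
```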

Note that our approach to domain adaptation differs from the method in [Ganin et al.2016], which minimizes the difference between the overall distributions of the source and target domains. In contrast, our approach minimizes distances among samples belonging to the same category, because we only penalize the distances between pairs of corresponding images, which belong to the same category.

(a) Shoes by RegCGAN (Ours). (b) Handbags by RegCGAN (Ours).
(c) Shoes by CoGAN. (d) Handbags by CoGAN.
Figure 3: Generated image pairs on shoes and handbags.

4 Experiments

4.1 Implementation Details

Except for the digit tasks (i.e., MNIST and USPS), we adopt LSGAN [Mao et al.2017] for training, since LSGAN generates higher-quality images and trains more stably. For the digit tasks we adopt the standard GAN, since we find that LSGAN sometimes generates unaligned digit pairs.

We use the Adam optimizer, with separate learning rates for LSGAN and for the standard GAN. For the hyper-parameters in Equations 2 and 3, we set $\lambda_1$, $\lambda_2$, and $\lambda_3$ by grid search. Our implementation is available at https://github.com/xudonmao/RegCGAN.
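For concreteness, here is a minimal sketch of the two adversarial losses used in the experiments; the least squares formulation follows [Mao et al.2017]. The Adam learning rates shown in the comments are conventional placeholders rather than the paper's values.

```python
# Sketch of the adversarial losses: LSGAN for most tasks, standard GAN for digits.
import torch
import torch.nn.functional as F

def d_loss(real_score, fake_score, use_lsgan=True):
    if use_lsgan:  # LSGAN: regress real scores toward 1, fake scores toward 0
        return 0.5 * ((real_score - 1) ** 2).mean() + 0.5 * (fake_score ** 2).mean()
    return (F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score))
            + F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score)))

def g_loss(fake_score, use_lsgan=True):
    if use_lsgan:  # LSGAN: regress fake scores toward 1
        return 0.5 * ((fake_score - 1) ** 2).mean()
    return F.binary_cross_entropy_with_logits(fake_score, torch.ones_like(fake_score))

# Optimizers (learning rates are placeholders, not the paper's values):
# opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
# opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
```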

4.2 Digits

We first evaluate RegCGAN on the MNIST and USPS datasets. Since the image sizes of MNIST and USPS differ, we resize the USPS images to the resolution of MNIST (i.e., 28×28). We train RegCGAN on the following three tasks. Following [Liu and Tuzel2016], the first two tasks are the generation of 1) digits and edge digits and 2) digits and negative digits. The third task is the generation of MNIST and USPS digits. For these tasks, we design the network architecture following the suggestions in [Radford et al.2015]: the generator consists of four transposed convolutional layers, and the discriminator is a variant of LeNet [Lecun et al.1998]. The generated image pairs are shown in Figure 2; RegCGAN clearly succeeds in generating corresponding digits for all three tasks.
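As a hedged illustration of such an architecture, the sketch below builds a generator with four transposed-convolutional layers mapping a noise vector and a one-hot domain variable to a 28×28 image. All layer widths and kernel sizes are assumptions, and for brevity the domain variable is concatenated only at the input, whereas Figure 1 conditions every layer of the generator.

```python
# Illustrative four-layer transposed-convolutional generator for 28x28 digits.
import torch
import torch.nn as nn

class DigitGenerator(nn.Module):
    def __init__(self, z_dim=100, d_dim=2, ch=64):
        super().__init__()
        in_ch = z_dim + d_dim  # noise + one-hot domain variable as 1x1 maps
        self.layers = nn.ModuleList([
            nn.ConvTranspose2d(in_ch, ch * 4, kernel_size=3, stride=1),          # 1 -> 3
            nn.ConvTranspose2d(ch * 4, ch * 2, kernel_size=3, stride=2),         # 3 -> 7
            nn.ConvTranspose2d(ch * 2, ch, kernel_size=4, stride=2, padding=1),  # 7 -> 14
            nn.ConvTranspose2d(ch, 1, kernel_size=4, stride=2, padding=1),       # 14 -> 28
        ])

    def forward(self, z, d):
        x = torch.cat([z, d], dim=1).unsqueeze(-1).unsqueeze(-1)  # (B, z+d, 1, 1)
        h1 = self.layers[0](x)  # first-layer output, used by the first regularizer
        h = torch.relu(h1)
        h = torch.relu(self.layers[1](h))
        h = torch.relu(self.layers[2](h))
        img = torch.tanh(self.layers[3](h))  # 28x28 image in [-1, 1]
        return img, h1
```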

Without Regularizers If we remove the proposed regularizers from RegCGAN, the model fails to generate corresponding digits, as Figure 4 shows. This demonstrates that the proposed regularizers play an essential role in generating corresponding images.

4.3 Edges and Photos

We also train RegCGAN on the task of generating corresponding edges and photos. The Handbag [Zhu et al.2016] and Shoe [Yu and Grauman2014] datasets are used for this task. We randomly shuffle the edge images and the photos to avoid using any pairing information, and resize all images to a common resolution. For the network architecture, the generator consists of four transposed convolutional layers and the discriminator of four strided convolutional layers. As shown in Figure 3(a)(b), RegCGAN is able to generate corresponding images of edges and photos.

(a) With regularizer. (b) Without regularizer.
Figure 4: Comparison experiments between the models with and without the regularizer.

Comparison with CoGAN We also train CoGAN, the current state-of-the-art method, on edges and photos using its official implementation. We evaluate two network architectures for CoGAN: (1) the architecture used in [Liu and Tuzel2016] and (2) the same architecture as RegCGAN. We also evaluate both the standard GAN loss and the least squares loss (LSGAN) for CoGAN. All of these settings fail to generate corresponding images of edges and photos; the results are shown in Figure 3(c)(d).

(a) Blond and black hair by RegCGAN (Ours). (b) Female and male by RegCGAN (Ours).
(c) Blond and black hair by CoGAN. (d) Female and male by CoGAN.
Figure 5: Generated image pairs on faces with different attributes. The image pairs of black and blond hair by CoGAN are reproduced from the CoGAN paper.

4.4 Faces

In this task, we evaluate RegCGAN on the CelebA dataset [Liu et al.2014]. We first crop the facial region at the center of each image following [Karras et al.2017], and then resize all cropped images to a common resolution. The network architecture is similar to the one in Section 4.3 except for the output dimensions of the layers. We investigate the following two tasks: 1) blond hair and black hair (female); and 2) female and male. The results are presented in Figure 5(a)(b). RegCGAN generates corresponding face images with different attributes, and the corresponding faces have very similar appearances.

Comparison with CoGAN The image pairs generated by CoGAN are also presented in Figure 5, where the pairs of black and blond hair by CoGAN are reproduced from [Liu and Tuzel2016]. We observe that the image pairs generated by RegCGAN are more consistent and of better quality than those by CoGAN, especially on the female-and-male task, which is more difficult than the blond-and-black-hair task.

Comparison with CycleGAN We also compare RegCGAN with CycleGAN [Zhu et al.2017], the state-of-the-art method in image-to-image translation. To compare with CycleGAN, we first generate image pairs using RegCGAN and then use the generated images from one domain as input to CycleGAN. The results are presented in Figure 6. Compared with RegCGAN, CycleGAN introduces some blur into the generated images. Moreover, the colors of the image pairs by RegCGAN are more consistent than those by CycleGAN.

Figure 6: Comparison results between RegCGAN and CycleGAN for the task of female and male. The top two rows are generated by RegCGAN. The third row is generated by CycleGAN using the first row as input.

4.5 Quantitative Evaluation

To further evaluate the effectiveness of RegCGAN, we conduct a user study on Amazon Mechanical Turk (AMT), again using the female-and-male generation task. In particular, given two image pairs randomly selected from RegCGAN and CoGAN, the AMT annotators are asked to choose the better one based on image quality, perceptual realism, and the appearance consistency of the female and male faces. Out of 3,000 votes in total, the annotators preferred the image pairs from RegCGAN in 77.6% of cases (2,327 votes), demonstrating that the overall image quality of our model is better than that of CoGAN.

	CoGAN	RegCGAN (Ours)
User Choice	673 / 3000 (22.4%)	2327 / 3000 (77.6%)
Table 1: A user study on the task of female and male generation. Out of 3,000 votes in total, 77.6% preferred the image pairs from RegCGAN.

4.6 More Applications

Chairs and Cars In this task, we use two visually very different datasets, Chairs [Aubry et al.2014] and Cars [Fidler et al.2012]. Both datasets contain synthesized samples with various orientations. We train RegCGAN on these two datasets to study whether it can generate corresponding images that share the same orientation. The generated results are shown in Figure 7. We further perform interpolation between two random points in the latent space, as shown in Figure 7(b). The interpolation shows smooth transitions of chairs and cars in both viewpoint and style, while the chairs and cars keep facing the same direction.
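A sketch of how such an interpolation could be produced, assuming the hypothetical `G` interface from the earlier sketches and trained one-hot domain variables `d_chair` and `d_car`:

```python
# Interpolate between two random latent points; each step yields a chair/car pair.
import torch

z0, z1 = torch.randn(1, 100), torch.randn(1, 100)
for alpha in torch.linspace(0, 1, steps=8):
    z = (1 - alpha) * z0 + alpha * z1  # linear interpolation in noise space
    chair, _ = G(z, d_chair)           # same z, different domain variables,
    car, _ = G(z, d_car)               # so the orientations stay aligned
```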

(a): Chairs and cars.
(b): Interpolation between two random points in noise input.
Figure 7: Generated image pairs on chairs and cars, where the orientations are highly correlated.
Figure 8: Generated image pairs on photos and depth images.

Photos and Depths The NYU depth dataset [Silberman et al.2012] is used to train RegCGAN on photos and depth images. In this task, we first resize all the images and then randomly crop patches for training. Figure 8 shows the generated image pairs.
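A sketch of this resize-then-crop pipeline using torchvision; the sizes used below are illustrative assumptions, not the paper's settings.

```python
# Placeholder preprocessing: resize, then randomly crop training patches.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(96),       # assumed resize target
    transforms.RandomCrop(64),   # assumed patch size
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # map to [-1, 1]
])
```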

Figure 9: Generated image pairs on photos and Monet-style images.

Photos and Monet-Style Images In this task we train RegCGAN on the Monet-style dataset [Zhu et al.2017], using the same pre-processing as for the photos and depth images. Figure 9 shows the generated image pairs.

Figure 10: Generated image pairs on summer Yosemite and winter Yosemite.

Summer and Winter We also train RegCGAN on the summer and winter Yosemite dataset [Zhu et al.2017], using the same pre-processing as for the photos and depth images. Figure 10 shows the generated image pairs.

4.7 Unsupervised Domain Adaptation

Method	MNIST→USPS	USPS→MNIST
Evaluated on the sampled set
DANN
ADDA
CoGAN
RegCGAN (Ours)
Evaluated on the test set
ADDA
CoGAN
RegCGAN (Ours)
Table 2: Accuracy results for unsupervised domain adaptation. The top section reports classification accuracy on the sampled set of the target domain; the bottom section reports accuracy on the standard test set of the target domain. The reported accuracies are averaged over multiple trials with different random samplings.

As mentioned in Section 3.2, RegCGAN can be applied to unsupervised domain adaptation. In this experiment, the MNIST and USPS datasets are used, with one as the source domain and the other as the target domain. We set $\lambda_1$, $\lambda_2$, and $\lambda_3$ by grid search. We use the same network architecture as in Section 4.2 and attach a classifier to the last hidden layer of the discriminator. Following the experimental protocol in [Liu and Tuzel2016, Tzeng et al.2017], we randomly sample 2,000 images from MNIST and 1,800 images from USPS.

We compare RegCGAN with baseline methods, including DANN [Ganin et al.2016], ADDA [Tzeng et al.2017], and CoGAN [Liu and Tuzel2016], in two settings. The first evaluates classification accuracy directly on the sampled images of the target domain, as adopted in [Liu and Tuzel2016, Tzeng et al.2017]. The second evaluates classification accuracy on the standard test set of the target domain, in order to assess the generalization error.

The results are presented in Table 2; the reported accuracies are averaged over multiple trials with different random samplings. On the standard test set, RegCGAN significantly outperforms all baseline methods, especially on the USPS-to-MNIST task, showing that RegCGAN has a smaller generalization error than the baselines. On the sampled set, RegCGAN outperforms all baseline methods on the MNIST-to-USPS task and achieves comparable performance on the USPS-to-MNIST task.

5 Conclusions

To tackle the problem of multi-domain image generation, we have proposed the Regularized Conditional GAN, in which the domain information is encoded in conditioned domain variables. Two regularizers are proposed: one added to the first layer of the generator, guiding the generator to decode similar high-level semantics, and one added to the last hidden layer of the discriminator, enforcing the discriminator to output similar losses for corresponding images. Experiments on a variety of multi-domain image generation tasks show that RegCGAN succeeds in generating pairs of corresponding images for all these tasks and outperforms all baseline methods. We have also introduced a method for applying RegCGAN to unsupervised domain adaptation.

References