AlignGAN: Learning to Align Cross-Domain Images with Conditional Generative Adversarial Networks

07/05/2017 ∙ by Xudong Mao, et al. ∙ 0

Recently, several methods based on generative adversarial networks (GANs) have been proposed for the task of aligning cross-domain images or learning a joint distribution of cross-domain images. One approach is to use conditional GAN for alignment. However, previous attempts to adopt conditional GAN have not performed as well as other methods. In this work, we present an approach for improving the capability of methods based on conditional GAN. We evaluate the proposed method on numerous tasks, and the experimental results show that it is able to align cross-domain images successfully in the absence of paired samples. Furthermore, we propose another model that conditions on multiple kinds of information, such as domain information and label information. By conditioning on both, we are able to conduct label propagation from the source domain to the target domain. A 2-step alternating training algorithm is proposed to learn this model.

1 Introduction

Generative Adversarial Networks (GAN) [5] have proven hugely successful for various computer vision tasks [6, 8, 14]. This paper addresses the problem of aligning cross-domain images or learning a joint distribution of cross-domain images [9]. Early approaches [6, 17] to this problem require paired images from different domains, which limits their effectiveness. Recently, CoGAN [9] has been proposed, lifting the restriction of paired images. In particular, CoGAN couples two GANs whose generators share the weights of the first several layers, which guides the two generators to generate aligned images.

In this paper, we introduce a model called AlignGAN for aligning cross-domain images, which is based on the conditional GAN [13]. Similar to CoGAN, AlignGAN is able to align cross-domain images without paired images. The idea of using conditional GAN for alignment is to learn the domain-specific semantics through the conditioned domain vectors and to learn the shared semantics through the other latent vectors. However, as pointed out in [9], adopting conditional GAN directly fails to align cross-domain images for some tasks. We find that the choice of which layers are conditioned on the domain vectors is critical to the performance. AlignGAN is inspired by the following two ideas. First, for the generator, the highest-level semantics of different domains should be similar. Thus we should not condition the noise input layer of the generator on the domain vectors. Second, for the discriminator, we should strengthen the domain information signal so that the discriminator knows which domain the images are from. The image input layer provides the strongest signal for the discriminator. Thus we should condition the image input layer of the discriminator on the domain vectors. We explore AlignGAN on many tasks, including digits and negative digits, blond hair and black hair, and chairs and cars. Furthermore, AlignGAN is not limited to two domains: it can be extended to three or more domains by simply adding more dimensions to the domain vectors, as Figure 4(a) shows.

Based on AlignGAN, we also propose another model that is conditioned on multiple kinds of information, such as domain information and label information. Suppose we only have the label information of the source domain. By learning the label information from the source domain and aligning the images using the domain information, the model is able to propagate the label information from the source domain to the target domain. However, training directly on the combined conditioning information is hard to bring to convergence. We propose to condition domain vectors and label vectors on different layers and to train the model via alternating optimization.

In this paper, we make the following contributions:

  • We propose AlignGAN, a model based on conditional GAN for aligning cross-domain images. We evaluate AlignGAN on numerous tasks, and the experimental results demonstrate its effectiveness for aligning cross-domain images.

  • We also propose another model that conditions on multiple kinds of information, such as domain information and label information. This model is able to propagate the label information from the source domain to the target domain. In addition, a 2-step alternating optimization algorithm is proposed to train this model.

2 Related Works

Goodfellow et al. [5] proposed the generative adversarial network (GAN), which has achieved great success in generative modeling. Since then, many works have been proposed to improve the image quality [11, 14, 19] or to stabilize the learning process [1, 12, 16]. Further, GAN has been applied to various computer vision tasks such as image super-resolution [8], text-to-image translation [15], and image-to-image translation [6].

The most relevant work to this paper is CoGAN [9], which also tries to align cross-domain images. In [9], the authors also tried to use conditional GAN for this task. However, their attempt failed on many tasks such as aligning digits and negative digits. Another task related to our work is image-to-image translation [7, 21]. Both [20] and [7] adopted two GANs that form a cycle mapping, which yields a reconstruction loss. Dong et al. [3] proposed to use conditional GAN for image-to-image translation. They first trained a conditional GAN to learn shared features and then trained an encoder to map the images to latent vectors.

3 Model

In this section, we first briefly review GAN and conditional GAN in Section 3.1. Then we present the proposed AlignGAN in Section 3.2. Finally, the model conditioned on multiple kinds of information is introduced in Section 3.3.

3.1 GAN and Conditional GAN

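The framework of GAN consists of two players, the discriminator $D$ and the generator $G$. Given a data distribution $p_{data}$, $G$ tries to learn the distribution $p_g$. $G$ starts by sampling a noise input $z$ from a uniform distribution $p_z(z)$ and then maps $z$ to the data space $G(z; \theta_g)$. On the other hand, $D$ aims to distinguish whether a sample is from $p_{data}$ or from $p_g$. The objective for GAN can be formulated as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

Conditional GAN introduces extra information $y$, where both the discriminator and the generator are conditioned on $y$. The objective for conditional GAN can be formulated as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \tag{2}$$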
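As a rough numerical illustration of Equations 1 and 2, the value function can be estimated from discriminator outputs on a batch of real and generated samples. The sketch below is a toy under stated assumptions: `gan_value` and its argument names are hypothetical, the probabilities are made up, and for the conditional case the outputs are simply assumed to have been computed with the conditioning information supplied to both networks.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the GAN value function.

    d_real: discriminator outputs on real samples, each in (0, 1).
    d_fake: discriminator outputs on generated samples, each in (0, 1).
    For the conditional objective, pass outputs that were computed with
    the conditioning information given to the networks.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A sharper discriminator scores real samples high and fake samples low.
sharp = gan_value([0.9, 0.8], [0.1, 0.2])
```

The discriminator tries to raise this value while the generator tries to lower it, so `sharp` exceeds `confused` in this toy batch.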
Figure 1: Network architecture of AlignGAN. (a): The discriminator. (b): The generator. "Conv" and "Deconv" denote the convolutional layer and deconvolutional layer, respectively. "FC" denotes the fully connected layer.

3.2 AlignGAN

Our proposed AlignGAN is based on conditional GAN. The intuition is to learn the domain-specific semantics through the conditioned domain vectors and to learn the shared semantics through the other, shared latent vectors. A previous attempt [9] to use conditional GAN to align cross-domain images failed on many tasks. After extensive exploration, we arrive at the following two rules for successful learning.

First, for the generator, the noise input layer should not be conditioned on the domain vectors, because the model should learn identical highest-level semantics for different domains. The other layers of the generator should be conditioned on the domain vectors.

Second, for the discriminator, the image input layer should be conditioned on the domain vectors, because the input layer provides the strongest signal for letting the discriminator know which domain the images are from. For the other layers of the discriminator, we find that whether they are conditioned is not critical to the performance.

Based on the above two rules, we present the network architecture of AlignGAN in Figure 1.
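The two rules can be sketched as wiring, independent of any deep learning library. The code below is only an illustration under assumptions: conditioning is modeled as concatenating a one-hot domain vector onto a layer's input (the text does not state the mechanism), layer sizes are arbitrary, hidden activations are stand-in zeros, and all function names are hypothetical.

```python
def one_hot(index, size):
    """Return a one-hot list, e.g. [0.0, 1.0, 0.0] for index 1 of 3."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def condition(features, domain_vec):
    """Condition a layer input by concatenating the domain vector (assumed mechanism)."""
    return list(features) + list(domain_vec)

def generator_layer_inputs(noise, hidden_sizes, domain_vec):
    """Return the input each generator layer would see under rule one.

    The noise input layer receives the noise alone; every later layer
    receives its incoming features plus the domain vector.
    """
    inputs = [list(noise)]                  # layer 0: no domain vector
    for size in hidden_sizes:
        activations = [0.0] * size          # placeholder for real features
        inputs.append(condition(activations, domain_vec))
    return inputs

def discriminator_input(image_pixels, domain_vec):
    """Rule two: the image input layer of the discriminator is conditioned."""
    return condition(image_pixels, domain_vec)

num_domains = 3                             # three domains need three dimensions
domain = one_hot(1, num_domains)
g_inputs = generator_layer_inputs([0.1, -0.3, 0.7, 0.2], [8, 16], domain)
d_input = discriminator_input([0.0] * 12, domain)
```

Extending to more domains only changes `num_domains`, which matches the remark that more domains are handled by adding dimensions to the domain vectors.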

Figure 2: Network architecture of the model conditioning on multiple information. (a): The discriminator. (b): The generator.

3.3 Conditioning on Multiple Information

The other model we propose conditions on multiple kinds of information, such as domain information and label information. Domain information helps to align images from different domains, and label information allows us to control the class of generated images. One application of combining the two is to propagate label information from the source domain to the target domain when labels are available only for the source domain. The idea is to learn the semantics of the labels from the source domain and to align the images using the domain information. As a result, the model is able to control the class of generated images in the target domain. One simple method is to concatenate the domain and label vectors and condition the generator and discriminator on the result. However, we find that this simple method does not converge. We propose to condition on the domain vectors and label vectors separately, meaning that they are applied at different layers. As stated in Section 3.2, the noise input layer of the generator should not be conditioned on the domain vectors. For the label vectors, in contrast, the highest-level semantics vary across classes. Thus the noise input layer of the generator should be conditioned on the label vectors. As Figure 2 shows, we apply the label vectors to the layers that are not conditioned on the domain vectors.
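The separation of label and domain conditioning in the generator can be written down as a per-layer plan. This is a minimal sketch under assumptions: conditioning is again modeled as concatenation, the layer count and vector sizes are arbitrary, and `conditioning_plan` and `layer_input` are hypothetical helper names. Only the generator is shown.

```python
def conditioning_plan(num_generator_layers):
    """Map each generator layer index to the vector it is conditioned on.

    Layer 0 is the noise input layer. It is the only generator layer not
    conditioned on the domain vector, so it receives the label vector;
    the remaining layers receive the domain vector.
    """
    plan = {}
    for layer in range(num_generator_layers):
        plan[layer] = "label" if layer == 0 else "domain"
    return plan

def layer_input(features, plan_entry, label_vec, domain_vec):
    """Concatenate the vector selected by the plan onto the layer features."""
    extra = label_vec if plan_entry == "label" else domain_vec
    return list(features) + list(extra)

label_vec = [0.0] * 10
label_vec[3] = 1.0             # class "3" out of ten digit classes
domain_vec = [0.0, 1.0]        # second of two domains

plan = conditioning_plan(4)
first = layer_input([0.5, -0.5], plan[0], label_vec, domain_vec)
second = layer_input([0.0] * 6, plan[1], label_vec, domain_vec)
```

Keeping the two vectors on disjoint layers is what distinguishes this layout from the simple method of concatenating them into one conditioning vector.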

Figure 3: Generated results on digit datasets. (a): Digits and edge digits. (b): Digits and negative digits. (c): USPS and MNIST.
Figure 4: Generated results on face dataset. (a): Black hair, blond hair and brown hair. (b): With glasses and without glasses. (c): Male and female. (d): With sideburns and without sideburns.
Figure 5: Generated results on edge and photo dataset. (a): Shoes. (b): Handbags.

2-Step Alternating Training. We adopt a 2-step training algorithm to learn the domain-specific semantics and shared label semantics via alternating optimization. In the first step, we use the source domain images with label vectors to learn the label semantics, and the domain vectors are set to zero vectors. In the second step, we use both the source and target domain images with domain vectors to learn the domain-specific semantics, and the label vectors are set to zero vectors. The training procedure is formally presented in Algorithm 1. Note that a hyperparameter is used to adjust the allocation of training iterations between domain semantics and label semantics. In our experiments, we set .

  Input:
       Source domain images with domain vectors and label vectors.
       Target domain images with domain vectors.
  for number of training steps do
     if the step is a label-semantics step (decided by the step index modulo the hyperparameter) then
          Update the discriminator using source images with label vectors and zero domain vectors.
          Update the generator with label vectors and zero domain vectors.
     else
          Update the discriminator using source and target images with domain vectors and zero label vectors.
          Update the generator with domain vectors and zero label vectors.
     end if
  end for
Algorithm 1 Alternating training for conditioning on multiple information.
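The control flow of Algorithm 1 can be sketched without any network code. Everything specific here is an assumption made for illustration: the hyperparameter is called `k`, a step counts as a label step when `step % k == 0`, and the function names are hypothetical. The actual condition and value used in the experiments are not given by this sketch.

```python
def alternating_schedule(num_steps, k):
    """Yield (step, phase) pairs for a 2-step alternating schedule.

    ASSUMPTION: a step is a "label" step when step % k == 0 and a
    "domain" step otherwise, so a larger k gives more domain steps.
    """
    for step in range(num_steps):
        yield step, ("label" if step % k == 0 else "domain")

def conditioning_for(phase, label_vec, domain_vec):
    """Return the (label, domain) vectors fed to the networks in a phase.

    The vector that is not being learned in the phase is zeroed, as in
    Algorithm 1. Label steps use source images only; domain steps use
    both source and target images.
    """
    if phase == "label":
        return list(label_vec), [0.0] * len(domain_vec)
    return [0.0] * len(label_vec), list(domain_vec)

phases = [phase for _, phase in alternating_schedule(6, 3)]
label_step = conditioning_for("label", [0.0, 1.0, 0.0], [1.0, 0.0])
domain_step = conditioning_for("domain", [0.0, 1.0, 0.0], [1.0, 0.0])
```

In a real training loop, each phase would be followed by one discriminator update and one generator update using the vectors returned by `conditioning_for`.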

4 Experiments

4.1 Implementation Details

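Except for the task of aligning digits and negative digits, we adopt LSGAN [11] for training the models, since LSGAN is able to generate higher quality images and stabilize the learning process. For the task of aligning digits and negative digits, we adopt regular GAN because we find that regular GAN performs well for this task, while LSGAN sometimes fails to align the images of digits and negative digits. For LSGAN, we select the parameters $a$, $b$, and $c$ that have been proven to minimize the Pearson $\chi^2$ divergence [11], where $a$ and $b$ are the discriminator targets for generated and real data and $c$ is the value the generator wants the discriminator to assign to generated data. Then Equation 1 is replaced with the following formula:

$$\begin{aligned}
\min_D V_{LSGAN}(D) &= \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{data}(x)}\big[(D(x) - b)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - a)^2\big] \\
\min_G V_{LSGAN}(G) &= \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - c)^2\big]
\end{aligned} \tag{3}$$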
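The least-squares objective replaces the log terms of Equation 1 with squared distances to target values. The sketch below computes both losses on toy discriminator outputs. The defaults a = -1, b = 1, c = 0 are an assumption taken from the LSGAN paper [11], where they are given as a setting that minimizes the Pearson chi-squared divergence; the function names are hypothetical.

```python
def lsgan_d_loss(d_real, d_fake, a=-1.0, b=1.0):
    """Least-squares discriminator loss.

    Pushes outputs on real samples toward b and outputs on generated
    samples toward a. Defaults are assumed values, see the lead-in.
    """
    real_term = sum((p - b) ** 2 for p in d_real) / len(d_real)
    fake_term = sum((p - a) ** 2 for p in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term

def lsgan_g_loss(d_fake, c=0.0):
    """Least-squares generator loss: pushes outputs on generated samples toward c."""
    return 0.5 * sum((p - c) ** 2 for p in d_fake) / len(d_fake)

# A discriminator that hits its targets exactly has zero loss.
perfect_d = lsgan_d_loss([1.0], [-1.0])
# An uninformative discriminator that outputs 0 everywhere is penalized.
flat_d = lsgan_d_loss([0.0], [0.0])
# The generator is penalized when its samples are scored away from c.
g_loss = lsgan_g_loss([-1.0, 1.0])
```

Unlike the log loss, these outputs are unbounded scores rather than probabilities, so the discriminator's final layer would not use a sigmoid in this formulation.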

We use the Adam optimizer with learning rates of for LSGAN and for regular GAN. All the code of our implementation will be made publicly available soon.

Model Selection. For LSGAN, we find that the quality of generated images shifts between good and bad during the training process. We select the model manually by checking the quality of generated images at some iterations.

4.2 AlignGAN

In this section, we evaluate AlignGAN on several datasets including digits, faces, edges, chairs, and cars.

4.2.1 Digits

For this task, we use the USPS and MNIST datasets to evaluate the performance of AlignGAN. Following [9], we first evaluate AlignGAN on two tasks. The first is to align images of digits and edge digits. The second is to align images of digits and negative digits. In addition, we apply AlignGAN to align images of USPS and MNIST digits. As Figure 3 shows, AlignGAN learns to align the images successfully for all three tasks.

4.2.2 Faces

We also evaluate AlignGAN on face images, using the CelebFaces Attributes dataset [10]. We investigate the following four tasks: 1) alignment between different hair colors; 2) alignment between wearing and not wearing eyeglasses; 3) alignment between male and female; 4) alignment between males with sideburns and males without sideburns. The results are presented in Figure 4, where the resolution of generated images is .

4.2.3 Edges and Photos

Another evaluation is to align edge images with realistic photos of handbags [20] or shoes [18]. Figure 5 shows the generated results with the resolution of and we can observe that AlignGAN successfully learns to align edges with realistic photos.

Figure 6: Generated results on chair and car dataset. The rotation angles of generated chairs and cars are highly correlated.

4.2.4 Chairs and Cars

Following [7], we also investigate the task of aligning images of chairs [2] and cars [4] to study whether AlignGAN is able to learn the rotation relationship between the two domains. As Figure 6 shows, the rotation angles of generated chairs and cars are highly correlated.

Figure 7: Generated results on digit datasets conditioning on domain and label information. The digits are generated from 0 to 9 by controlling the label vectors. (a): Digits and negative digits. (b): USPS and MNIST.

4.3 Conditioning on Multiple Information

We apply the proposed model conditioned on multiple kinds of information to two tasks. The MNIST dataset is used for the first task, where the source and target domains are digits and negative digits, respectively. The second task is between USPS digits and MNIST digits. Only the label information of the source domain is used during training. We generate the digits from 0 to 9 by controlling the label vectors, and the results are shown in Figure 7. We make the following two observations. First, the paired images in Figure 7 are highly correlated. Second, we are able to control the classes of generated target domain digits by adjusting the label vectors.

5 Conclusions

In this paper, we proposed two models. The first, called AlignGAN, aligns cross-domain images based on conditional GAN. AlignGAN has been evaluated on numerous tasks, and the experimental results demonstrate its effectiveness for aligning cross-domain images. The second is an extension of AlignGAN that conditions not only on domain information but also on label information. By conditioning on these two kinds of information, we are able to perform label propagation from the source domain to the target domain.

References