UGAN: Untraceable GAN for Multi-Domain Face Translation

07/26/2019 · Defa Zhu et al. · Beihang University, Baidu Inc.

Multi-domain image-to-image translation has received increasing attention in the computer vision community. However, the translated images often retain the characteristics of the source domain. In this paper, we propose a novel Untraceable GAN (UGAN) to tackle this phenomenon of source retaining. Specifically, the discriminator of UGAN contains a novel source classifier that tells which domain an image is translated from, in order to determine whether the translated image still retains the characteristics of the source domain. After this adversarial training converges, the translator is able to synthesize the target-only characteristics and also erase the source-only characteristics. In this way, the source domain of the synthesized image becomes untraceable. We perform extensive experiments, and the results demonstrate that the proposed UGAN produces superior results over the state-of-the-art StarGAN on three face editing tasks, including face aging, makeup, and expression editing. The source code will be made publicly available.


1 Introduction

Image translation [16] has received growing attention in computer vision; it aims to translate images from one domain to another. In this paper, we take face editing as a concrete example of image translation. As shown in Figure 1, the facial age, makeup, and expression of the input face are altered. The input and output images are also called the source and target images, respectively, indicating that they come from the source and target domains.

Figure 1: Results of StarGAN and UGAN on three image editing tasks. The first column shows input images. In the face aging case, when translating a face to a child age group, the result of StarGAN still looks like an adult while that of UGAN is more like a child, with big eyes and smooth skin. When translating the face to another young age group, the result of UGAN also looks more like a juvenile. Similar observations can be made in the expression and makeup editing tasks.

Multi-domain translation refers to performing image translation among multiple domains. For example, in the task of face aging with multiple age groups, each age group can be viewed as a domain. Given a face from one age group, we aim to translate it to any other age group using a single translator. One of the representative works for multi-domain image-to-image translation is StarGAN [4]. Though StarGAN is effective, its translated results still suffer from the so-called phenomenon of source retaining, i.e., the output image retains characteristics of the source domain. See the first row of Figure 1 for an exemplary case of face aging: when translating an adult female face to a child age group, the result of StarGAN still looks like an adult. In makeup editing, shown in the second row of Figure 1, StarGAN fails to completely eliminate the eye shadow when removing makeup. A similar observation can be made for expression editing, as shown in the third row of Figure 1: the results of StarGAN show visible teeth shadows around the mouth area.

Figure 2: The domain classifier of the discriminator in StarGAN is easily deceived on the face aging task. First row: given an adult face from the test set, the domain classifier of the discriminator successfully recognizes the corresponding age group. Second row: the adult face is translated into a child age group, and the translated face heavily retains adult characteristics, including beard and expression wrinkles. However, the classifier still does not identify the incompatible characteristics and is entirely fooled by the translated face.

A well-trained domain classifier of the discriminator in StarGAN is still easily deceived by non-qualified translated samples, even though the domain classifier has good recognition accuracy on real data. As shown in the first row of Figure 2, the discriminator correctly judges an adult face to be within its true age group. After the adult face is translated into a child age group, the translated face heavily retains adult characteristics, e.g., beard and expression wrinkles (see the second row of Figure 2). However, the discriminator judges it to be within the target age group with high confidence. Such synthetic samples with incompatible characteristics are abundant. We argue that the traditional classifier is not sensitive to combinations of incompatible characteristics and does not provide good guidance to the generator for synthesizing conditional samples. Therefore, a classifier that is harder to deceive is needed, one that is highly sensitive to incompatible domain characteristics.

To this end, we propose Untraceable GAN (UGAN) to tackle source retaining explicitly. Take the face aging task as an example, and suppose we aim to translate a face from an adult source domain to a child target domain. The images in the source domain are characterized by beard and wrinkles, while those in the target domain have smooth skin and round faces. Both domains also share some common characteristics, like a smile, which are not altered in the face aging process. The goal of UGAN is to erase all the source-only characteristics and inject certain target-only ones.

To achieve this goal, the discriminator D judges which domain the synthesized image is translated from. When training D, real images are regarded as translated from their ground-truth age domains, while synthesized images are translated from the age domains of their corresponding input images. If the synthesized result produced by the translator G still contains the characteristics of the adult domain, such as beard and wrinkles, D can easily identify that it is translated from the adult age group. In order to fool D into predicting it as the target domain, G needs to erase all the source-only characteristics and inject certain target-only characteristics. In this way, the source domain becomes "untraceable", from which our method takes its name. In contrast, in traditional GANs the discriminator only distinguishes which domain an image is sampled from; it is easily fooled by a translator that simply generates certain target-only characteristics of the target domain and does not explicitly erase the adult characteristics.

Our contributions can be summarized as follows.

  • To the best of our knowledge, UGAN is the first work to improve image translation by explicitly erasing the characteristics of the source domain.

  • The proposed UGAN distinguishes which domain an image is translated from, and can well handle the common source retaining problem in image translation.

  • Extensive qualitative and quantitative experiments on three face editing tasks, including facial expression editing, face aging, and facial makeup generation, demonstrate the superiority of UGAN over state-of-the-art GANs. The source code will be made publicly available.

2 Related Work

Generative Adversarial Network [11] is a popular generative model that uses an adversarial loss to train a generator G and a discriminator D. The generator G produces fake data to fool the discriminator D, while D distinguishes fake data from real data. GANs have been further improved with coarse-to-fine strategies [7, 19], normalization of the discriminator or generator [1, 13, 27, 3], advanced adversarial losses [25, 29, 2, 40, 18, 33], etc. In this work, UGAN adopts the adversarial loss proposed by WGAN-gp [13], which approximately minimizes the Wasserstein distance between the synthesized distribution and the real distribution.

Figure 3: An overview of UGAN. 1) The discriminator D should not only distinguish whether the input sample is real or fake but also determine which domain the sample is translated from. 2) Translator G is trained to fool D by synthesizing realistic images of the target domain.

Conditional GAN [26] is adopted to model multi-domain image-to-image translation in this work. In a conditional GAN, G takes a condition as one of its inputs and synthesizes data corresponding to that condition. The discriminator D, on the other hand, determines whether the given joint distribution of data and condition is real or fake. The vanilla cGAN [26] directly concatenates the data and the condition, and D is trained to distinguish the authenticity of the given pair. In IcGAN [31], conditions are further concatenated with the hidden features of G and D. Moreover, an auxiliary classifier is adopted in the discriminator of AC-GAN [30], and the log-likelihood of the classifier is maximized on both generated and real data. Miyato et al. proposed the cGAN with projection discriminator [28], which significantly improves the quality of conditional image generation on ImageNet [6]. In our work, condition, domain, and attribute are treated equivalently. For example, in face aging, an age group is viewed as a domain and also as a condition. Inspired by AC-GAN, we adopt an auxiliary classifier in the discriminator. However, using such a discriminator for image-to-image translation brings about the phenomenon of source retaining. We therefore change the role of this auxiliary classifier in UGAN and make it classify which source domain the given datum is translated from, instead of which domain the given datum is sampled from as in AC-GAN.

Image-to-Image Translation is first defined in pix2pix [16], which works well on paired data. Later, the performance of image translation was improved from various aspects. Skip connections from bottom to top with addition or multiplication operations are used to maintain useful input image information [38, 24, 32]. Some works adopt cascade training from coarse to fine to synthesize images with higher quality and resolution [35, 5]. Choi et al. [4] improved image translation by using extra relevant data for training. Other effective methods include a buffer of historical fake images [34], multiple discriminators [35], 3D technology [36], variational sampling [42, 10], etc. Since paired data are often unavailable, a cycle consistency loss is used to constrain the correspondence on unpaired data in CycleGAN [41], DualGAN [37], and DiscoGAN [20]. If a translator only models directed translation between two domains, a separate translator is required for every ordered pair of domains, so a single conditional translator for multi-domain translation is highly desirable. Our work focuses on unpaired multi-domain translation with such a single translator and tackles the phenomenon of source retaining that frequently happens to single-translator multi-domain models.

3 Our Method

The framework of UGAN is shown in Figure 3. The input image and the target condition are fed into the translator G. The discriminator D has two heads: one head is the authenticity classifier, which is responsible for distinguishing whether the input sample is real or fake; the other is the source classifier, which aims to determine which domain the sample is translated from, where real data are regarded as translated from their own domains. Translator G is supervised by three loss functions. First, G is trained to fool the authenticity classifier of D into classifying the synthetic image as real. Second, G is trained to reconstruct the original input image given the translated image and the source domain label. Third, G is trained to fool the source classifier of D into believing that the synthetic image is translated from the target domain. As the adversarial training goes on, we expect the translated image to become fully compliant with the target domain and the source domain to become untraceable, since all clues from the source domain are erased.

We then introduce the mathematical notation. The discriminator D contains two heads: the authenticity classifier D_adv and the source classifier D_src, which share a feature extraction module F. For brevity, D_adv(F(x)) and D_src(F(x)) are written as D_adv(x) and D_src(x), respectively. (x, c) is a source-domain image-label pair, where x denotes the image and c is its domain label. Feeding the image x and a target label c' into G produces the translated image G(x, c'). We use p(x, c) to denote the joint distribution of images x and domain labels c; p(x) and p(c) are the marginal distributions of images and labels.

3.1 Objective Function

In the following, we elaborate on details of each component of UGAN, including authenticity classifier, cycle consistency, source classifier, and the overall loss function.

Authenticity Classifier D_adv: The adversarial loss of WGAN-gp [13] is adopted to constrain the synthetic joint distribution so that it approximates the real distribution. The third term in Eq. (1) is a gradient penalty that enforces the discriminator to be a 1-Lipschitz function.

L^{D}_{adv} = \mathbb{E}_{x,c'}\big[D_{adv}(G(x,c'))\big] - \mathbb{E}_{x}\big[D_{adv}(x)\big] + \lambda_{gp}\,\mathbb{E}_{\hat{x}}\big[(\|\nabla_{\hat{x}} D_{adv}(\hat{x})\|_2 - 1)^2\big]   (1)
L^{G}_{adv} = -\mathbb{E}_{x,c'}\big[D_{adv}(G(x,c'))\big]   (2)

where x' = G(x, c'), \hat{x} = \epsilon x + (1 - \epsilon) x', and \epsilon \sim U[0, 1].
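
To make the adversarial terms concrete, the following is a minimal PyTorch-style sketch of Eqs. (1) and (2), assuming D_adv returns a scalar critic score per image; the function names and the gradient penalty weight (set to the common WGAN-gp default of 10) are illustrative assumptions rather than the authors' released code.

```python
import torch

def gradient_penalty(d_adv, x_real, x_fake, lambda_gp=10.0):
    """Gradient penalty term of Eq. (1): encourages a 1-Lipschitz critic (lambda_gp assumed)."""
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    score = d_adv(x_hat)
    grad = torch.autograd.grad(outputs=score.sum(), inputs=x_hat, create_graph=True)[0]
    grad_norm = grad.view(grad.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

def d_adv_loss(d_adv, x_real, x_fake):
    """Critic loss of Eq. (1): fake score minus real score plus the gradient penalty."""
    return d_adv(x_fake).mean() - d_adv(x_real).mean() + gradient_penalty(d_adv, x_real, x_fake)

def g_adv_loss(d_adv, x_fake):
    """Translator adversarial loss of Eq. (2)."""
    return -d_adv(x_fake).mean()
```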

Cycle Consistency: Following [41], the input and output are regularized to satisfy the correspondence:

L_{cyc} = \mathbb{E}_{x,c,c'}\big[\,\|x - G(G(x,c'), c)\|_1\,\big]   (3)

For the sake of space, G(x, c') is abbreviated as x' in the following.
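
Continuing the same illustrative sketch, the cycle consistency term of Eq. (3) is simply an L1 reconstruction error after translating to the target domain and back:

```python
def cycle_loss(g, x_real, c_src, c_tgt):
    """Eq. (3): translate to the target domain and back, then penalize the L1 error."""
    x_fake = g(x_real, c_tgt)
    x_rec = g(x_fake, c_src)
    return (x_rec - x_real).abs().mean()
```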

Source Classifier D_src: To tackle the phenomenon of source retaining, D_src is made to classify which domain an image is translated from. For an image-label pair (x, c), we regard the translated image x' = G(x, c') as translated from domain c to domain c'. Since D_src aims to classify where an image is translated from, the real datum x should be classified into c, meaning x is translated from domain c (i.e., from itself). The synthetic image x' should also be classified into c, meaning x' is translated from domain c. The translator G is trained to fool D_src into classifying x' into the target domain c'. In this way, G is trained to make the source domain of x' untraceable while injecting the target domain characteristics into x'. The adversarial training is formulated as follows:

L^{D}_{src} = \mathbb{E}_{x,c}\big[-\log D_{src}(c \mid x)\big] + \lambda_{s}\,\mathbb{E}_{x,c,c'}\big[-\log D_{src}(c \mid x')\big]   (4)
L^{G}_{src} = \mathbb{E}_{x,c'}\big[-\log D_{src}(c' \mid x')\big]   (5)

where \lambda_{s} is the penalty coefficient of source retaining, which is a constant.

It is worth noting that x' should also be injected with certain target-only characteristics. Recall that in Eqs. (4) and (5), G is trained to fool D_src into classifying x' into c'. However, the characteristics of category c' are not pure: they mix the characteristics of real data sampled from domain c' and of synthesized data translated from domain c'. In order to accurately synthesize the characteristics of the target domain, the number of categories of D_src is augmented to 2C, where C is the number of domains. The first C categories correspond to real data sampled (translated) from the corresponding domain, and the latter C categories correspond to fake data translated from the corresponding domain; category C + c means the input datum is fake and is translated from domain c. In addition, the translator G is trained to fool D_src into classifying x' into the c'-th category, i.e., real data translated from the target domain. The adversarial training is conducted via optimizing the following:

L^{D}_{src} = \mathbb{E}_{x,c}\big[-\log D_{src}(c \mid x)\big] + \lambda_{s}\,\mathbb{E}_{x,c,c'}\big[-\log D_{src}(C + c \mid x')\big]   (6)
L^{G}_{src} = \mathbb{E}_{x,c'}\big[-\log D_{src}(c' \mid x')\big]   (7)

In this process, D_src is trained to identify whether x' is a fake image and from which domain it is translated. G is trained to approach a truly untraceable translator.
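
Under the same assumed notation, the sketch below shows one way the source-classifier targets of Eqs. (4)-(7) could be constructed: real images keep their own domain label c, fake images are labeled as translated from c (UGAN-S, C categories) or pushed into the augmented fake category C + c (UGAN, 2C categories), and the translator is always trained toward the target category c'. Names and the value of lambda_s are illustrative.

```python
import torch.nn.functional as F

def d_src_loss(d_src, x_real, c_real, x_fake, num_domains, lambda_s=1.0, augmented=True):
    """Source-classifier loss for D (Eqs. (4) and (6)).

    Real images are labeled with their own domain c; fake images are labeled as
    translated from c (UGAN-S) or as the fake counterpart C + c (UGAN).
    lambda_s is the source-retaining penalty coefficient (value assumed).
    """
    loss_real = F.cross_entropy(d_src(x_real), c_real)
    fake_target = c_real + num_domains if augmented else c_real
    loss_fake = F.cross_entropy(d_src(x_fake.detach()), fake_target)
    return loss_real + lambda_s * loss_fake

def g_src_loss(d_src, x_fake, c_target):
    """Source-classifier loss for G (Eqs. (5) and (7)): fool D_src into believing
    the synthetic image is a real image translated from the target domain."""
    return F.cross_entropy(d_src(x_fake), c_target)
```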

Overall Loss Function: D and G are trained by optimizing

L_D = L^{D}_{adv} + L^{D}_{src}   (8)
L_G = L^{G}_{adv} + \lambda_{cyc} L_{cyc} + L^{G}_{src}   (9)

where L_{src} could be the C-category loss of Eqs. (4)-(5) or the 2C-category loss of Eqs. (6)-(7). We denote the version adopting Eqs. (4)-(5) as UGAN-S, where "S" means "Simple". The training details are illustrated in Algorithm 1. We first train the discriminator with Eq. (8) for n_critic steps. Then the translator is trained with Eq. (9). The two processes are performed alternately until reaching T iterations.

Method A B C D E F G H I J K L M N O P Q R S T U V Mean
StarGAN 52.1 52.6 61.4 51.5 55.9 64.1 57.8 54.1 42.6 52.5 61.7 69.3 55.2 51.9 55.0 63.2 68.0 60.6 69.9 61.0 59.1 61.3 58.2
UGAN-S 44.5 44.4 53.8 46.9 49.9 59.0 47.8 48.5 37.7 43.2 52.9 59.1 53.4 50.4 53.0 45.6 56.6 52.7 48.4 49.0 47.0 46.4 49.6
UGAN 42.8 45.3 48.3 43.7 47.5 56.0 43.6 44.7 37.6 41.4 47.4 52.4 42.9 43.1 48.5 46.3 52.1 46.6 46.6 46.8 45.4 45.0 46.1
Table 1: Intra FID on CFEE dataset.
Method 0-10 11-18 19-30 31-40 41-50 51-60 60+ Mean
CAAE 63.8 64.1 67.6 69.8 75.9 78.7 87.2 72.4
C-GAN 83.9 60.7 54.9 54.7 57.4 61.7 70.2 63.4
StarGAN 59.9 38.2 29.9 41.4 37.3 40.0 46.9 41.9
UGAN-S 42.0 33.6 25.2 27.2 28.9 34.4 40.4 33.1
UGAN 44.0 29.5 21.1 21.3 25.4 28.2 34.7 29.2
Table 2: Intra FID on face aging dataset.
Method Retro Korean Japanese Naked Smoky Mean
StarGAN 110.9 86.2 74.5 84.4 91.9 89.6
UGAN-S 109.4 70.9 61.8 72.8 74.8 78.0
UGAN 101.7 65.9 58.1 64.5 66.3 71.3
Table 3: Intra FID on MAKEUP-A5 dataset.
Require: training images x with domain labels c, discriminator D, translator G, number of iterations T.
Require: initial discriminator parameters θ_D, initial translator parameters θ_G.

for t = 1, ..., T do
     for i = 1, ..., n_critic do
          Sample a mini-batch of data (x, c) ∼ p(x, c),
          Sample target labels c' ∼ p(c),
          Sample an interpolation coefficient ε ∼ U[0, 1].
          x' ← G(x, c'),
          x̂ ← εx + (1 − ε)x'.
          Compute L_D using Eq. (8),
          θ_D ← Adam(∇_{θ_D} L_D, θ_D).
     end for
     Compute L_G using Eq. (9),
     θ_G ← Adam(∇_{θ_G} L_G, θ_G).
end for
Algorithm 1 Training details of UGAN. All experiments use the same default setting of n_critic, λ_gp, λ_cyc, λ_s, the learning rate, β1, and β2.
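
For concreteness, a compact PyTorch-style training step matching Algorithm 1 might look as follows. It reuses the loss sketches above, and the hyper-parameter values (n_critic, lambda_cyc) are placeholders since the paper's defaults are not reproduced here.

```python
import torch

def train_step(g, d_adv, d_src, opt_d, opt_g, loader, num_domains, n_critic=5, lambda_cyc=10.0):
    """One outer iteration of Algorithm 1 (hyper-parameter values assumed)."""
    data_iter = iter(loader)
    # Update the discriminator n_critic times (Eq. (8)).
    for _ in range(n_critic):
        x_real, c_src = next(data_iter)
        c_tgt = c_src[torch.randperm(c_src.size(0))]  # random target domains
        x_fake = g(x_real, c_tgt).detach()
        loss_d = (d_adv_loss(d_adv, x_real, x_fake)
                  + d_src_loss(d_src, x_real, c_src, x_fake, num_domains))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Update the translator once (Eq. (9)).
    x_real, c_src = next(data_iter)
    c_tgt = c_src[torch.randperm(c_src.size(0))]
    x_fake = g(x_real, c_tgt)
    loss_g = (g_adv_loss(d_adv, x_fake)
              + g_src_loss(d_src, x_fake, c_tgt)
              + lambda_cyc * cycle_loss(g, x_real, c_src, c_tgt))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```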

3.2 Discussions

We now explain why the phenomenon of source retaining happens and how UGAN effectively handles it. We briefly review one of the most representative multi-domain translation models, StarGAN. It also has an authenticity classifier D_adv and a cycle consistency constraint L_cyc. The only difference between StarGAN and UGAN is that the domain classifier D_cls of StarGAN classifies which domain an image is sampled from. For StarGAN, the following loss is optimized:

L^{D}_{cls} = \mathbb{E}_{x,c}\big[-\log D_{cls}(c \mid x)\big]   (10)
L^{G}_{cls} = \mathbb{E}_{x,c'}\big[-\log D_{cls}(c' \mid G(x,c'))\big]   (11)

where Eq. (10) trains D_cls to distinguish different domains, while Eq. (11) trains G to fool D_cls into classifying G(x, c') as c'. G(x, c') is thus expected to increasingly fit the target domain. Taking the authenticity classifier and the cycle consistency constraint into consideration, the overall losses of D and G in StarGAN are

L^{StarGAN}_{D} = L^{D}_{adv} + \lambda_{cls} L^{D}_{cls}   (12)
L^{StarGAN}_{G} = L^{G}_{adv} + \lambda_{cyc} L_{cyc} + \lambda_{cls} L^{G}_{cls}   (13)

StarGAN suffers from source retaining for the following reasons. First, the first two terms in Eq. (13) may result in lazy editing, because G(x, c') = x is a trivial solution that satisfies both the adversarial loss and the cycle consistency loss. Second, D_cls in StarGAN is easy to deceive. L^{G}_{cls} is the only term that pushes the translated image towards the target domain; however, its value quickly drops while G(x, c') still does not look like a sample from the target domain and instead resembles an adversarial example [12] of D_cls. Third, D_cls acts like a whitelisting system that injects characteristics of the target domain into G(x, c') without explicitly erasing the source-only characteristics. That is, StarGAN only knows what characteristics the translated image should contain, not what it should not contain.

Compared with StarGAN, our proposed UGAN has its unique L^{D}_{src} and L^{G}_{src}. D_src can tell that G(x, c') is translated from the source domain until all source-only characteristics have been erased from it. Analogously, G(x, c') will not be judged as translated from the target domain until it is injected with some target-only characteristics. This configuration not only reduces the source-only characteristics but also pushes the translated samples towards the target domain.

4 Experiment

4.1 Datasets

Method A B C D E F G H I J K L M N O P Q R S T U V
StarGAN 28.0 19.5 21.3 18.9 29.4 15.3 31.1 19.3 23.8 18.4 20.8 23.8 35.7 31.6 20.6 14.3 18.8 28.8 38.1 22.0 19.3 15.5
UGAN 72.0 80.5 78.7 81.1 70.6 84.7 68.9 80.7 76.2 81.6 79.2 76.2 64.3 68.4 79.4 85.7 81.2 71.2 61.9 78.0 80.7 84.5
Table 4: AMT results on CFEE dataset (%).
Method 0-10 11-18 19-30 31-40 41-50 51-60 60+
StarGAN 13.2 36.8 39.5 42.1 36.8 15.8 18.4
UGAN 86.8 63.2 60.5 57.9 63.2 84.2 81.6
Table 5: AMT results on face aging dataset (%).
Method Retro Korean Japanese Naked Smoky
StarGAN 27.6 40.9 21.4 14.7 30.5
UGAN 72.4 59.1 78.6 85.3 69.5
Table 6: AMT results on MAKEUP-A5 dataset (%).
Method Age Group Gap
StarGAN 0.757 0.742 0.745 0.719
UGAN 0.743 0.716 0.715 0.699
Table 7: Cosine similarity of ResNet-18 hidden features between source images and the corresponding translated images, for different age group gaps (lower means less source retaining).

Face aging dataset is collected by C-GAN [24]. Ages are divided into seven age groups: 0-10, 11-18, 19-30, 31-40, 41-50, 51-60, and 60+. A portion of the dataset is randomly selected as the test set, and the rest is used for training. All images are aligned and resized to a fixed resolution.

MAKEUP-A5 is a makeup-labeled dataset [23] containing aligned Asian female faces with five makeup categories: retro, Korean, Japanese, naked, and smoky. Part of the images form the training set and the rest are used as the test set. All images are resized to a fixed resolution.

CFEE is an expression database [9] of 22 facial expressions collected from multiple identities. The categories of facial expressions include (A) neutral, (B) happy, (C) sad, (D) fearful, (E) angry, (F) surprised, (G) disgusted, (H) happily surprised, (I) happily disgusted, (J) sadly fearful, (K) sadly angry, (L) sadly surprised, (M) sadly disgusted, (N) fearfully angry, (O) fearfully surprised, (P) fearfully disgusted, (Q) angrily surprised, (R) angrily disgusted, (S) disgustedly surprised, (T) appalled, (U) hatred, and (V) awed. We center-crop and resize the images, and randomly select a subset of identities as the test set and use the rest for training.

4.2 Measurements and Baselines

Intra FIDs [15, 28, 8] on each domain and their mean are used for evaluation. FID is a common quantitative measure for generative models, which measures the 2-Wasserstein distance between the real and synthetic distributions on features extracted from an InceptionV3 model. It is defined as [8]

\mathrm{FID}(p_r, p_g) = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)   (14)

where p_r and p_g are the distributions of InceptionV3 features of real data and synthetic data, and (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the means and covariances of p_r and p_g. The mean intra FID is calculated by

\mathrm{mean\ intra\ FID} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{FID}_c   (15)

where c is the domain label among the total of C domains and \mathrm{FID}_c is the FID computed within domain c.
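
A small NumPy/SciPy sketch of Eqs. (14) and (15), assuming InceptionV3 features have already been extracted for the real and synthetic images of each domain; it follows the standard FID formula rather than any particular evaluation codebase.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real, feat_fake):
    """Eq. (14): 2-Wasserstein distance between Gaussian fits of the two feature sets."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_fake, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):  # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean))

def mean_intra_fid(features_by_domain):
    """Eq. (15): average the per-domain (intra) FIDs over all C domains."""
    return float(np.mean([fid(real, fake) for real, fake in features_by_domain]))
```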

User studies by Amazon Mechanical Turk (AMT): Given an input image, target domain images translated by different methods are displayed to the Turkers who are asked to choose the best one.

Cosine similarity: For the face aging task, cosine similarity between the features of real images and the corresponding translated images is used to measure the degree of source retaining, where the features are extracted by a ResNet-18 [14] trained on the same training set.
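
The source-retaining metric reported in Table 7 could be computed roughly as follows, assuming penultimate-layer ResNet-18 features have been extracted for each source/translated image pair (names are illustrative):

```python
import numpy as np

def mean_cosine_similarity(src_feats, trans_feats, eps=1e-8):
    """Average cosine similarity between source and translated image features;
    lower values indicate that more source characteristics were erased."""
    src = src_feats / (np.linalg.norm(src_feats, axis=1, keepdims=True) + eps)
    tra = trans_feats / (np.linalg.norm(trans_feats, axis=1, keepdims=True) + eps)
    return float(np.mean(np.sum(src * tra, axis=1)))
```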

Baselines: StarGAN [4] has shown better performance than DIAT [22], CycleGAN [41], and IcGAN [31]. We therefore select StarGAN as our baseline to verify the superiority of our method. For the face aging task, we additionally compare with two classic GAN-based face aging methods, CAAE [39] and C-GAN (without the transition pattern network) [24].

4.3 Implementation Details

For a fair comparison, our learning rate is fixed, while the network architecture and other hyper-parameters are kept the same as StarGAN. Specifically, the architecture of the translator is adopted from Johnson et al. [17], which is composed of residual blocks, stride-2 convolutions, and transposed convolutions to downsample and upsample features. For the discriminator, a 2C-dim (C-dim) source classifier is added to PatchGANs [16] for UGAN (UGAN-S), where C is the number of domains. All experiments are optimized by Adam with fixed β1 and β2. The discriminator is iterated n_critic times per iteration of the translator. All baselines and our methods are trained for the same number of epochs with a fixed mini-batch size. All images are horizontally flipped at random as data augmentation.
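
As an illustration of this discriminator layout, the sketch below attaches a 2C-way (or C-way for UGAN-S) source-classifier head to a PatchGAN-style convolutional trunk; the layer sizes are assumptions in the spirit of StarGAN's discriminator, not the paper's exact architecture.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """PatchGAN trunk with an authenticity head and a source-classifier head."""
    def __init__(self, num_domains, img_size=128, conv_dim=64, n_layers=6, augmented=True):
        super().__init__()
        layers, dim = [], conv_dim
        layers += [nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.LeakyReLU(0.01)]
        for _ in range(n_layers - 1):
            layers += [nn.Conv2d(dim, dim * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.01)]
            dim *= 2
        self.trunk = nn.Sequential(*layers)
        k = img_size // 2 ** n_layers
        self.adv_head = nn.Conv2d(dim, 1, 3, stride=1, padding=1)  # authenticity (patch) scores
        n_classes = 2 * num_domains if augmented else num_domains  # 2C for UGAN, C for UGAN-S
        self.src_head = nn.Conv2d(dim, n_classes, k, bias=False)   # source-classifier logits

    def forward(self, x):
        h = self.trunk(x)
        return self.adv_head(h), self.src_head(h).view(x.size(0), -1)
```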

Figure 4: Comparison of face aging synthesis results on the face aging dataset.
Figure 5: Comparison of makeup synthesis results on the MAKEUP-A5 dataset.
Figure 6: Comparison of facial expression editing results on the CFEE dataset.

4.4 Quantitative Experiments

Given a target domain label, we traverse all images in the test set to generate fake images. All the synthetic images of each domain are used to calculate the intra FID and classification accuracy, while a random subset of synthetic images from each domain is evaluated by AMT.

Face Aging: The comparison of results on the face aging dataset is shown in Table 2. Face aging involves deformation and texture synthesis. For example, deformations, such as face shape and eye size, are the main differences between babies and adults. Texture synthesis, like adding wrinkles, is also essential when translating a middle-aged man to a senior man. In Table 2, both UGAN and UGAN-S are significantly better than StarGAN on all age groups, and UGAN achieves the best performance. The mean intra FID drops from 41.9 (StarGAN) to 29.2 (UGAN), a relative drop of more than 30%.

Makeup Editing: The comparison of results on the MAKEUP-A5 dataset is shown in Table 3. Both texture and color need to be altered in makeup editing. UGAN has the best performance in all categories. The mean intra FID declines from 89.6 (StarGAN) to 71.3 (UGAN).

Expression Editing: The comparisons on the CFEE dataset are shown in Table 1. The expression editing task aims to change the emotion of a face by deformation. The CFEE dataset contains 22 kinds of fine-grained expressions, which makes the expression editing problem very challenging. From the results, we can conclude that UGAN again achieves the best performance. The mean intra FID is 58.2 (StarGAN), 49.6 (UGAN-S), and 46.1 (UGAN), respectively; the reduction is significant.

AMT User Studies: For further evaluation, user studies are conducted on AMT (https://www.mturk.com/) to compare StarGAN and our method. Since UGAN outperforms UGAN-S in terms of mean intra FID, only UGAN is compared. With the datasets mentioned above, we synthesize image pairs per domain using UGAN and StarGAN. All image pairs are shown to Turkers, who are asked to choose the better one considering image realism and satisfaction of target characteristics. Tables 4, 5, and 6 show the percentage of cases in which our method beats StarGAN. For example, in Table 5, when changing a face to the 0-10 age group, StarGAN wins in 13.2% of cases while our method wins in 86.8%. It again shows the advantage of our method when transforming a face into childhood. Generally, our method is better than StarGAN on every category of each dataset.

Tackling the phenomenon of source retaining: The effect of erasing source characteristics on face aging is shown in Table 7. A well-trained ResNet-18 (for age recognition) is adopted to extract features from the second-to-last layer. We calculate the average cosine similarity between the neural features of all source-image/translated-image pairs from the test set. Intuitively, the smaller the similarity, the more thoroughly source characteristics are erased. Since the images of adjacent age groups are similar, we only consider translation across a large age gap, e.g., across three age groups. In Table 7, we perform the experiments on multiple age group gaps, and the similarities of UGAN are smaller for all age group gaps.

4.5 Qualitative Experiments

Face Aging: Results on the face aging dataset are shown in Figure 4. In the first example, the input image is a middle-aged man. Comparing the results for the youngest age group (second column), our result has obvious childish characteristics, e.g., a round face, big eyes, and a small nose, while the result of StarGAN does not look like a child. Another example is the oldest age group (last column): our result has white hair and wrinkles, while StarGAN still produces a middle-aged face. Similar observations can be drawn from the second example with a female input. These results show that UGAN can explicitly erase the characteristics of the source image via the source classifier in the discriminator.

Makeup editing: Two exemplary results on the MAKEUP-A5 dataset are displayed in Figure 5. For the first woman, comparing the results of the second (retro) and last (smoky) columns, the blusher and eye shadow of UGAN are more natural, while StarGAN draws asymmetrical blusher and strange eye shadow. The result of UGAN is also relatively natural when translating to the naked (no-makeup) style. We therefore conclude that UGAN has learned precise color and texture characteristics of different makeups.

Expression editing: Results on the CFEE dataset are demonstrated in Figure 6. We have the following observations. First, UGAN can edit all 22 kinds of fine-grained facial expressions well. Also, UGAN captures the subtle differences between basic and compound expressions; for example, "Happily surprised" has bigger eyes and raised eyebrows compared to "Happy". Besides, the results of StarGAN under various expressions still retain the original expressions. For example, when changing the man from "Hatred" to "Happy", the result of StarGAN still has tight brows. Comparatively, UGAN effectively synthesizes the "Happy" expression by generating a grin and relaxed brows while erasing the tight brows.

5 Conclusion

The phenomenon of source retaining often occurs in the image-to-image translation task. We propose Untraceable GAN to tackle it, where the discriminator estimates which domain the datum is translated from. The translator G is trained to fool the discriminator into believing that the generated datum is translated from the target domain, so that source-only characteristics are erased accordingly. In this way, the source domain of the synthesized image is untraceable. Extensive experiments on three tasks prove our significant advantages over the state-of-the-art StarGAN. More results can be found in the supplementary material.

The phenomenon of source retaining and the idea of UGAN are universal. For example, language translation [21] often preserves the grammatical structure of the source language, and UGAN may serve as a way to improve translation quality. We plan to study this idea in depth and apply it to broader fields.

References