Local Facial Attribute Transfer through Inpainting

by   Ricard Durall, et al.

The term attribute transfer refers to the task of altering images in such a way that the semantic interpretation of a given input image is shifted in an intended direction, which is quantified by semantic attributes. Prominent example applications are photorealistic changes of facial features and expressions, like changing the hair color, adding a smile or enlarging the nose, or altering the entire context of a scene, like transforming a summer landscape into a winter panorama. Recent advances in attribute transfer are mostly based on generative deep neural networks, using various techniques to manipulate images in the latent space of the generator. In this paper, we present a novel method for the common sub-task of local attribute transfers, where only parts of a face have to be altered in order to achieve semantic changes (e.g. removing a mustache). In contrast to previous methods, where such local changes have been implemented by generating new (global) images, we propose to formulate local attribute transfers as an inpainting problem. By removing and regenerating only parts of images, our Attribute Transfer Inpainting Generative Adversarial Network (ATI-GAN) is able to utilize local context information, resulting in visually sound results.




1 Introduction

Generative deep learning is a rapidly growing field in which recent works have shown remarkable success in different domains. In particular, the computer vision community has witnessed a dramatic improvement in a large variety of tasks, ranging from image synthesis [27, 12, 3] to image-to-image translation [10, 4, 19]. The latter task poses the problem of translating images from one domain to another, including style transfer [11, 29, 13], inpainting [24, 16, 25, 26], attribute transfer [14, 6, 4], and others.

Figure 1: Image-to-image translation results on the CelebA dataset. The first column shows the original images, the second the input masked images, the third the inpainted translation and the remaining columns are attribute transfer results (eyeglasses, smiling and mustache). Note that results from attribute translation are opposite of the original.

The objective of attribute transfer is to synthesize new and realistic-looking images for a pre-defined target domain. For instance, Fig. 1 row 6 shows a non-smiling man with a mustache wearing eyeglasses (these are the given attributes, a.k.a. domains), and the output results show how these attributes have been changed one at a time according to our target attribute domain. We refer to a domain as a set of images sharing the same attributes. Such attributes are meaningful semantic features of an image, such as a mustache or a face with eyeglasses. Like other image-to-image translation tasks, attribute transfer methods have achieved impressive progress by implementing different variants of GANs [6, 4, 13], leading to state-of-the-art results in the field. Nevertheless, these attribute transfer approaches are mostly based on the global manipulation of the GAN latent space. As a result, in order to produce good transfer results, the aforementioned methods require additional inverse generator paths (which tends to make them less stable) and can be quite cumbersome.

Image inpainting or completion refers to the task of inferring locally missing or damaged parts of an image. It has been applied to many different applications like photo editing, restoration of damaged paintings, image-based rendering and computational photography. The main challenge of image inpainting is to synthesize realistic pixels for the missing regions that are coherent with the existing ones. Image inpainting techniques mostly fall into two groups according to their basic approach. The first group uses local methods [2, 9] based on low-level feature information, such as color or texture, to attempt to solve the problem. The second group relies on recognizing patterns in images, e.g. with deep convolutional neural networks (CNNs), to predict pixels for the missing regions. CNN-based models deal with both local and global features and can, in combination with generative adversarial networks (GANs) [7], produce realistic inpainted outputs. The introduction of GANs has inspired recent works [20, 24, 16, 10, 25] which formulate the inpainting task as a conditional image generation problem, using a generator for inpainting and a discriminator for evaluating the result.

1.1 Contributions

The contribution of this paper is a novel attribute transfer approach that alters given natural images in such a way that the output image meets the pre-defined visual attributes. To do so, our proposed architecture integrates an inpainting block. In particular, we take advantage of the fact that most facial attributes are induced by local structures (e.g. relative position between eyes and ears). Hence, it is sufficient to change only parts of the face, while the remaining parts can be used to force the generator into realistic outputs. Note that the hole/mask (in terms of inpainting) is generated by our method to apply this “trick”. The proposed ATI-GAN model integrates inpainting for local attribute transfer in a single end-to-end architecture with three main building blocks. First, an inpainting network takes masked images as input and outputs realistically restored images. Second, a generator network takes these inpainted images and encoded attributes (e.g. one-hot vector) as input, and learns how to separate attribute information from the rest of the image representation. Finally, a third network, which acts as a discriminator, judges if the overall result looks realistic.

Evaluations of our model on the CelebA [17] dataset of faces demonstrate the capacity of ATI-GAN to produce high quality outputs. Quantitative and qualitative results show superior inpainting and attribute transfer performance.

2 Related Work

In computer vision, deep learning approaches have contributed heavily to many semantic image understanding tasks. In this section, we briefly review publications related to our work in each of the different sub-fields. In particular, since our proposal is based on GANs and image translation, more specifically on inpainting and attribute transfer, we review the seminal work in that direction.

(a) Training the reconstructor $R$.
(b) Training the discriminator $D$.
(c) Training the generator $G$.
Figure 2: Overview of the ATI-GAN network during training. It is composed of a reconstructor $R$, a generator $G$ and a discriminator $D$, which are trained independently. (a) $R$ takes the masked image as input and reconstructs a realistic image, which is judged by $D$. (b) $D$ takes real and fake images as inputs and learns to distinguish between them at global and patch levels. Furthermore, it also learns to classify them into their corresponding domain. (c) $G$ takes both the output from $R$ and the target domain label as inputs and generates the domain-transformed image. Then, the transformed image and the original domain label are fed into $G$ (creating a loop). After that, both outputs from $G$ (from the two iterations) are passed to $D$ one at a time.

2.1 Generative Adversarial Networks

Generative adversarial networks [7] have arisen as a reliable framework for deep generative models. They have shown remarkable results in various computer vision tasks like image generation [27, 12], style transfer [11, 29], inpainting [20, 24, 16, 10, 25] and attribute transfer [14, 6, 4]. The vanilla GAN model consists of two networks, a generator $G$ and a discriminator $D$. Its procedure can be seen as a minimax game between $G$, which learns how to generate samples resembling real data, and $D$, which learns to discriminate between real and fake data. Throughout this process, $G$ indirectly learns how to model the input image distribution $p_{data}$ by taking samples $z$ from a fixed distribution $p_z$ (e.g. Gaussian) and forcing the generated samples $G(z)$ to match the input images $x$. The objective loss function is defined as

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$$

The conditional GAN approach [18] has progressed rapidly and has become an essential ingredient of recent research. The intuition behind this kind of GAN is to insert class information into the model in order to generate samples that are conditioned on the class. In this work, we take advantage of this property and encode the attribute characteristics as conditional information which is fed into the model.
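One common way to feed such conditional information into a convolutional model is to tile the attribute vector over the spatial grid and append it to the image as extra channels. The following is a minimal sketch of that idea, not necessarily the encoding used by ATI-GAN; the function name `concat_label`, the image size and the attribute order are assumptions made for illustration.

```python
import numpy as np


def concat_label(image, label):
    """Append an attribute vector to an image as constant feature maps.

    image: array of shape (H, W, C); label: array of shape (K,).
    Returns an array of shape (H, W, C + K).
    """
    h, w, _ = image.shape
    # Repeat the K label entries at every spatial position.
    label_maps = np.broadcast_to(label.astype(image.dtype), (h, w, label.shape[0]))
    return np.concatenate([image, label_maps], axis=-1)


# Hypothetical example: an RGB image and four binary attributes.
image = np.zeros((128, 128, 3), dtype=np.float32)
label = np.array([0, 1, 0, 1])
conditioned = concat_label(image, label)
print(conditioned.shape)
```

Every pixel of the conditioned tensor carries the same label entries, so each convolution sees the target attributes regardless of its spatial position.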

2.2 Image Inpainting

Classical inpainting methods are often based on either local or non-local information to rebuild the patches. Local methods [2, 9] attempt to solve the problem using only context information, such as color or texture, i.e. matching, copying and merging background patches into holes by propagating the information from hole boundaries. These approaches need very little training or prior knowledge, and provide good results especially in background inpainting tasks. However, they do not perform well for large patches because they cannot generate novel image content. More powerful methods are global, content-based and semantic inpainting approaches. Even though these techniques require more expensive dense patch computations, they can handle larger patches successfully. In many models, CNN-based approaches have become the de facto implementation due to their capability to learn to recognize patterns in images and use them to fill holes in images.

GANs for image translation [20, 24, 16, 10] have emerged as a promising paradigm for inpainting tasks. They are already able to produce realistic synthetic outputs at high image resolution [25, 26]. To reach this point, GAN techniques have evolved intensively over the past few years. Initially, the inpainting task was formulated as a conditional image generation problem, consisting of one generator and one discriminator. However, more recent works [10, 25] have introduced the concept of global and local discriminators. Furthermore, apart from modifying the topology of the discriminator, recent methods [10, 16, 20, 25] have adopted new losses such as Wasserstein [1] or Wasserstein with gradient penalty [8]. Inspired by all of these works, our approach also leverages the conditional adversarial framework with global and local discriminators together with the Wasserstein loss with gradient penalty.

2.3 Attribute Transfer

Small and unbalanced datasets can cause severe problems when training a machine learning model. Recently, numerous works have turned their attention towards transferring visual attributes, such as color [28], texture [11, 29], facial features [14, 6, 4] and more, for data augmentation. However, although most of the approaches correctly synthesize new attributes belonging to the target domain, it is still very challenging to generalize attributes between different applications since the approaches are usually designed to transfer a specific type (e.g. facial expressions, facial attributes, or even colors).

GAN-based approaches for image translation have been actively studied. One of the first proposals [11] capable of learning consistent image domain transforms employed pairs of images to create models that convert from the original to the target domain (e.g. segmentation labels to the original image). Unfortunately, this system requires that both source images and target images exist as pairs in the training dataset in order to learn the transformation between domains. Several works [29, 4] have tried to address this drawback. They suggested using the virtual result in the target domain: if the virtual result is inverted again, the inverted result must match the original image. In these works, the framework can flexibly control the image translation into different target domains.

3 Method

In the following section, we describe the ATI-GAN approach, which addresses image-to-image translation for facial attribute transfer. We explain the training of the reconstructor, generator and discriminator in detail, showing that our model trains in an introspective manner, such that it can estimate the difference between the generated (fake) samples and the real samples, and finally update itself to produce more realistic samples.

3.1 Model Architecture

The network architecture of our proposal is depicted in Fig. 2. It is divided into an inpainting network (reconstructor, Fig. 2(a)), a generative network (generator, Fig. 2(c)) and a discriminative network (discriminator, Fig. 2(b)). By combining these blocks sequentially, the ensemble model is able to perform end-to-end attribute transfer.

Figure 3: Structure of the reconstructor $R$, with the masked image as input and the reconstructed image as output. It also depicts the modification applied to the reconstructed image to obtain the modified output, which is fed into $G$.

Reconstructor. Given an image and its masked counterpart, the reconstructor $R$ takes the masked image and tries to fill the large missing region with plausible content so that the output looks realistic (see Fig. 3). To achieve this objective, the reconstructor is backed by the discriminator $D$ (as in a vanilla GAN setting), which assesses the reconstructed images. We can formulate the inpainting training process as a minimization problem over three terms: a reconstruction loss $\mathcal{L}_1$, an adversarial loss $\mathcal{L}_{adv}$ and a classification loss $\mathcal{L}_{cls}$.

On the one hand, the reconstruction loss $\mathcal{L}_1$ constrains the reconstructed image by minimizing the absolute differences between the estimated values and the existing target values. It is defined over two regions, where the subindices ct and p refer to contour and patch respectively. These concepts are illustrated in Fig. 4(a), where a detailed visual example is depicted.

We apply a separate distance norm to the contour and to the patch. We treat them differently because the patch loss does not have to be strictly 0: synthetic images that are not exactly equal to the original are also a valid solution, if and only if they look realistic and fit the contour.
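The split of the reconstruction loss into a contour part and a patch part can be sketched as follows. This is a minimal illustration under stated assumptions: the mean absolute difference is used as the distance, and the weights `w_contour` and `w_patch` as well as the function name are hypothetical rather than taken from the paper.

```python
import numpy as np


def contour_patch_l1(original, reconstructed, mask, w_contour=1.0, w_patch=1.0):
    """Separate mean absolute errors for the unmasked contour and the masked patch.

    original, reconstructed: arrays of shape (H, W, C).
    mask: boolean array of shape (H, W), True inside the masked patch.
    Returns (weighted total, contour loss, patch loss).
    """
    abs_diff = np.abs(original - reconstructed)
    patch_loss = abs_diff[mask].mean()      # region the network had to invent
    contour_loss = abs_diff[~mask].mean()   # region that was visible in the input
    total = w_contour * contour_loss + w_patch * patch_loss
    return total, contour_loss, patch_loss


# Toy example: the reconstruction is perfect on the contour and off by 0.5 in the patch.
original = np.zeros((8, 8, 3))
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
reconstructed = original.copy()
reconstructed[mask] += 0.5
total, contour_loss, patch_loss = contour_patch_l1(original, reconstructed, mask)
print(total, contour_loss, patch_loss)
```

Keeping the two terms separate makes it possible to weight the contour strictly while tolerating a non-zero patch error, which matches the reasoning given above.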

On the other hand, the adversarial loss $\mathcal{L}_{adv}$ penalizes unrealistic images and the classification loss $\mathcal{L}_{cls}$ (see Eq. 9) penalizes incorrect image-domain transformations. For these two cases, the ideal solution converges to 0 loss. Further details are discussed in the following subsections.

In terms of topology, we adopt the coarse network architecture introduced in [25]. Since the size of the receptive field is a decisive factor in inpainting tasks, we use dilated convolutions to guarantee a sufficiently large size. Additionally, we use mirror padding for all convolution layers and exponential linear units (ELUs) as activation functions.
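To see why dilation helps, the receptive field of a stack of convolutions can be computed in a few lines. The sketch below uses the standard receptive-field recurrence; the kernel sizes and dilation rates are illustrative assumptions and are not the values of the network in [25] or of ATI-GAN.

```python
def receptive_field(kernel_sizes, dilations, strides=None):
    """Receptive field (in input pixels) of a stack of convolution layers."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, d, s in zip(kernel_sizes, dilations, strides):
        # A dilated kernel spans (k - 1) * d + 1 positions of its input,
        # and each of those positions is `jump` input pixels apart.
        rf += (k - 1) * d * jump
        jump *= s
    return rf


# Four 3x3 layers without dilation versus with growing dilation rates.
plain = receptive_field([3, 3, 3, 3], [1, 1, 1, 1])
dilated = receptive_field([3, 3, 3, 3], [1, 2, 4, 8])
print(plain, dilated)
```

With the same number of layers and parameters, the dilated stack covers a much larger input region, which is what allows the network to draw on context far from the hole.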

Generator. The role of the generator $G$ is to learn the mappings among multiple attribute domains. To achieve this goal, we use the modified output from the reconstructor (see Fig. 4(b)) as input. Then we train $G$ to translate this input into an output image, which is conditioned on the target domain label (see Fig. 5). In the same manner as for $R$, both the adversarial loss $\mathcal{L}_{adv}$ and the classification loss $\mathcal{L}_{cls}$ play important roles in the generator loss definition. Moreover, we have an extra term called the cycle consistency loss $\mathcal{L}_{cyc}$. Previous works [14, 29, 4] have shown how this term helps to create strong paths between latent space and outcomes. In our case, it guarantees that the translated images preserve the content of their input images while changing only the domain-related part of the inputs through the latent space. This term measures how far the image recovered by translating the output back to the original domain is from the generator input.

Finally, joining all terms together, the generator loss combines the adversarial, classification and cycle consistency losses.

Note that the generator performs the entire cyclic translation for every sample, forcing the domain code to be crucial for moving among domains: first it translates an original image into an image in the target domain, and then it recovers the original image from the translated image.
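The cyclic translation described above can be sketched as a function that calls the generator twice and compares the recovered image with the input. The L1 distance is an assumption borrowed from common practice in [29, 4], and the two toy generators are hypothetical stand-ins for a trained network.

```python
import numpy as np


def cycle_consistency_loss(generator, image, source_label, target_label):
    """Translate to the target domain, translate back, and compare with the input."""
    translated = generator(image, target_label)
    recovered = generator(translated, source_label)
    return float(np.mean(np.abs(image - recovered)))


def identity_generator(image, label):
    """Toy generator that ignores the label and preserves all content."""
    return image


def biased_generator(image, label):
    """Toy generator that brightens the image on every pass."""
    return image + 0.1


image = np.zeros((4, 4, 3))
source, target = np.array([0, 1]), np.array([1, 0])
loss_identity = cycle_consistency_loss(identity_generator, image, source, target)
loss_biased = cycle_consistency_loss(biased_generator, image, source, target)
print(loss_identity, loss_biased)
```

A generator that drifts on every pass accumulates error over the cycle and is penalized, while one that preserves content is not.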

(a) Elements of the reconstructor training step: the masked image, the contour from the reconstructed image, the patch from the original image and the patch from the reconstructed image.
(b) The modified output is produced by combining a contour and a patch.
Figure 4: Visual example of the elements that take an active role in the reconstructor training step.
Figure 5: Structure of the generator $G$, with the modified reconstructor output and the attribute code as inputs and the translated image as output.

Regarding the topology, we build our baseline generator on a state-of-the-art image-to-image translation model [4], which is an adaptation of [29]. On top of it, we apply several topology changes to adapt it to our system. As a result, the final generator architecture consists of three convolutional layers, two of them with a stride of two for downsampling, six residual blocks, where in each block dilated layers with different dilation values are added, and two transposed convolutional layers with a stride of two for upsampling. We use instance normalization in the network.

Discriminator. Our discriminator $D$ behaves slightly differently from that of a vanilla GAN. It takes samples of real and generated data and then tries to classify them correctly according to their attribute. Additionally, there is a second built-in classifier that tries to determine the domain of each sample. As a result, the discriminator needs one adversarial loss $\mathcal{L}_{adv}$ that judges the appearance of the images and one classification loss $\mathcal{L}_{cls}$ that classifies the attributes.

The inner structure of our discriminator also differs from standard discriminators. It is split into two fully convolutional topologies, a global discriminator and a patch discriminator, and one final convolutional layer that combines both discriminators' outputs (see Fig. 6). All discriminator layers have a stride of two for downsampling and are followed by LeakyReLU activation functions.

While the global discriminator assesses the semantic consistency of the whole image, the patch discriminator deals with the reconstructed, initially masked part to enforce local consistency. As a consequence, every image is evaluated by two independent loss functions, a global one and a patch one, which together form the joint adversarial loss $\mathcal{L}_{adv}$.


Over the last years, formulations of adversarial loss functions have continuously been changing and improving. GANs based on the Earth-Mover distance loss [1] were one of the first attempts to clearly outperform the vanilla GAN [7]. Consequently, several approaches on image-to-image translation [18, 11, 29] and generative inpainting networks [10, 16, 20] relied on DCGAN [21] for their adversarial supervision. However, more recent research [8] has shown that there are beneficial effects for image generation when adding a gradient penalty instead of weight clipping to enforce the Lipschitz constraint. As a result, a second wave of publications [4, 25] proposes to use WGAN-GP. Following this approach, we write the global and patch adversarial losses in WGAN-GP form, where the gradient penalty is evaluated at points sampled uniformly along a straight line between a pair of real and generated images. The patch loss is analogous to the global loss after replacing the global discriminator with the patch discriminator. (Footnote 1: Note that Eq. 8 corresponds to the case when $R$ or $G$ is updated. For updating $D$, the inputs have to be replaced accordingly.)
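The interpolation and gradient penalty of WGAN-GP [8] can be illustrated with a toy critic whose input gradient is known in closed form. This is only a sketch: the critic $D(x) = \tanh(\langle w, x \rangle)$ is a hypothetical stand-in for the convolutional discriminators, and the penalty weight of 10 is the default suggested in [8], not a value confirmed for ATI-GAN.

```python
import numpy as np


def critic(x, w):
    """Toy critic D(x) = tanh(<w, x>), one score per row of x."""
    return np.tanh(x @ w)


def critic_grad(x, w):
    """Analytic gradient of the toy critic with respect to each input row."""
    return (1.0 - np.tanh(x @ w) ** 2)[:, None] * w[None, :]


def gradient_penalty(real, fake, w, rng):
    """Penalize gradient norms that deviate from 1 at interpolated points."""
    eps = rng.uniform(0.0, 1.0, size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake  # points on the line between pairs
    grad_norm = np.linalg.norm(critic_grad(x_hat, w), axis=1)
    return float(np.mean((grad_norm - 1.0) ** 2)), x_hat


rng = np.random.default_rng(0)
real = rng.normal(size=(4, 3))
fake = rng.normal(size=(4, 3))
w = np.array([0.6, 0.8, 0.0])

penalty, x_hat = gradient_penalty(real, fake, w, rng)
lambda_gp = 10.0  # assumed weight, following [8]
critic_loss = critic(fake, w).mean() - critic(real, w).mean() + lambda_gp * penalty
print(penalty, critic_loss)
```

In a real implementation the gradient would come from automatic differentiation instead of a closed-form expression, and the same routine would be applied once to the global and once to the patch discriminator.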

Figure 6: Illustration of the inner structure of our discriminator. It has two parts: a global discriminator, which discriminates at global scale and produces the global adversarial loss, and a patch discriminator, which does so at patch scale and produces the patch adversarial loss. There is also the classification error (for either real or fake input), which uses the whole architecture to determine the class label. Note that this figure shows an example of $D$ being trained and fed with real data, hence the superindex for real input.

As mentioned above, similarly to [5], our discriminator relies on a second loss function which accounts for domain classification. It computes the binary cross entropy between the domain labels predicted by $D$ and one of two label variants, either the original or the target domain label. Therefore, we can write one classification loss for training the reconstructor or the generator, where the superscript stands for fake input. Note that if we break Eq. 9 down, we can see that the input to $D$ is the result of $G$ when given a reconstructed input image and a target domain label. In the case of training the reconstructor, $D$ is fed directly with the reconstructed image, but this time conditioned on the original domain label. On the other hand, the classification loss of the discriminator is written with a superscript that stands for real input. The procedure is almost the same as for the reconstructor, but now the input to $D$ is the real image.

Good global training relies heavily on the discriminator. On the one hand, through the classification loss, it indirectly forces the generator to produce correct image-domain transformations by penalizing the gradients passed through $D$, and it also learns to output the correct label when fed with real data. On the other hand, it needs to learn to distinguish between real and generated data and to penalize accordingly through the adversarial loss. Therefore, the discriminator's optimization problem combines the adversarial loss with the classification loss on real data.
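The domain classification term is a binary cross entropy between predicted attribute probabilities and a label vector. The sketch below shows that computation in isolation; the attribute order, the probabilities and the clipping constant are assumptions made for illustration.

```python
import numpy as np


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    losses = labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs)
    return float(-np.mean(losses))


# Hypothetical attribute order: eyeglasses, mustache, smiling.
target_label = np.array([1.0, 0.0, 1.0])
good_prediction = np.array([0.9, 0.1, 0.8])
bad_prediction = np.array([0.2, 0.7, 0.3])

loss_good = binary_cross_entropy(good_prediction, target_label)
loss_bad = binary_cross_entropy(bad_prediction, target_label)
print(loss_good, loss_bad)
```

When the discriminator is trained, the labels are those of the real image; when the generator or reconstructor is trained, the same function is evaluated on the discriminator's prediction for the generated image against the label that image is supposed to carry.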


4 Experiments

In this section, we present results for a series of experiments evaluating the proposed method, both quantitatively and qualitatively. We first give a detailed introduction of the experimental setup. Then, we discuss the inpainting outcomes and finally, we review the attribute transfer results.

4.1 Experimental Settings

We train ATI-GAN on the CelebFaces Attributes (CelebA) dataset [17]. It consists of 202,599 celebrity face images with variations in facial attributes. We randomly select 2,000 images for testing and use all remaining images as the training dataset. In training, we crop and resize the original 178x218 pixel images to 128x128 pixels, and we mask them with patches of 52x52 pixels. These masked regions are always centered around the tip of the nose, in most cases occluding a large portion of the face (see Figure 7). All experiments presented in this paper have been conducted on a single NVIDIA GeForce GTX 1080 GPU, without applying any post-processing.
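The masking step can be sketched as follows. The sketch assumes that the mask sits at the geometric center of the 128x128 image, which only approximates "centered around the tip of the nose"; the function names and the fill value are likewise assumptions.

```python
import numpy as np


def center_mask(height=128, width=128, size=52):
    """Boolean mask that is True inside a centered size x size square."""
    top = (height - size) // 2
    left = (width - size) // 2
    mask = np.zeros((height, width), dtype=bool)
    mask[top:top + size, left:left + size] = True
    return mask


def apply_mask(image, mask, fill=0.0):
    """Return a copy of the image with the masked pixels set to a constant."""
    masked = image.copy()
    masked[mask] = fill
    return masked


image = np.ones((128, 128, 3), dtype=np.float32)
mask = center_mask()
masked_image = apply_mask(image, mask)
print(mask.sum(), mask.mean())
```

Under this assumption the hole covers about 16.5% of the pixels, but since it sits on the center of the face it removes most of the facial detail.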

Figure 7: Two visual examples of original images and their masked pairs. Because the mask position is hard-coded, there are cases in which the mask is not suitable (left pair).

4.2 Training

Since our model is divided into three distinguishable parts (reconstruction/inpainting, generative and discriminative), three independent Adam optimizers with fixed $\beta_1$ and $\beta_2$ are used during training. We set the batch size to 16 and run the experiments for 200,000 iterations. We start using the output of the reconstructor as input for the generator after iteration 50K. In this way, we ensure that the gradient updates in the generative model are reliable. We update the generator after every five discriminator updates as in [8, 4]. The learning rate used in the implementation is 0.0001 for the first 10 epochs and is then linearly decreased to 0 over the next 10 epochs. The loss terms are weighted by factors set to 10, 5 and 1 respectively. The training procedure as a whole is described in Algorithm 1.
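The learning rate schedule described above (constant, then linear decay to zero) can be written as a small function. The function name and the choice to index epochs from 0 are assumptions.

```python
def learning_rate(epoch, base_lr=1e-4, constant_epochs=10, decay_epochs=10):
    """Constant learning rate followed by a linear decay to zero."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs


# Print the schedule at a few epochs (epochs are counted from 0 here).
for epoch in (0, 9, 10, 15, 20):
    print(epoch, learning_rate(epoch))
```

The decay starts from the full rate at epoch 10 and reaches zero at epoch 20; later epochs stay at zero.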

Require: the number of iterations; the threshold iteration after which $G$ starts to use the modified output from the reconstructor $R$; the learning rates; the batch size; the number of skipped iterations of the generator per discriminator iteration.
Require: initial discriminator parameters; initial reconstructor parameters; initial generator parameters.
for each iteration do
    Sample a batch from real data
    Sample a batch from masked data
    Train discriminator
    Train reconstructor
    Train generator:
    if the iteration is a generator update step then
        if the threshold has not been reached then
            update the generator without the reconstructor output
        else
            update the generator on the modified output of the reconstructor
        end if
    end if
end for
Algorithm 1: Training of the proposed architecture. The same default values were used in all experiments.
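The update schedule of Algorithm 1 can be sketched without any network code. The sketch assumes that the discriminator and the reconstructor are updated at every iteration, that the generator is updated once every `n_critic` iterations, and that the generator sees original images until the threshold is passed; these readings, and all names below, are assumptions rather than details confirmed by the text.

```python
def training_schedule(num_iters, switch_iter, n_critic):
    """Yield (iteration, networks to update, generator input) for each step."""
    for i in range(1, num_iters + 1):
        updates = ["discriminator", "reconstructor"]
        generator_input = None
        if i % n_critic == 0:
            updates.append("generator")
            # After the threshold, the generator consumes the reconstructor output.
            generator_input = "reconstructed" if i > switch_iter else "original"
        yield i, updates, generator_input


# Small toy run; the paper uses 200,000 iterations, a 50K threshold and n_critic = 5.
steps = list(training_schedule(num_iters=15, switch_iter=7, n_critic=5))
for step in steps:
    print(step)
```

In a full implementation each entry of `updates` would trigger one optimizer step of the corresponding network on the sampled batches.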

4.3 Image Inpainting

The image inpainting problem has a number of different formulations. The definition of interest here is: given that most of the pixels of a face are unobserved because they are masked, our objective is to restore them in a natural way, so that we end up with a plausible and realistic face. In order to achieve good inpainting results, our synthesized faces must fit into these masks/holes, taking into account both the reconstruction quality of the face and how well it blends with the rest of the image. Note that this pixel transformation will also be conditioned on the desired attribute transfer.

As mentioned in [24, 25], it is important to notice that there is no perfect numerical metric for semantic inpainting because an infinite number of solutions are possible. Image inpainting algorithms do not try to reconstruct the ground-truth image, but to fill the masked area with content that looks realistic. As a result, the ground-truth image is only one of many possibilities. Following classical inpainting approaches, we employ the peak signal-to-noise ratio (PSNR) in our evaluation study. However, this metric might oversimplify the comparison since it directly measures differences in pixel values. Therefore, it is usually combined with a second metric called structural similarity (SSIM), which offers a more elaborate and reliable measurement.
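For reference, PSNR is computed from the mean squared error between two images. The sketch below assumes 8-bit images with a peak value of 255; it is a generic implementation, not the evaluation code used for Table 1.

```python
import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


reference = np.full((4, 4), 100, dtype=np.uint8)
estimate = reference + 1  # every pixel is off by one gray level
print(psnr(reference, estimate))
```

Because the score depends only on per-pixel differences, a plausible inpainting that differs from the ground truth can score lower than a blurry one that stays close to it, which is why SSIM is reported alongside.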

Our inpainting task is always carried out under the same conditions, i.e. regenerating missing facial attributes, so our mask is a central square patch in the image. This is the standard crop procedure for CelebA since most of the information lies in the center of the image. Table 1 shows the comparison results for the PSNR and SSIM metrics, where similar works have also reported scores based on centered square crops.

Method          PSNR (dB)   SSIM
SIIWGAN [22]    19.2        0.920
SIIDGM [24]     19.4        0.907
CE [20]         21.3        0.923
GL [10]         23.19       0.936
GntInp [25]     23.80       0.940
GMCNN [23]      24.46       0.944
GL+LID [15]     25.56       0.953
Ours            31.8        0.946
Table 1: Quantitative evaluation in terms of PSNR and SSIM metrics on the CelebA testing dataset. Higher values are better.

We are inclined to think that the improvement in the metrics (especially PSNR) comes from a good equilibrium between our reconstructor and our discriminator. While $R$ learns to produce the coarse facial features (natural-looking structures) via the reconstruction loss, $D$ pushes $R$ to refine the results, “asking” for finer details via the adversarial loss through its global and patch branches.

Finally, these results demonstrate that our approach is able to utilize the end-to-end model architecture to propagate informative gradients, which eventually leads to a significant performance gain. Nevertheless, note that we do not aim at outperforming state-of-the-art image inpainting techniques; rather, we use inpainting as a crucial part of our attribute transfer system.

4.4 Attribute Transfer

In this subsection, we focus on attribute manipulation. We validate that faces change according to the specified target attribute. This phenomenon is known as attribute transfer or morphing. In particular, we focus on the following set of attributes: eyeglasses, mustache, smiling and young.

Figure 8 shows the transformed images (with the target attribute), in which we can qualitatively determine the results of the model by judging the attribute transfer results. We can observe that ATI-GAN clearly generates natural-looking faces containing the target attributes providing very competitive results on test data. This is possible because of the inherent properties of the end-to-end system that takes advantage of the inpainting structure (among others) presented in this work.

(a) Smiling (up) and Not-Smiling (down) transformations.
(b) Eyeglasses (up) and Not-Eyeglasses (down) transformations.
(c) Young (up) and Old (down) transformations.
(d) Not-Mustache (up) and Mustache (down) transformations.
Figure 8: Each pair depicts the real image (left) and the transformed image (right). The variety within the examples shows robust feature translation independently of gender, race and age.
Transformation                        Recognized (%)
Smiling -> Not-Smiling                86
Not-Smiling -> Smiling                88
Eyeglasses -> Not-Eyeglasses          50
Not-Eyeglasses -> Eyeglasses          82
Young -> Old                          85
Old -> Young                          70
Mustache -> Not-Mustache (men)
Not-Mustache -> Mustache (men)
Not-Mustache -> Mustache (women)
Table 2: Perceptual evaluation on transformed images for each attribute (smiling, eyeglasses, young and mustache).

Additionally, we have performed a user study to assess the attribute transfer tasks. It consists of a survey in which users label an image with 0 when the attribute is not recognized and with 1 otherwise. For each attribute transfer, we conduct the test on a subset of the testing data. Note that each generated image has a single attribute translation from the aforementioned list.
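A percentage like those in Table 2 can be obtained from such 0/1 labels by simple averaging. The votes below are made up for illustration and do not come from the study.

```python
def success_rate(labels):
    """Percentage of images for which the target attribute was recognized."""
    if not labels:
        raise ValueError("at least one label is required")
    return 100.0 * sum(labels) / len(labels)


# Hypothetical survey answers for one attribute transfer (1 = recognized).
votes = [1, 1, 0, 1, 1, 0, 1, 1]
print(success_rate(votes))
```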

According to Table 2, a large share of our translations achieve a successful attribute transformation. More interesting, however, is to analyze the meaning behind the percentages. By inspecting pairs of transformations, for instance Eyeglasses -> Not-Eyeglasses and vice versa, we can notice that there is no symmetry in transferring attributes. This mainly happens because ATI-GAN is not an invertible model: moving from domain A to B involves one path (a certain set of operations) and moving from B to A another path (another set of operations). A second plausible explanation for this asymmetry is the nature of the dataset used in the experiments. It is known that CelebA suffers from unbalanced attributes, meaning that not all the attributes are equally present. As a result, the ability to transfer might depend on the number of samples containing the involved feature. For example, Not-Mustache -> Mustache has a much lower success rate for women than for men, because there are no examples of women with mustaches.

5 Conclusions

In this paper, we introduce a novel image-to-image translation model capable of applying an accurate local attribute transformation. Previous attribute transfer works were mostly based only on the manipulation of the GAN latent space. In contrast, we propose a different approach that uses inpainting as a part of our end-to-end system. Our method takes advantage of the fact that attributes are induced by local structures. Therefore, it is sufficient to change only parts of the image, while the remaining parts can be used to force the generator into realistic outputs. We show how ATI-GAN can synthesize high quality human face images. We believe the method can be generalized to other objects and domains, producing synthetic data containing certain attributes on demand. We see many interesting avenues for future work, including exploring multi-attribute transfer.