Fader Networks: Manipulating Images by Sliding Attributes

06/01/2017 ∙ by Guillaume Lample, et al. ∙ Facebook, Laboratoire d'Informatique de Paris 6

This paper introduces a new encoder-decoder architecture that is trained to reconstruct images by disentangling the salient information of the image and the values of attributes directly in the latent space. As a result, after training, our model can generate different realistic versions of an input image by varying the attribute values. By using continuous attribute values, we can choose how much a specific attribute is perceivable in the generated image. This property could allow for applications where users can modify an image using sliding knobs, like faders on a mixing console, to change the facial expression of a portrait, or to update the color of some objects. Compared to the state-of-the-art which mostly relies on training adversarial networks in pixel space by altering attribute values at train time, our approach results in much simpler training schemes and nicely scales to multiple attributes. We present evidence that our model can significantly change the perceived value of the attributes while preserving the naturalness of images.




1 Introduction

We are interested in the problem of manipulating natural images by controlling some attributes of interest. For example, given a photograph of the face of a person described by their gender, age, and expression, we want to generate a realistic version of this same person looking older or happier, or an image of a hypothetical twin of the opposite gender. This task and the related problem of unsupervised domain transfer have recently received a lot of interest icgan ; yan2016attribute2image ; image2image ; cyclegan ; dtn ; lior_avatars , both as a case study for conditional generative models and for applications like automatic image editing. The key challenge is that the transformations are ill-defined and training is unsupervised: the training set contains images annotated with the attributes of interest, but there is no example of the transformation. In many cases, such as the “gender swapping” example above, there are no pairs of images representing the same person as a male or as a female. In other cases, collecting examples requires a costly annotation process, like taking pictures of the same person with and without glasses.

Our approach relies on an encoder-decoder architecture where, given an input image $x$ with its attributes $y$, the encoder maps $x$ to a latent representation $z$, and the decoder is trained to reconstruct $x$ given $(z, y)$. At inference time, a test image is encoded in the latent space, and the user chooses the attribute values $y$ that are fed to the decoder. Even with binary attribute values at train time, each attribute can be considered as a continuous variable during inference to control how much it is perceived in the final image. We call our architecture Fader Networks, in analogy to the sliders of an audio mixing console, since the user can choose how much of each attribute they want to incorporate.

The fundamental feature of our approach is to constrain the latent space to be invariant to the attributes of interest. Concretely, it means that the distribution over images of the latent representations should be identical for all possible attribute values. This invariance is obtained by using a procedure similar to domain-adversarial training (see e.g., schmidhuber1992learning ; dann ; louppe2016learning ). In this process, a classifier learns to predict the attributes $y$ given the latent representation $z$ during training while the encoder-decoder is trained based on two objectives at the same time. The first objective is the reconstruction error of the decoder, i.e., the latent representation $z$ must contain enough information to allow for the reconstruction of the input. The second objective consists in fooling the attribute classifier, i.e., the latent representation must prevent it from predicting the correct attribute values. In this model, achieving invariance is a means to filter out, or hide, the properties of the image that are related to the attributes of interest. A single latent representation thus corresponds to different images that share a common structure but with different attribute values. The reconstruction objective then forces the decoder to use the attribute values to choose, from the latent representation, the intended image.

Our motivation is to learn a disentangled latent space in which we have explicit control on some attributes of interest, without supervision of the intended result of modifying attribute values. With a similar motivation, several approaches have been tested on the same tasks icgan ; yan2016attribute2image , on related image-to-image translation problems image2image ; cyclegan , or for more specific applications like the creation of parametrized avatars lior_avatars . In addition to a reconstruction loss, the vast majority of these works rely on adversarial training in pixel space, which, during training, compares images generated with an intentional change of attributes against genuine images that have the target attribute values. Our approach is different both because we use adversarial training for the latent space instead of the output, and because adversarial training aims at learning invariance to attributes. The assumption underlying our work is that a high fidelity to the input image conflicts less with the invariance criterion than with a criterion that forces the hallucinated image to match images from the training set.

As a consequence of this principle, our approach results in much simpler training pipelines than those based on adversarial training in pixel space, and is readily amenable to controlling multiple attributes, by adding new output variables to the discriminator of the latent space. As shown in Figure 1 on test images from the CelebA dataset liu2015faceattributes , our model can make subtle changes to portraits that end up sufficient to alter the perceived value of attributes while preserving the natural aspect of the image and the identity of the person. Our experiments show that our model outperforms previous methods based on adversarial training on the decoders’ output like icgan in terms of both reconstruction loss and generation quality as measured by human subjects. We believe this disentanglement approach is a serious competitor to the widespread adversarial losses on the decoder output for such tasks.

Figure 1: Interpolation between different attributes (Zoom in for better resolution). Each line shows reconstructions of the same face with different attribute values, where each attribute is controlled as a continuous variable. It is then possible to make an old person look older or younger, a man look more manly or to imagine his female version. Left images are the originals.

In the remainder of the paper, we discuss the related work in more detail in Section 2. We then present the training procedure in Section 3 before describing the network architecture and the implementation in Section 4. Experimental results are shown in Section 5.

2 Related work

There is substantial literature on attribute-based and/or conditional image generation that can be split in terms of required supervision, with three different levels. At one extreme are fully supervised approaches developed to model known transformations, where examples take the form of (input, transformation, result of the transformation). In that case, the model needs to learn the desired transformation. This setting was previously explored to learn affine transformations transforming_autoenc , 3D rotations weakly_3D , lighting variations inverse_graphic_networks and 2D video game animations deep_analogy . The methods developed in these works however rely on the supervised setting, and thus cannot be applied in our setup.

At the other extreme of the supervision spectrum lie fully unsupervised methods that aim at learning deep neural networks that disentangle the factors of variations in the data, without specification of the attributes. Example methods are InfoGAN chen2016infogan , or the predictability minimization framework proposed in schmidhuber1992learning . The neural photo editor brock2016neural disentangles factors of variations in natural images for image editing. This setting is considerably harder than the one we consider, and it may be difficult with these methods to automatically discover high-level concepts such as gender or age.

Our work lies in between the two previous settings. It is related to information as in mathieu2016disentangling . Methods developed for unsupervised domain transfer image2image ; cyclegan ; dtn ; lior_avatars can also be applied in our case: given two different domains of images such as “drawings” and “photographs”, one wants to map an image from one domain to the other without supervision; in our case, a domain would correspond to an attribute value. The mappings are trained using adversarial training in pixel space as mentioned in the introduction, using separate encoders and/or decoders per domain, and thus do not scale well to multiple attributes. In this line of work, but more specifically considering the problem of modifying attributes, the Invertible conditional GAN icgan first trains a GAN conditioned on the attribute values, and in a second step learns to map input images to the latent space of the GAN, hence the name of invertible GANs. It is used as a baseline in our experiments. Antipov et al. faceaging use a pre-trained face recognition system instead of a conditional GAN to learn the latent space, and only focus on the age attribute. The attribute-to-image approach yan2016attribute2image is a variational auto-encoder that disentangles foreground and background to generate images using attribute values only. Conditional generation is performed by inferring the latent state given the correct attributes and then changing the attributes.

Additionally, our work is related to work on learning invariant latent spaces using adversarial training in domain adaptation dann , fair classification edwards2015censoring and robust inference louppe2016learning . The training criterion we use for enforcing invariance is similar to the one used in those works; the difference is that the end-goal of these works is only to filter out nuisance variables or sensitive information. In our case, we learn generative models, and invariance is used as a means to force the decoder to use attribute information in its reconstruction.

Finally, for the application of automatically modifying faces using attributes, the feature interpolation approach of upchurch2016deep presents a means to generate alterations of images based on attributes using a network pre-trained on ImageNet. While their approach is interesting from an application perspective, their inference is costly and, since it relies on pre-trained models, cannot naturally incorporate factors or attributes that have not been foreseen during the pre-training.

3 Fader Networks

Let $\mathcal{X}$ be an image domain and $\mathcal{Y}$ the set of possible attributes associated with images in $\mathcal{X}$, where in the case of people’s faces typical attributes are glasses/no glasses, man/woman, young/old. For simplicity, we consider here the case where attributes are binary, but our approach could be extended to categorical attributes. In that setting, $\mathcal{Y} = \{0, 1\}^n$, where $n$ is the number of attributes. We have a training set $\mathcal{D} = \{(x^1, y^1), \dots, (x^m, y^m)\}$ of $m$ pairs (image, attribute) $(x^i \in \mathcal{X}, y^i \in \mathcal{Y})$. The end goal is to learn from $\mathcal{D}$ a model that will generate, for any attribute vector $y'$, a version of an input image $x$ whose attribute values correspond to $y'$.

Encoder-decoder architecture

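Our model, described in Figure 2, is based on an encoder-decoder architecture with domain-adversarial training on the latent space. The encoder $E_{\theta_{enc}} : \mathcal{X} \to \mathbb{R}^N$ is a convolutional neural network with parameters $\theta_{enc}$ that maps an input image to its $N$-dimensional latent representation $E_{\theta_{enc}}(x)$. The decoder $D_{\theta_{dec}} : (\mathbb{R}^N, \mathcal{Y}) \to \mathcal{X}$ is a deconvolutional network with parameters $\theta_{dec}$ that produces a new version of the input image given its latent representation $E_{\theta_{enc}}(x)$ and any attribute vector $y'$. When the context is clear, we simply use $D$ and $E$ to denote $D_{\theta_{dec}}$ and $E_{\theta_{enc}}$. The precise architectures of the neural networks are described in Section 4. The auto-encoding loss associated to this architecture is a classical mean squared error (MSE) that measures the quality of the reconstruction of a training input $x$ given its true attribute vector $y$:

$$L_{AE}(\theta_{enc}, \theta_{dec}) = \frac{1}{m} \sum_{(x,y) \in \mathcal{D}} \left\| D_{\theta_{dec}}\big(E_{\theta_{enc}}(x), y\big) - x \right\|_2^2$$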

The exact choice of the reconstruction loss is not fundamental in our approach, and adversarial losses such as PatchGAN li2016precomputed could be used in addition to the MSE at this stage to obtain better textures or sharper images, as in image2image . Using a mean absolute or mean squared error is still necessary to ensure that the reconstruction matches the original image.
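As a minimal sketch of the auto-encoding loss described above, the snippet below averages the squared L2 distance between reconstructions and originals over a batch. The function name and the flat-list image representation are illustrative assumptions, not the authors' code; deep-learning frameworks usually also average over pixels, which only changes the loss by a constant factor.

```python
def mse_reconstruction_loss(reconstructions, originals):
    """Mean over images of the squared L2 distance between flattened images."""
    if not originals or len(reconstructions) != len(originals):
        raise ValueError("need two non-empty batches of equal size")
    total = 0.0
    for x_hat, x in zip(reconstructions, originals):
        # squared L2 norm of the pixel-wise difference for one image
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(originals)


# Two toy "images" with three pixels each (hypothetical values).
originals = [[0.0, 0.5, 1.0], [1.0, 1.0, 0.0]]
reconstructions = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]
loss = mse_reconstruction_loss(reconstructions, originals)
```

Here the first image contributes 0.25 and the second 1.0, so the batch loss is their mean.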

Ideally, modifying $y$ in $D(E(x), y)$ would generate images with different perceived attributes, but similar to $x$ in every other aspect. However, without additional constraints, the decoder learns to ignore the attributes, and modifying $y$ at test time has no effect.

Learning attribute-invariant latent representations

To avoid this behavior, our approach is to learn latent representations that are invariant with respect to the attributes. By invariance, we mean that given two versions of the same object $x$ and $x'$ that are the same up to their attribute values, for instance two images of the same person with and without glasses, the two latent representations $E(x)$ and $E(x')$ should be the same. When such an invariance is satisfied, the decoder must use the attribute to reconstruct the original image. Since the training set does not contain different versions of the same image, this constraint cannot be trivially added in the loss.

We hence propose to incorporate this constraint by doing adversarial training on the latent space. This idea is inspired by the work on predictability minimization schmidhuber1992learning and adversarial training for domain adaptation dann ; louppe2016learning where the objective is also to learn an invariant latent representation using an adversarial formulation of the learning objective. To that end, an additional neural network called the discriminator is trained to identify the true attributes $y$ of a training pair $(x, y)$ given $E(x)$. The invariance is obtained by learning the encoder $E$ such that the discriminator is unable to identify the right attributes. As in GANs goodfellow2014generative , this corresponds to a two-player game where the discriminator aims at maximizing its ability to identify attributes, and $E$ aims at preventing it from being a good discriminator. The exact structure of our discriminator is described in Section 4.

Discriminator objective

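The discriminator outputs probabilities of an attribute vector $P_{\theta_{dis}}(y \mid E(x))$, where $\theta_{dis}$ are the discriminator’s parameters. Using the subscript $k$ to refer to the $k$-th attribute, we have $\log P_{\theta_{dis}}(y \mid E(x)) = \sum_{k=1}^{n} \log P_{\theta_{dis},k}(y_k \mid E(x))$. Since the objective of the discriminator is to predict the attributes of the input image given its latent representation, its loss depends on the current state of the encoder and is written as:

$$L_{dis}(\theta_{dis} \mid \theta_{enc}) = -\frac{1}{m} \sum_{(x,y) \in \mathcal{D}} \log P_{\theta_{dis}}\big(y \mid E_{\theta_{enc}}(x)\big)$$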
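The discriminator objective is a negative log-likelihood that factorizes over attributes, which for binary attributes amounts to a sum of binary cross-entropies. The sketch below assumes the discriminator emits, for each attribute, the probability that its value is 1; the function name, the `eps` clamp and the toy numbers are illustrative assumptions.

```python
import math


def discriminator_loss(probs, attrs, eps=1e-12):
    """Negative log-likelihood of the true binary attributes.

    probs[i][k]: predicted probability that attribute k of example i equals 1.
    attrs[i][k]: true value of that attribute, in {0, 1}.
    """
    total = 0.0
    for p_row, y_row in zip(probs, attrs):
        for p, y in zip(p_row, y_row):
            # probability assigned to the true value y_k
            p_true = p if y == 1 else 1.0 - p
            total -= math.log(max(p_true, eps))  # clamp avoids log(0)
    return total / len(attrs)


probs = [[0.9, 0.2], [0.5, 0.5]]   # hypothetical discriminator outputs
attrs = [[1, 0], [0, 1]]           # true attributes
loss = discriminator_loss(probs, attrs)
```

A discriminator that assigns probability 1 to every true value would reach a loss of zero; an uninformed one that outputs 0.5 everywhere pays log 2 per attribute.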

Adversarial objective

The objective of the encoder is now to compute a latent representation that optimizes two objectives. First, the decoder should be able to reconstruct $x$ given $E(x)$ and $y$, and at the same time the discriminator should not be able to predict $y$ given $E(x)$. We consider that a mistake is made when the discriminator predicts $1 - y_k$ for attribute $k$. Given the discriminator’s parameters, the complete loss of the encoder-decoder architecture is then:

$$L(\theta_{enc}, \theta_{dec} \mid \theta_{dis}) = \frac{1}{m} \sum_{(x,y) \in \mathcal{D}} \left[ \left\| D_{\theta_{dec}}\big(E_{\theta_{enc}}(x), y\big) - x \right\|_2^2 - \lambda_E \log P_{\theta_{dis}}\big(1 - y \mid E_{\theta_{enc}}(x)\big) \right]$$

where $\lambda_E > 0$ controls the trade-off between the quality of the reconstruction and the invariance of the latent representations. Large values of $\lambda_E$ will restrain the amount of information about $x$ contained in $E(x)$, and result in blurry images, while low values limit the decoder’s dependency on the latent code $y$ and will result in poor effects when altering attributes.
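To make the two competing terms concrete, here is a hedged per-example sketch of the encoder-decoder loss: the reconstruction error minus a weighted log-probability that the discriminator assigns to the flipped attributes. The names `encoder_decoder_loss` and `lambda_e`, and all numeric values, are assumptions made for illustration.

```python
import math


def encoder_decoder_loss(recon_sq_error, probs, attrs, lambda_e, eps=1e-12):
    """Reconstruction error minus lambda_e * log P(1 - y | E(x)) for one example.

    probs[k]: discriminator probability that attribute k equals 1.
    attrs[k]: true value of attribute k, in {0, 1}.
    """
    log_p_flipped = 0.0
    for p, y in zip(probs, attrs):
        # probability the discriminator gives to the wrong value 1 - y_k
        p_wrong = 1.0 - p if y == 1 else p
        log_p_flipped += math.log(max(p_wrong, eps))
    return recon_sq_error - lambda_e * log_p_flipped


attrs = [1, 0]
# Discriminator reads the attributes easily from the latent code: high penalty.
confident = encoder_decoder_loss(0.1, [0.99, 0.01], attrs, lambda_e=0.5)
# Discriminator is at chance level: much smaller penalty.
fooled = encoder_decoder_loss(0.1, [0.5, 0.5], attrs, lambda_e=0.5)
```

With the same reconstruction error, the loss is lower when the discriminator cannot tell the attributes apart, which is the pressure that pushes the encoder toward attribute-invariant codes. Setting `lambda_e` to zero recovers a plain auto-encoder loss.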

Learning algorithm

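Overall, given the current state of the encoder, the optimal discriminator parameters satisfy $\theta^*_{dis}(\theta_{enc}) \in \arg\min_{\theta_{dis}} L_{dis}(\theta_{dis} \mid \theta_{enc})$. If we ignore problems related to multiple (and local) minima, the overall objective function is

$$\theta^*_{enc}, \theta^*_{dec} = \arg\min_{\theta_{enc}, \theta_{dec}} L\big(\theta_{enc}, \theta_{dec} \mid \theta^*_{dis}(\theta_{enc})\big)$$

In practice, it is unreasonable to solve for $\theta^*_{dis}(\theta_{enc})$ at each update of $\theta_{enc}$. Following the practice of adversarial training for deep networks, we use stochastic gradient updates for all parameters, considering the current value of $\theta_{dis}$ as an approximation for $\theta^*_{dis}(\theta_{enc})$. Given a training example $(x, y)$, let us denote $L(\theta_{enc}, \theta_{dec} \mid \theta_{dis}, x, y)$ the auto-encoder loss restricted to $(x, y)$ and $L_{dis}(\theta_{dis} \mid \theta_{enc}, x, y)$ the corresponding discriminator loss. The update at time $t$ given the current parameters $\theta^{(t)}_{dis}$, $\theta^{(t)}_{enc}$, and $\theta^{(t)}_{dec}$ and the training example $(x^{(t)}, y^{(t)})$, with learning rate $\eta$, is:

$$\theta^{(t+1)}_{dis} = \theta^{(t)}_{dis} - \eta \nabla_{\theta_{dis}} L_{dis}\big(\theta^{(t)}_{dis} \mid \theta^{(t)}_{enc}, x^{(t)}, y^{(t)}\big)$$

$$\big[\theta^{(t+1)}_{enc}, \theta^{(t+1)}_{dec}\big] = \big[\theta^{(t)}_{enc}, \theta^{(t)}_{dec}\big] - \eta \nabla_{\theta_{enc}, \theta_{dec}} L\big(\theta^{(t)}_{enc}, \theta^{(t)}_{dec} \mid \theta^{(t+1)}_{dis}, x^{(t)}, y^{(t)}\big)$$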
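The alternating stochastic updates can be illustrated on a deliberately tiny model. The sketch below is an assumption-laden toy, not the paper's implementation: the encoder, decoder and discriminator are scalar linear or logistic functions, there is a single binary attribute, and gradients are taken by finite differences instead of backpropagation. It only shows the order of the two steps: discriminator first, then encoder-decoder against the freshly updated discriminator.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def log_prob(theta_dis, z, y):
    """log P(y | z) for a one-attribute logistic discriminator."""
    p_one = sigmoid(theta_dis[0] * z + theta_dis[1])
    p = p_one if y == 1 else 1.0 - p_one
    return math.log(max(p, 1e-12))


def dis_loss(theta_dis, theta_ae, x, y):
    # theta_ae[0] is the encoder weight, so z = theta_ae[0] * x
    return -log_prob(theta_dis, theta_ae[0] * x, y)


def ae_loss(theta_ae, theta_dis, x, y, lambda_e):
    w_enc, w_dec, w_attr = theta_ae
    z = w_enc * x                      # toy encoder
    x_hat = w_dec * z + w_attr * y     # toy decoder, conditioned on y
    return (x_hat - x) ** 2 - lambda_e * log_prob(theta_dis, z, 1 - y)


def num_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def training_step(theta_ae, theta_dis, x, y, lr=0.05, lambda_e=0.1):
    # 1) discriminator step, encoder frozen
    g_dis = num_grad(lambda t: dis_loss(t, theta_ae, x, y), theta_dis)
    theta_dis = [t - lr * g for t, g in zip(theta_dis, g_dis)]
    # 2) encoder-decoder step against the updated discriminator
    g_ae = num_grad(lambda t: ae_loss(t, theta_dis, x, y, lambda_e), theta_ae)
    theta_ae = [t - lr * g for t, g in zip(theta_ae, g_ae)]
    return theta_ae, theta_dis


theta_ae, theta_dis = [0.5, 0.5, 0.0], [0.0, 0.0]
data = [(1.0, 1), (-1.0, 0), (0.8, 1), (-0.7, 0)]   # hypothetical (x, y) pairs
for step in range(100):
    x, y = data[step % len(data)]
    theta_ae, theta_dis = training_step(theta_ae, theta_dis, x, y)
```

In a real implementation the two gradient steps would be taken by an optimizer such as Adam on mini-batches, but the alternation between the two losses is the same.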

The details of training and models are given in the next section.

Figure 2: Main architecture. An (image, attribute) pair $(x, y)$ is given as input. The encoder maps $x$ to the latent representation $z$; the discriminator is trained to predict $y$ given $z$ whereas the encoder is trained to make it impossible for the discriminator to predict $y$ given $z$ only. The decoder should reconstruct $x$ given $(z, y)$. At test time, the discriminator is discarded and the model can generate different versions of $x$ when fed with different attribute values.

4 Implementation

We adapt the architecture of our network from image2image . Let $C_k$ be a Convolution-BatchNorm-ReLU layer with $k$ filters. Convolutions use kernels of size $4 \times 4$, with a stride of $2$ and a padding of $1$, so that each layer of the encoder divides the size of its input by $2$. We use leaky-ReLUs with a slope of $0.2$ in the encoder, and simple ReLUs in the decoder.

The encoder consists of the following 7 layers:

$$C_{16} - C_{32} - C_{64} - C_{128} - C_{256} - C_{512} - C_{512}$$

Input images have a size of $256 \times 256$. As a result, the latent representation of an image consists of $512$ feature maps of size $2 \times 2$. In our experiments, using 6 layers gave us similar results, while 8 layers significantly decreased the performance, even when using more feature maps in the latent state.
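The spatial size of the latent feature maps follows from standard convolution arithmetic. The sketch below assumes a 4×4 kernel, stride 2 and padding 1 (the usual size-halving configuration in pix2pix-style encoders) and a 256×256 input; the helper names are hypothetical.

```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(input_size, num_layers):
    """Sizes of the feature maps before and after each encoder layer."""
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(conv_output_size(sizes[-1]))
    return sizes


sizes = encoder_sizes(256, 7)   # [256, 128, 64, 32, 16, 8, 4, 2]
```

Under these assumptions seven layers leave a 2×2 map, and an eighth layer would collapse it to 1×1, which removes all spatial layout from the latent code.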

To provide the decoder with image attributes, we append the latent code to each layer given as input to the decoder, where the latent code of an image is the concatenation of the one-hot vectors representing the values of its attributes (binary attributes are represented as $[1, 0]$ and $[0, 1]$). We append the latent code as additional constant input channels for all the convolutions of the decoder. Denoting by $n$ the number of attributes (hence a code of size $2n$), the decoder is symmetric to the encoder, but uses transposed convolutions for the up-sampling:

$$C_{512+2n} - C_{512+2n} - C_{256+2n} - C_{128+2n} - C_{64+2n} - C_{32+2n} - C_{16+2n}$$
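The attribute code and its injection as constant channels can be sketched as follows. This is an illustration under assumptions: the value 0 is encoded as the pair [1, 0] and 1 as [0, 1], feature maps are plain nested lists, and both function names are hypothetical. The same encoding extends to a continuous slider value at inference time.

```python
def attribute_code(values):
    """Concatenate one pair [1 - a, a] per attribute (a=0 -> [1, 0], a=1 -> [0, 1])."""
    code = []
    for a in values:
        code.extend([1.0 - float(a), float(a)])
    return code


def append_code_channels(feature_maps, code):
    """Append one constant H x W channel per code entry to a list of feature maps."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    constant = [[[c] * width for _ in range(height)] for c in code]
    return feature_maps + constant


latent = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]   # 3 toy 2x2 feature maps
code = attribute_code([1, 0])                            # n = 2 attributes, code size 2n = 4
decoder_input = append_code_channels(latent, code)       # 3 + 4 = 7 channels
slider = attribute_code([0.25])                          # continuous value at inference
```

Because the code is appended at every decoder layer, each layer sees the attribute values directly instead of relying on earlier layers to propagate them.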

The discriminator is a $C_{512}$ layer followed by a fully-connected neural network of two layers of size $512$ and $n$ respectively.


We found it extremely beneficial to add dropout in our discriminator. We set the dropout rate to $0.3$ in all our experiments. Following image2image , we also tried to add dropout in the first layers of the decoder, but in our experiments, this turned out to significantly decrease the performance.

Discriminator cost scheduling

Similarly to bowman2015generating , we use a variable weight for the discriminator loss coefficient $\lambda_E$. We initially set $\lambda_E$ to $0$ and the model is trained like a normal auto-encoder. Then, $\lambda_E$ is linearly increased to $0.0001$ over the first $500{,}000$ iterations to slowly encourage the model to produce invariant representations. This scheduling turned out to be critical in our experiments. Without it, we observed that the encoder was too affected by the loss coming from the discriminator, even for low values of $\lambda_E$.
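The linear warm-up of the discriminator loss coefficient can be written in a few lines. In the sketch below the function name and the default values (`lambda_max=1e-4`, `warmup_steps=500_000`) are assumptions chosen for illustration and should be replaced by the values of the actual setup.

```python
def lambda_schedule(step, lambda_max=1e-4, warmup_steps=500_000):
    """Linear ramp from 0 to lambda_max over warmup_steps, constant afterwards."""
    if step <= 0:
        return 0.0
    return lambda_max * min(1.0, step / warmup_steps)


start = lambda_schedule(0)            # plain auto-encoder at the beginning
halfway = lambda_schedule(250_000)    # half of lambda_max
after = lambda_schedule(1_000_000)    # capped at lambda_max
```

Starting at zero lets the auto-encoder first learn to reconstruct before the invariance pressure is gradually introduced.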

Model selection

Model selection was first performed automatically using two criteria. First, we used the reconstruction error on original images as measured by the MSE. Second, we also want the model to properly swap the attributes of an image. For this second criterion, we train a classifier to predict image attributes. At the end of each epoch, we swap the attributes of each image in the validation set and measure how well the classifier performs on the decoded images. These two metrics were used to shortlist promising models. The final model was selected based on human evaluation on images from the train set reconstructed with swapped attributes.
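The second selection criterion can be sketched as a swap accuracy: flip every binary attribute, decode, and check how often an attribute classifier reads the flipped value. In the snippet below the classifier outputs are hard-coded hypothetical predictions, and the function names are illustrative assumptions.

```python
def swap_attributes(attrs):
    """Flip every binary attribute of every example."""
    return [[1 - a for a in row] for row in attrs]


def swap_accuracy(predicted, targets):
    """Fraction of attribute values the classifier reads as the swapped target."""
    hits = total = 0
    for p_row, t_row in zip(predicted, targets):
        for p, t in zip(p_row, t_row):
            hits += int(p == t)
            total += 1
    return hits / total


original = [[1, 0], [0, 0]]
targets = swap_attributes(original)        # [[0, 1], [1, 1]]
classifier_output = [[0, 1], [1, 0]]       # hypothetical predictions on decoded images
accuracy = swap_accuracy(classifier_output, targets)
```

A model that ignores the attribute code would score near zero on this metric even if its reconstruction error is low, which is why both criteria are needed.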

Model          Naturalness (%)            Accuracy (%)
               Mouth   Smile   Glasses    Mouth   Smile   Glasses
Real Image     92.6    87.0    88.6       89.0    88.3    97.6
IcGAN AE       22.7    21.7    14.8       88.1    91.7    86.2
IcGAN Swap     11.4    22.9     9.6       10.1     9.9    47.5
FadNet AE      88.4    75.2    78.8       91.8    90.1    94.5
FadNet Swap    79.0    31.4    45.3       66.2    97.1    76.6

Table 1: Perceptual evaluation of naturalness and swap accuracy for each model. The naturalness score is the percentage of images that were labeled as “real” by human evaluators to the question “Is this image a real photograph or a fake generated by a graphics engine?”. The accuracy score is the classification accuracy by human evaluators on the values of each attribute.

5 Experiments

5.1 Experiments on the celebA dataset

Experimental setup

We first present experiments on the celebA dataset liu2015faceattributes , which contains $200{,}000$ images of celebrities of shape $178 \times 218$ annotated with $40$ attributes. We used the standard training, validation and test split. All pictures presented in the paper or used for evaluation have been taken from the test set. For pre-processing, we cropped images to $178 \times 178$, and resized them to $256 \times 256$, which is the resolution used in all figures of the paper. Image values were normalized to $[-1, 1]$. All models were trained with Adam kingma2014adam , using a learning rate of $0.002$, $\beta_1 = 0.5$, and a batch size of $32$. We performed data augmentation by flipping images horizontally with a probability of $0.5$ at each iteration. As model baseline, we used IcGAN icgan with the model provided by the authors and trained on the same dataset (footnote 1: https://github.com/Guim3/IcGAN).
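The pre-processing steps (square crop, value normalization, random horizontal flip) can be sketched on nested-list images. Several points are assumptions: the crop is taken from the centre, pixels are 8-bit values in [0, 255], and resizing is omitted because it needs an image library. All function names are hypothetical.

```python
import random


def center_crop(image, size):
    """Crop a size x size square from the centre of an H x W nested-list image."""
    height, width = len(image), len(image[0])
    top, left = (height - size) // 2, (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def normalize(image):
    """Map 8-bit pixel values from [0, 255] to [-1, 1]."""
    return [[2.0 * p / 255.0 - 1.0 for p in row] for row in image]


def random_hflip(image, prob=0.5, rng=random):
    """Mirror the image left-right with probability prob."""
    if rng.random() < prob:
        return [row[::-1] for row in image]
    return image


image = [[(r * 4 + c) * 10 for c in range(4)] for r in range(6)]   # toy 6x4 image
square = center_crop(image, 4)                                      # rows 1..4 kept
scaled = normalize(square)
augmented = random_hflip(scaled, prob=0.5, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.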

Qualitative evaluation

Figure 3 shows examples of images generated when swapping different attributes: the generated images have a high visual quality and clearly handle the attribute value changes, for example by adding realistic glasses to the different faces. These generated images confirm that the latent representation learned by Fader Networks is invariant to the attribute values and also captures the information needed to generate any version of a face, for any attribute value. Indeed, when looking at the shape of the generated glasses, different glasses shapes and colors have been integrated into the original face depending on the face: our model is not only adding “generic” glasses to all faces, but generates plausible glasses depending on the input.

Quantitative evaluation protocol

We performed a quantitative evaluation of Fader Networks on Mechanical Turk, using IcGAN as a baseline. We chose the three attributes Mouth (Open/Close), Smile (With/Without) and Glasses (With/Without) as they were attributes in common between IcGAN and our model. We evaluated two different aspects of the generated images: the naturalness, which measures the quality of generated images, and the accuracy, which measures how well swapping an attribute value is reflected in the generation. Both measures are necessary to assess that we generate natural images, and that the swap is effective. We compare: Real Image, which provides original images without transformation; FadNet AE and IcGAN AE, which reconstruct original images without attribute alteration; and FadNet Swap and IcGAN Swap, which generate images with one swapped attribute, e.g., With Glasses → Without Glasses. Before being submitted to Mechanical Turk, all images were cropped and resized following the same processing as IcGAN. As a result, output images were displayed in $64 \times 64$ resolution, also preventing Workers from basing their judgment exclusively on the sharpness of presented images.

Technically, we should also assess that the identity of a person is preserved when swapping attributes. This seemed to be a problem for GAN-based methods, but the reconstruction quality of our model is very good (RMSE on test of $0.0009$, to be compared to $0.028$ for IcGAN), and we did not observe this issue. Therefore, we did not evaluate this aspect.

For naturalness, the first $500$ images from the test set such that there are $250$ images for each attribute value were shown to Mechanical Turk Workers, $100$ for each of the $5$ different models presented above. For each image, we asked whether the image seems natural or generated. The description given to the Workers to understand their task showed $4$ examples of real images, and $4$ examples of fake images ($1$ FadNet AE, $1$ FadNet Swap, $1$ IcGAN AE, $1$ IcGAN Swap).

The accuracy of each model on each attribute was evaluated in a different classification task, resulting in a total of 15 experiments. For example, the FadNet/Glasses experiment consisted in asking Workers whether people with glasses being added by FadNet Swap effectively possess glasses, and vice-versa. This allows us to evaluate how perceptible the swaps are to the human eye. In each experiment, $100$ images were shown ($50$ images per class, in the order they appear in the test set).

In both quantitative evaluations, each experiment was performed by $10$ Workers, resulting in $5{,}000$ samples per experiment for naturalness, and $1{,}000$ samples per classification experiment on swapped attributes. The results on both tasks are shown in Table 1.

Figure 3: Swapping the attributes of different faces. Zoom in for better resolution.

Quantitative results

In the naturalness experiments, only around 90% of real images were classified as “real” by the Workers, indicating the high level of requirement to generate natural images. Our model obtained high naturalness accuracies when reconstructing images without swapping attributes: 88.4%, 75.2% and 78.8%, compared to IcGAN reconstructions whose accuracy does not exceed 23%, whether it be for reconstructed or swapped images. For the swap, FadNet Swap still consistently outperforms IcGAN Swap by a large margin. However, the naturalness accuracy varies a lot based on the swapped attribute: from 79.0% for the opening of the mouth, down to 31.4% for the smile.

Classification experiments show that reconstructions with FadNet AE and IcGAN AE have very high classification scores, and are even on par with real images on both Mouth and Smile. FadNet Swap obtains an accuracy of 66.2% for the mouth, 76.6% for the glasses and 97.1% for the smile, indicating that our model can swap these attributes with a very high efficiency. On the other hand, with accuracies of 10.1%, 47.5% and 9.9% on these same attributes, IcGAN Swap does not seem able to generate convincing swaps.

Figure 4: (Zoom in for better resolution.) Examples of multi-attribute swap (Gender / Opened eyes / Eye glasses) performed by the same model. Left images are the originals.

Multi-attributes swapping

We present qualitative results for the ability of our model to swap multiple attributes at once in Figure 4, by jointly modifying the gender, open eyes and glasses attributes. Even in this more difficult setting, our model can generate convincing images with multiple swaps.

5.2 Experiments on Flowers dataset

We performed additional experiments on the Oxford-102 dataset, which contains about $9{,}000$ images of flowers classified into $102$ categories flowersdataset . Since the dataset does not contain other labels than the flower categories, we built a list of color attributes from the flower captions provided by flowerscaptions . Each flower is provided with $10$ different captions. For a given color, we gave a flower the associated color attribute if that color appears in at least 5 out of the 10 different captions. Although naive, this approach was enough to create accurate labels. We resized images to $64 \times 64$. Figure 5 represents reconstructed flowers with different values of the “pink” attribute. We can observe that the color of the flower changes in the desired direction, while keeping the background cleanly unchanged.
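The caption-based labeling rule (a color attribute is set when the color word appears in at least 5 of the 10 captions) can be sketched as below. The tokenization by lowercase letter runs, the function name and the example captions are assumptions made for illustration.

```python
import re


def color_attributes(captions, colors, min_mentions=5):
    """Binary color labels: 1 if the color word appears in >= min_mentions captions."""
    labels = {}
    for color in colors:
        mentions = sum(
            color in re.findall(r"[a-z]+", caption.lower()) for caption in captions
        )
        labels[color] = int(mentions >= min_mentions)
    return labels


# Ten hypothetical captions for one flower: six mention pink, four mention yellow.
captions = ["This flower has pink petals."] * 6 + ["A flower with yellow stamen."] * 4
labels = color_attributes(captions, ["pink", "yellow", "white"])
```

Matching whole words avoids counting a color that only occurs inside another word, and the threshold filters out colors mentioned by only a minority of annotators.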

Figure 5: Examples of reconstructed flowers with different values of the pink attribute. First row images are the originals. Increasing the value of that attribute will turn flower colors into pink, while decreasing it in images with originally pink flowers will make them turn yellow or orange.

6 Conclusion

We presented a new approach to generate variations of images by changing attribute values. The approach is based on enforcing the invariance of the latent space w.r.t. the attributes. A key advantage of our method compared to many recent models cyclegan ; image2image is that it generates realistic images of high resolution without needing to apply a GAN to the decoder output. As a result, it could easily be extended to other domains like speech or text, where backpropagation through the decoder can be really challenging, for instance because of the non-differentiable text generation process. However, methods commonly used in vision to assess the visual quality of the generated images, like PatchGAN, could also be applied on top of our model.


The authors would like to thank Yedid Hoshen for initial discussions about the core ideas of the paper, Christian Pursch and Alexander Miller for their help in setting up the experiments and Mechanical Turk evaluations. The authors are also grateful to David Lopez-Paz and Mouhamadou Moustapha Cisse for useful feedback and support on this project.