Conditional Autoencoders with Adversarial Information Factorization

11/14/2017 ∙ by Antonia Creswell, et al. ∙ Imperial College London

Generative models, such as variational auto-encoders (VAE) and generative adversarial networks (GAN), have been immensely successful in approximating image statistics in computer vision. VAEs are useful for unsupervised feature learning, while GANs alleviate supervision by penalizing inaccurate samples using an adversarial game. In order to utilize benefits of these two approaches, we combine the VAE under an adversarial setup with auxiliary label information. We show that factorizing the latent space to separate the information needed for reconstruction (a continuous space) from the information needed for image attribute classification (a discrete space), enables the capability to edit specific attributes of an image.




1 Introduction

Latent space generative models, such as generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015) and variational autoencoders (VAEs) (Kingma & Welling, 2013), learn a mapping from a latent encoding space to a data space, for example, the space of natural images. It has been shown that the latent space learned by these models is often organized in a near-linear fashion (Radford et al., 2015; Kingma & Welling, 2013), whereby neighbouring points in latent space map to similar images in data space. Certain “directions” in latent space correspond to changes in the intensity of certain attributes. In the context of faces, for example, a direction in latent space may correspond to the extent to which someone is smiling. This may be useful for image synthesis, where one can use the latent space to develop new design concepts (Dosovitskiy et al., 2017; Zhu et al., 2016), edit an existing image (Zhu et al., 2016) or synthesize avatars (Wolf et al., 2017; Taigman et al., 2016). This is because semantically meaningful changes may be made to images by manipulating the latent space (Radford et al., 2015; Zhu et al., 2016; Larsen et al., 2015).

One avenue of research for latent space generative models has been class conditional image synthesis (Chen et al., 2016; Odena et al., 2016; Mirza & Osindero, 2014), where an image of a particular object category is synthesized. Often, object categories may be sub-divided into fine-grained sub-categories. For example, the category “dog” may be split into further sub-categories of different dog breeds. Bao et al. (2017) propose latent space generative models for synthesizing images from fine-grained categories, in particular for synthesizing different celebrities’ faces conditional on the identity of the celebrity.

Rather than considering fine-grained categories, we propose to take steps towards solving the different, but related, problem of image attribute manipulation. To solve this problem we want to be able to synthesize images and change only one element, or attribute, of their content. For example, if we are synthesizing faces we would like to edit whether or not a person is smiling. This is a different problem to fine-grained synthesis; we want to be able to synthesize two faces that are similar, with only a single chosen attribute changed, rather than synthesizing two different faces. The need to synthesize two faces that are similar makes the problem of image attribute manipulation more difficult than the fine-grained image synthesis problem; we need to learn a latent space representation that separates an object category from its attributes.

In this paper, we propose a new model that learns a factored representation for faces, separating attribute information from the rest of the facial representation. We apply our model to the CelebA (Liu et al., 2015) dataset of faces and control several facial attributes.

Our contributions are as follows:

  1. Our core contribution is the novel cost function for training a VAE encoder to learn a latent representation which factorizes binary facial attribute information from a continuous identity representation (Section 3.2).

  2. We provide an extensive quantitative analysis of the contributions of each of the many loss components in our model (Section 4.1).

  3. We obtain classification scores that are competitive with state of the art (Zhuang et al., 2018) using the classifier that is already incorporated into the encoder of the VAE (Section 4.2).

  4. We provide qualitative results demonstrating that our latent variable generative model may be used to successfully edit the ‘Smiling’ attribute in the large majority of test cases (Section 4.3).

  5. We discuss and clarify the distinction between conditional image synthesis and image attribute editing (Section 5).

  6. We present code to reproduce experiments shown in this paper: (provided after review).

2 Latent space generative models

Latent space generative models come in various forms. Two state-of-the-art generative models are Variational Autoencoders (VAE) (Kingma & Welling, 2013) and Generative Adversarial Networks (GAN). Both models allow synthesis of novel data samples from latent encodings, and are explained below in more detail.

2.1 Variational Autoencoder (VAE)

Variational autoencoders (Kingma & Welling, 2013) consist of an encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$; oftentimes these can be instantiated as neural networks, $E_\phi(\cdot)$ and $D_\theta(\cdot)$ respectively, with learnable parameters, $\phi$ and $\theta$. A VAE is trained to maximize the evidence lower bound (ELBO) on $\log p(x)$, where $p(x)$ is the data-generating distribution. The ELBO is given by:

$$\log p(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - KL\left[q_\phi(z|x) \,\|\, p(z)\right] \tag{1}$$

where $p(z)$ is a chosen prior distribution such as $p(z) = \mathcal{N}(0, I)$. The encoder predicts $\mu_\phi(x)$ and $\sigma_\phi(x)$ for a given input $x$, and a latent sample, $\hat{z}$, is drawn from $q_\phi(z|x)$ as follows: $\epsilon \sim \mathcal{N}(0, I)$, then $\hat{z} = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$. By choosing a multivariate Gaussian prior, the KL-divergence may be calculated analytically (Kingma & Welling, 2013). The first term in the loss function is typically approximated by calculating the reconstruction error between many samples of $x$ and $\hat{x} = D_\theta(E_\phi(x))$.

New data samples, which are not present in the training data, are synthesised by first drawing latent samples from the prior, $z \sim p(z)$, and then drawing data samples from $p_\theta(x|z)$. This is equivalent to passing the $z$ samples through the decoder, $D_\theta(z)$.

VAEs offer both a generative model, $p_\theta(x|z)$, and an encoding model, $q_\phi(z|x)$, which are useful as starting points for image editing in the latent space. However, samples drawn from a VAE are often blurred (Radford et al., 2015).
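The sampling step and the analytic KL term described above can be illustrated with a minimal sketch. It assumes a diagonal Gaussian posterior and a standard normal prior; the function names `sample_latent` and `kl_to_standard_normal` and the example values are illustrative and do not come from the paper.

```python
import math
import random

def sample_latent(mu, sigma, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Analytic KL[N(mu, diag(sigma^2)) || N(0, I)] for a diagonal Gaussian."""
    return 0.5 * sum(m * m + s * s - 2.0 * math.log(s) - 1.0
                     for m, s in zip(mu, sigma))

# Illustrative encoder outputs for a single input (assumed values).
rng = random.Random(0)
mu, sigma = [0.5, -1.0], [1.0, 0.5]
z = sample_latent(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
```

The KL term is zero exactly when the predicted mean is zero and the predicted standard deviation is one, which is why it acts as a regularizer pulling the posterior towards the prior.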

2.2 Generative Adversarial Networks (GAN)

An alternative generative model, which may be used to synthesize much sharper images, is the Generative Adversarial Network (GAN) (Goodfellow et al., 2014; Radford et al., 2015). GANs consist of two models, a generator, $G_\theta$, and a discriminator, $C_\chi$, both of which may be implemented using convolutional neural networks (Radford et al., 2015; Denton et al., 2015). GAN training involves these two networks engaging in a mini-max game. The discriminator, $C_\chi$, is trained to classify samples from the generator, $G_\theta$, as being ‘fake’ and to classify samples from the data-generating distribution, $p(x)$, as being ‘real’. The generator is trained to synthesize samples that confuse the discriminator; that is, to synthesize samples that the discriminator cannot distinguish from the ‘real’ samples. The objective function is given by:

$$\min_\theta \max_\chi \; \mathbb{E}_{p(x)}\left[\log C_\chi(x)\right] + \mathbb{E}_{p_g(x)}\left[\log\left(1 - C_\chi(x)\right)\right] \tag{2}$$

where $p_g(x)$ is the distribution of synthesized samples, sampled by: $z \sim p(z)$, then $x = G_\theta(z)$, where $p(z)$ is a chosen prior distribution such as a multivariate Gaussian.
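As a rough illustration of the mini-max objective, the sketch below estimates its value from discriminator outputs on a batch of real and synthesized samples. The function name `gan_value` and the probabilities are assumptions made for the example, not values from the paper.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log C(x)] + E[log(1 - C(G(z)))].

    d_real: discriminator outputs (probability of 'real') on data samples.
    d_fake: discriminator outputs on synthesized samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake scores higher ...
confident = gan_value([0.9, 0.8], [0.1, 0.2])
# ... than one that is fully confused and outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to push this value up, while the generator tries to push it down; a fully confused discriminator gives the value $-2\log 2$.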

2.3 Best of both GAN and VAE

The vanilla GAN model does not provide a simple way to map data samples to latent space. Although there are several variants on the GAN that do involve learning an encoder-type model (Dumoulin et al., 2016; Donahue et al., 2016; Li et al., 2017), only the approach presented by Li et al. (2017) allows data samples to be faithfully reconstructed. The approach presented by Li et al. (2017) requires adversarial training to be applied to several high dimensional distributions. Training adversarial networks on high dimensional data samples remains challenging (Arjovsky & Bottou, 2017) despite several proposed improvements (Salimans et al., 2016; Arjovsky et al., 2017). For this reason, rather than adding a decoder to a GAN, we consider an alternative latent generative model that combines a VAE with a GAN. In this arrangement, the VAE may be used to learn an encoding and decoding process, and a discriminator may be placed after the decoder to ensure higher quality of the data samples output by the decoder. Indeed, there have been several suggestions on how to combine VAEs and GANs (Bao et al., 2017; Larsen et al., 2015; Mescheder et al., 2017), each with a different structure and set of loss functions; however, none are designed specifically for attribute editing.

The content of image samples synthesized from a vanilla VAE or GAN depends on the latent variable $z$, which is drawn from a specified random distribution, $p(z)$. For a well-trained model, synthesised samples will resemble samples in the training data. If the training data consists of images from multiple categories, synthesized samples may come from any, or possibly a combination, of those categories. For a vanilla VAE, it is not possible to choose to synthesize samples from a particular category. However, conditional VAEs (and GANs) (Chen et al., 2016; Odena et al., 2016; Mirza & Osindero, 2014) provide a solution to this problem as they allow synthesis of class-specific data samples.

2.4 Conditional VAEs

Autoencoders may be augmented in many different ways to achieve category-conditional image synthesis (Bao et al., 2017). It is common to append a one-hot label vector, $y$, to inputs of the encoder and decoder (Sohn et al., 2015). However, for small label vectors, relative to the size of the inputs to the encoder and the decoder model, it is possible for the label information, $y$, to be ignored (the label information in $y$ is less likely to be ignored when $y$ has relatively high dimensions compared to $z$ (Yan et al., 2016)). A more interesting approach, for conditional (non-variational and semi-supervised) autoencoders, is presented by Makhzani et al. (2015), where the encoder outputs both a latent vector, $\hat{z}$, and an attribute vector, $\hat{y}$. The encoder is updated to minimize a classification loss between the true label, $y$, and $\hat{y}$. We incorporate a similar architecture into our model with additional modifications to the training of the encoder for the reasons explained below.

There is a drawback to incorporating attribute information in the way described above (Makhzani et al., 2015) when the purpose of the model is to edit specific attributes, rather than to synthesize samples from a particular category. We observe that in this naive implementation of conditional VAEs, varying the attribute (or label) vector, $\hat{y}$, for a fixed $\hat{z}$ can result in unpredictable changes in synthesized data samples, $\hat{x}$. Consider for example the case where, for a fixed $\hat{z}$, modifying $\hat{y}$ does not result in any change in the intended corresponding attribute. This suggests that information about the attribute one wishes to edit, $y$, is partially contained in $\hat{z}$ rather than solely in $\hat{y}$. Similar problems have been discussed and addressed to some extent in the GAN literature (Chen et al., 2016; Mirza & Osindero, 2014; Odena et al., 2016), where it has been observed that label information in $\hat{y}$ is often ignored during sample synthesis.

In general, one may think that $\hat{z}$ and $\hat{y}$ should be independent. However, if attributes, $y$, that should be described by $\hat{y}$ remain unchanged for a reconstruction where only $\hat{y}$ is changed, this suggests that $\hat{z}$ contains most of the information that should have been encoded within $\hat{y}$. We propose a process to separate the information about $y$ from $\hat{z}$ using a mini-max optimization involving $y$, $\hat{z}$, the encoder $E_\phi$, and an auxiliary network $A_\psi$. We refer to our proposed process as ‘Adversarial Information Factorization’.

2.5 Adversarial Information Factorization

For a given image of a face, $x$, we would like to describe the face using a latent vector, $\hat{z}$, that captures the identity of the person, along with a single unit vector, $\hat{y} \in [0, 1]$, that captures the presence, or absence, of a single desired attribute, $y$. If a latent encoding, $\hat{z}$, contains information about the desired attribute, $y$, that should instead be encoded within the attribute vector, $\hat{y}$, then a classifier should be able to accurately predict $y$ from $\hat{z}$. Ideally, $\hat{z}$ contains no information about $y$ and so, ideally, a classifier should not be able to predict $y$ from $\hat{z}$. We propose to train an auxiliary network to predict $y$ from $\hat{z}$ accurately while updating the encoder of the VAE to output $\hat{z}$ values that cause the auxiliary network to fail. If $\hat{z}$ contains no information about the desired attribute, $y$, that we wish to edit, then the information can instead be conveyed in $\hat{y}$, since $\hat{x}$ must still contain that information in order to minimize reconstruction loss. We now formalize these ideas.

3 Method

In what follows, we explain our novel approach to training the encoder of a VAE, to factor (separate) out information about $y$ from $\hat{z}$, such that $H(y|\hat{z}) \approx H(y)$. We integrate this novel factorisation method into a VAE-GAN. The GAN component of the model is incorporated only to improve image quality. Our main contribution is our proposed adversarial method for factorising the label information, $y$, out of the latent encoding, $\hat{z}$.

3.1 Model Architecture

(a) Current work
(b) Previous work (Bao et al., 2017)
Figure 1: (a) Current work (adversarial information factorization). This figure shows our model where the core, shown in blue, is a VAE with information factorization. Note that $E_\phi$ is split in two, $E_{z,\phi}$ and $E_{y,\phi}$, to obtain both a latent encoding, $\hat{z}$, and the label, $\hat{y}$, respectively. $D_\theta$ is the decoder and $A_\psi$ the auxiliary network. The pink blocks show how a GAN architecture may be incorporated by placing a discriminator, $C_\chi$, after the decoder, $D_\theta$, and training $C_\chi$ to classify decoded samples as “fake” and samples from the dataset as “real”. Finally, the dashed lines show how decoded samples may be passed back through the encoder to obtain a label, $\hat{\hat{y}}$, which may be used to obtain a gradient, that contains label information, for updating the decoder of the VAE. For simplicity, the KL regularization is not shown in this figure. (b) Previous work: cVAE-GAN (Bao et al., 2017). Architecture most similar to our own. Note that there is no auxiliary network performing information factorization and a label, $\hat{\hat{y}}$, is predicted only for the reconstructed image, rather than for the input image.

A schematic of our architecture is presented in Figure 1. In addition to the encoder, $E_\phi$, decoder, $D_\theta$, and discriminator, $C_\chi$, we introduce an auxiliary network, $A_\psi$, whose purpose is described in detail in Section 3.2. We use $\hat{\hat{y}}$ to indicate the predicted label of a reconstructed data sample. Additionally, we incorporate a classification model into the encoder so that our model may easily be used to perform classification tasks.

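The parameters of the decoder, $\theta$, are updated with gradients from the following loss function:

$$\mathcal{L}_{dec} = \mathcal{L}_{rec} + \beta \hat{\mathcal{L}}_{class} - \delta \mathcal{L}_{gan} \tag{3}$$

where $\beta$ and $\delta$ are regularization coefficients, $\mathcal{L}_{rec}$ is a reconstruction loss and $\hat{\mathcal{L}}_{class}$ is a classification loss on reconstructed data samples. The classification loss, $\hat{\mathcal{L}}_{class}$, provides a gradient containing label information to the decoder, which otherwise the decoder would not have (Chen et al., 2016). The GAN loss is given by $\mathcal{L}_{gan} = \frac{1}{3}\left[\mathcal{L}_{bce}(y_{real}, C_\chi(x)) + \mathcal{L}_{bce}(y_{fake}, C_\chi(D_\theta(E_\phi(x)))) + \mathcal{L}_{bce}(y_{fake}, C_\chi(D_\theta(z)))\right]$, where $y_{real}$ and $y_{fake}$ are vectors of ones and zeros respectively. Note that $\mathcal{L}_{bce}$ is the binary cross-entropy loss given by $\mathcal{L}_{bce}(a, b) = -\frac{1}{M}\sum_{i=1}^{M}\left[a_i \log b_i + (1 - a_i)\log(1 - b_i)\right]$. The discriminator parameters, $\chi$, are updated to minimize $\mathcal{L}_{gan}$.

The parameters of the encoder, $\phi$, intended for use in synthesizing images from a desired category, may be updated by minimizing the following function:

$$\mathcal{L}_{enc} = \mathcal{L}_{rec} + \alpha \mathcal{L}_{KL} + \rho \mathcal{L}_{class} - \delta \mathcal{L}_{gan} \tag{4}$$

where $\alpha$ and $\rho$ are additional regularization coefficients; $\mathcal{L}_{KL} = KL\left[q_\phi(z|x) \,\|\, p(z)\right]$ is the KL regularization term and $\mathcal{L}_{class}$ is the classification loss on the input image. Unfortunately, the loss function in Equation (4) is not sufficient for training an encoder used for attribute manipulation. For this, we propose an additional network and cost function, as described below.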

3.2 Adversarial Information Factorisation

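To factor label information, $y$, out of $\hat{z}$ we introduce an additional auxiliary network, $A_\psi$, that is trained to correctly predict $y$ from $\hat{z}$. The encoder, $E_\phi$, is simultaneously updated to cause $A_\psi$ to make incorrect classifications. In this way, the encoder is encouraged not to place attribute information, $y$, in $\hat{z}$. This may be described by the following mini-max objective:

$$\max_\phi \min_\psi \; \mathcal{L}_{aux}\left(y, A_\psi(E_{z,\phi}(x))\right) \tag{5}$$

where $\mathcal{L}_{aux}$ is the classification loss of the auxiliary network and $\hat{z} = E_{z,\phi}(x)$ is the latent output of the encoder.

Training is complete when the auxiliary network, $A_\psi$, is maximally confused and cannot predict $y$ from $\hat{z} = E_{z,\phi}(x)$, where $y$ is the true label of $x$. The encoder loss is therefore given by:

$$\mathcal{L}_{IFcVAE-GAN} = \mathcal{L}_{enc} - \mathcal{L}_{aux} \tag{6}$$

We call the conditional VAE-GAN trained in this way an Information Factorization cVAE-GAN (IFcVAE-GAN). The training procedure is presented in Algorithm 1.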

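procedure Training cVAE with information factorization
    for $i$ in range($N$) do    ($N$ is the no. of epochs)
        # forward pass through all networks
        $x \sim \mathcal{D}$    ($\mathcal{D}$ is the training data)
        $z \sim p(z)$
        $\hat{z}, \hat{y} \leftarrow E_\phi(x)$
        $\hat{x} \leftarrow D_\theta(\hat{z}, \hat{y})$
        $\tilde{y} \leftarrow A_\psi(\hat{z})$    (output of the auxiliary network)
        # calculate updates and do updates
        update $\theta$ to min. $\mathcal{L}_{dec}$
        update $\phi$ to min. $\mathcal{L}_{enc} - \mathcal{L}_{aux}$    (update the encoder to max. $\mathcal{L}_{aux}$)
        update $\chi$ to min. $\mathcal{L}_{gan}$
        update $\psi$ to min. $\mathcal{L}_{aux}$    (update auxiliary network to min. $\mathcal{L}_{aux}$)
    end for
end procedure
Algorithm 1 Training Information Factorization cVAE-GAN (IFcVAE-GAN): The prior, $p(z) = \mathcal{N}(0, I)$.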
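The adversarial part of the procedure can be sketched on a toy problem. The example below is not the authors' implementation: it assumes a one-parameter "encoder" whose latent code is `z = w * s + n` (with `s = 2y - 1`, so `w` measures how much label information leaks into `z`) and a logistic-regression "auxiliary network". All names and values are hypothetical. One alternating step moves the encoder parameter to increase the auxiliary loss and the auxiliary parameters to decrease it.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def aux_loss_and_grads(w, a, b, labels, noise):
    """BCE of the toy auxiliary classifier and its gradients.

    Toy encoder (assumption): z = w * s + n, with s = 2y - 1 and label-free n.
    Toy auxiliary network (assumption): p = sigmoid(a * z + b).
    """
    n = len(labels)
    loss = grad_w = grad_a = grad_b = 0.0
    for y, eps in zip(labels, noise):
        s = 2.0 * y - 1.0
        z = w * s + eps
        p = sigmoid(a * z + b)
        loss += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)) / n
        err = p - y                 # d(loss)/d(logit) for sigmoid + BCE
        grad_a += err * z / n
        grad_b += err / n
        grad_w += err * a * s / n   # chain rule through z = w * s + eps
    return loss, grad_w, grad_a, grad_b

def adversarial_step(w, a, b, labels, noise, lr=0.1):
    """One alternating update: encoder ascends L_aux, auxiliary descends it."""
    _, grad_w, _, _ = aux_loss_and_grads(w, a, b, labels, noise)
    w = w + lr * grad_w             # encoder: maximize L_aux
    _, _, grad_a, grad_b = aux_loss_and_grads(w, a, b, labels, noise)
    a = a - lr * grad_a             # auxiliary network: minimize L_aux
    b = b - lr * grad_b
    return w, a, b

labels = [1.0, 1.0, 0.0, 0.0]
noise = [0.5, -0.5, 0.5, -0.5]
w1, a1, b1 = adversarial_step(1.0, 0.5, 0.0, labels, noise)
```

In this toy setting the encoder step reduces the leak `w`, while the auxiliary step sharpens the classifier. In the full model the encoder update would also include the remaining encoder loss terms, which this sketch omits.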

3.3 Attribute Manipulation

To edit an image such that it has a desired attribute, we encode the image to obtain a $\hat{z}$, the identity representation, append it to our desired attribute label, $\hat{y} \leftarrow y$, and pass this through the decoder. We use $\hat{y} = 0$ and $\hat{y} = 1$ to synthesize samples in each mode of the desired attribute, e.g. ‘Smiling’ and ‘Not Smiling’. Thus, attribute manipulation becomes a simple ‘switch flipping’ operation in the representation space.
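The ‘switch flipping’ operation can be sketched as follows. The `toy_encode` and `toy_decode` functions are hypothetical stand-ins for a trained encoder and decoder (here the last element of the "image" plays the role of the attribute); only `edit_attribute` reflects the procedure described above.

```python
def edit_attribute(encode, decode, image, target_label):
    """Re-decode an image's identity code with a chosen attribute label."""
    z_hat, _predicted_label = encode(image)  # keep identity, drop predicted label
    return decode(z_hat, target_label)

# Hypothetical stand-ins for trained networks (assumptions for illustration).
def toy_encode(image):
    *identity, attribute = image
    return identity, int(round(attribute))

def toy_decode(z_hat, y_hat):
    return list(z_hat) + [float(y_hat)]

smiling = [0.2, 0.7, 1.0]
not_smiling = edit_attribute(toy_encode, toy_decode, smiling, 0)
still_smiling = edit_attribute(toy_encode, toy_decode, smiling, 1)
```

Because the identity code is reused unchanged, only the attribute component of the output differs between the two edits.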

4 Results

In this section, we show both quantitative and qualitative results to evaluate our proposed model. We begin by quantitatively assessing the contribution of each component of our model in an ablation study. Following this we perform facial attribute classification using our model. We use a standard deep convolutional GAN, DCGAN, architecture for the ablation study (Radford et al., 2015), and subsequently incorporate residual layers (He et al., 2016) into our model in order to achieve competitive classification results compared with a state of the art model (Zhuang et al., 2018). We finish with a qualitative evaluation of our model, demonstrating how our model may be used for image attribute editing. For our qualitative results we continue to use the same residual networks as those used for classification, since these also improved visual quality.

We refer to any cVAE-GAN that is trained without an $\mathcal{L}_{aux}$ term in the cost function as a naive cVAE-GAN and a cVAE-GAN trained with the $\mathcal{L}_{aux}$ term as an Information Factorization cVAE-GAN (IFcVAE-GAN).

4.1 Quantifying contributions of each component to the final model

Table 1 shows the contribution of each component of our proposed model. We consider reconstruction error and classification accuracy on synthesized data samples. Smaller reconstruction error indicates better reconstruction, and larger classification values ($\hat{y}_{\text{Smiling}}$ and $\hat{y}_{\text{Not-Smiling}}$) suggest better control over attribute changes. To obtain $\hat{y}_{\text{Smiling}}$ and $\hat{y}_{\text{Not-Smiling}}$ values, we use an independent classifier, trained on real data samples to classify ‘Smiling’ vs. ‘Not Smiling’. We apply the trained classifier to two sets of image samples synthesized using $\hat{y} = 1$ and $\hat{y} = 0$. If the desired attributes are changed, the classification scores should be high for both sets of samples. If, instead, the desired attributes remain unchanged, the classifier is likely to perform well on only one of the sets, indicating that the attribute was not edited but fixed. Note that all original test data samples for this experiment were from the ‘Smiling’ category. The results are shown in Table 1, where the classification scores ($\hat{y}_{\text{Smiling}}$, $\hat{y}_{\text{Not-Smiling}}$) may be interpreted as the proportion of samples with the desired attributes and the MSE interpreted as the fidelity of reconstruction. From Table 1, we make the following observations:

Effect of $\hat{\mathcal{L}}_{class}$: Using $\hat{\mathcal{L}}_{class}$ does not provide any clear benefit. We explored the effect of including this term since a similar approach had been proposed in the GAN literature (Chen et al., 2016; Odena et al., 2016) for conditional image synthesis (rather than attribute editing). To the best of our knowledge, this approach has not been used in the VAE literature. This term is intended to maximise $I(x; y)$ by providing a gradient containing label information to the decoder; however, it does not contribute to the factorization of attribute information, $y$, from $\hat{z}$.

Effect of Information Factorization: Without our proposed $\mathcal{L}_{aux}$ term in the encoder loss function, the model fails completely to perform attribute editing. Since $\hat{y}_{\text{Smiling}} + \hat{y}_{\text{Not-Smiling}} \approx 1$, this strongly suggests that samples are synthesized independently of $\hat{y}$ and that the synthesized images are the same for $\hat{y} = 0$ and $\hat{y} = 1$.

Effect of $\hat{\mathcal{L}}_{class}$ on its own: For completeness, we also evaluated our model without $\mathcal{L}_{aux}$ but with $\hat{\mathcal{L}}_{class}$ to test the effect of $\hat{\mathcal{L}}_{class}$ on its own. Though similar approaches have been successful for category conditional image synthesis, it was not as successful on the attribute editing task. As above, $\hat{y}_{\text{Smiling}} + \hat{y}_{\text{Not-Smiling}} \approx 1$, suggesting that samples are synthesized independently of $\hat{y}$. Furthermore, $\hat{y}_{\text{Not-Smiling}} = 0$, which suggests that none of the synthesized images had the desired attribute (‘Not Smiling’), i.e. all samples have the attribute ‘Smiling’. This supports the use of $\mathcal{L}_{aux}$, when training models for attribute editing, over $\hat{\mathcal{L}}_{class}$, despite the promotion of the latter in the GAN literature (Chen et al., 2016; Odena et al., 2016) for category specific sample synthesis.

Model                                                              MSE     $\hat{y}_{\text{Not-Smiling}}$   $\hat{y}_{\text{Smiling}}$
Ours (without $\hat{\mathcal{L}}_{class}$)                         0.028   81.3%    100.0%
With $\hat{\mathcal{L}}_{class}$                                   0.028   93.8%    93.8%
Without $\mathcal{L}_{aux}$, without $\hat{\mathcal{L}}_{class}$   0.028   18.8%    81.3%
Without $\mathcal{L}_{aux}$, with $\hat{\mathcal{L}}_{class}$      0.027   0.0%     100.0%
Table 1: What are the essential parts of the IFcVAE-GAN? This table shows how novel components of the IFcVAE-GAN loss function affect mean squared (reconstruction) error, MSE, and the ability to edit facial attributes in an image. Ability to edit attributes is quantified by a pair of classification accuracies, $\hat{y}_{\text{Not-Smiling}}$ and $\hat{y}_{\text{Smiling}}$, on samples synthesized with $\hat{y} = 0$ and $\hat{y} = 1$ respectively. These values may be thought of as the proportion of synthesized images that have the desired attribute. We used hyper-parameters: .
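The evaluation protocol behind Table 1 can be sketched as follows: an independent classifier labels the two sets of synthesized samples, the fraction with the requested attribute is computed for each set, and a pair of fractions that sums to roughly one indicates that the same images were produced for both label settings. The function names and the tolerance are assumptions for illustration; the numbers in the usage lines are the scores reported in Table 1.

```python
def attribute_rate(predictions, target_label):
    """Fraction of synthesized samples that the classifier assigns to target_label."""
    return sum(1 for p in predictions if p == target_label) / len(predictions)

def label_is_ignored(rate_not_smiling, rate_smiling, tol=0.05):
    """True when the two rates sum to about 1, i.e. the output does not depend on the label."""
    return abs(rate_not_smiling + rate_smiling - 1.0) <= tol

# Classifier outputs on samples synthesized with label 0 (toy values, assumed).
rate = attribute_rate([0, 0, 1, 0], target_label=0)

# Scores from Table 1: rows trained without the auxiliary term sum to about 100%.
naive = label_is_ignored(0.188, 0.813)
class_only = label_is_ignored(0.0, 1.0)
ours = label_is_ignored(0.813, 1.0)
```

The tolerance of 0.05 is an arbitrary choice for this sketch; the paper states the criterion only qualitatively.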

4.2 Facial Attribute Classification

We have proposed a model that learns a representation, $E_{z,\phi}(x)$, for faces such that the identity of the person, encoded in $\hat{z}$, is factored from a particular facial attribute. We achieve this by minimizing the mutual information between the identity encoding and the facial attribute encoding to ensure that $H(y|\hat{z}) \approx H(y)$, while also training $E_{y,\phi}(x)$ as an attribute classifier. Our training procedure encourages the model to put all label information into $\hat{y}$, rather than $\hat{z}$. This suggests that our model may be useful for facial attribute classification.

To further illustrate that our model is able to separate the representation of particular attributes from the representation of the person’s identity, we can measure the ability of the model, specifically the encoder, to classify facial attributes. We proceed to use $E_{y,\phi}$ directly for facial attribute classification and compare the performance of our model to that of a state of the art classifier proposed by Zhuang et al. (2018). Results in Figure 2 show that our model is highly competitive with a state of the art facial attribute classifier, outperforming Zhuang et al. (2018) on several categories and remaining competitive in most other attributes. These results demonstrate that the model is effectively factorizing out information about the attribute from the identity representation.

Figure 2: Facial Attribute Classification. We compare the performance of our classifier, $E_{y,\phi}$, to a state of the art classifier (Zhuang et al., 2018). Our model outperforms that of Zhuang et al. (2018) for several attributes and achieves comparable results for most other attributes.

4.3 Qualitative Results

In this section, we focus on attribute manipulation (described previously in Section 3.3). Briefly, this involves reconstructing an input image, $x$, for different attribute values, $\hat{y} \in \{0, 1\}$.

We begin by demonstrating how a naive cVAE-GAN (Bao et al., 2017) may fail to edit desired attributes, particularly when it is trained to achieve low reconstruction error. The work of Bao et al. (2017) focused solely on the ability to synthesise images with a desired attribute, rather than to reconstruct a particular image and specifically edit one of its attributes. It is challenging to learn a representation that both preserves identity and allows factorisation (Higgins et al., 2016). Figure 3(c,e) shows reconstructions when setting $\hat{y} = 0$ for ‘Not Smiling’ and $\hat{y} = 1$ for ‘Smiling’. We found that the naive cVAE-GAN (Bao et al., 2017) failed to synthesise samples with the desired target attribute ‘Not Smiling’. This failure demonstrates the need for models that can deal with both reconstruction and attribute-editing. Note that we achieve good reconstruction by reducing the weightings on the KL and GAN loss terms. We trained the model using RMSProp (Tieleman & Hinton, 2012) with momentum in the discriminator.

We train our proposed IFcVAE-GAN model using the same optimiser and hyper-parameters that were used for the Bao et al. (2017) model above. We also used the same number of layers (and residual layers) in our encoder, decoder and discriminator networks as those used by Bao et al. (2017). Under this set-up, we used the following additional hyper-parameters: in our model. Figure 3 shows reconstructions when setting $\hat{y} = 0$ for ‘Not Smiling’ and $\hat{y} = 1$ for ‘Smiling’. In contrast to the naive cVAE-GAN (Bao et al., 2017), our model is able to achieve good reconstruction, capturing the identity of the person, while also being able to change the desired attribute. Table 2 shows that the model was able to synthesize images with the ‘Not Smiling’ attribute with a 98% success rate, compared with a 22% success rate using the naive cVAE-GAN (Bao et al., 2017).

(a) Original Smiling Faces.
(b) Smiling (ours).
(c) Smiling (Bao et al., 2017).
(d) Not Smiling (ours).
(e) (Failed) Not Smiling (Bao et al., 2017).
Figure 3: Reconstructions, ‘Smiling’ and ‘Not Smiling’. The goal here was to reconstruct the face, changing only the desired ‘Smiling’ attribute. This demonstrates how other conditional models (Bao et al., 2017) may fail at the image attribute editing task, when high quality reconstructions are required. Both models are trained with the same optimizers and hyper-parameters.

Model                                        MSE     $\hat{y}_{\text{Not-Smiling}}$   $\hat{y}_{\text{Smiling}}$
Ours (with residual layers)                  0.011   98%      100%
Bao et al. (2017) (with residual layers)     0.011   22%      85%

Table 2: Comparing our model, the IFcVAE-GAN, to the naive cVAE-GAN (Bao et al., 2017). Ability to edit attributes is quantified by a pair of classification accuracies, $\hat{y}_{\text{Not-Smiling}}$ and $\hat{y}_{\text{Smiling}}$, on samples synthesized with $\hat{y} = 0$ and $\hat{y} = 1$ respectively. These values may be thought of as the proportion of synthesized images that have the desired attribute. We see that both models achieve comparable (MSE) reconstruction errors; however, only our model is able to synthesize images of faces without smiles. A complete ablation study for this model (with residual layers) is given in the appendix (Table 3).

4.4 Editing Other Facial Attributes

In this section we apply our proposed method to manipulate other facial attributes, where the initial samples, from which the $\hat{z}$’s are obtained, are test samples whose labels are $y = 1$, indicating the presence of the desired attribute (e.g. ‘Blonde Hair’). In Figure 4, we observe that our model is able to both achieve high quality reconstruction and edit the desired attributes.

Figure 4: Editing other attributes. We obtain a $\hat{z}$, the identity representation, by passing an image, $x$, through the encoder. We append $\hat{z}$ with a desired attribute label, $\hat{y} \leftarrow y$, and pass this through the decoder. We use $\hat{y} = 0$ and $\hat{y} = 1$ to synthesize samples in each mode of the desired attribute.

We have presented the novel IFcVAE-GAN model, and (1) demonstrated that our model learns to factor attributes from identity, (2) performed an ablation study to highlight the benefits of using an auxiliary classifier to factorize the representation and (3) shown that our model may be used to achieve competitive scores on a facial attribute classification task. We now discuss this work in the context of other related approaches.

5 Comparison to Related Work

We have used adversarial training (involving an auxiliary classifier) to factor attribute label information, $y$, out of the encoded latent representation, $\hat{z}$. Schmidhuber (2008) performs similar factorization of the latent space, ensuring that each component of the encoding is independent. This is achieved by learning an encoding such that other elements in the encoding may not be predicted from a subset of remaining elements. We use related concepts, with additional class label information, and incorporate the encoding in a generative model.

Our work has the closest resemblance to the cVAE-GAN architecture (see Figure 1) proposed by Bao et al. (2017). cVAE-GAN is designed for synthesizing samples of a particular class, rather than manipulating a single attribute of an image from a class. In short, their objective is to synthesize a “Hathaway” face, whereas our objective would be to make “Hathaway smiling” or “Hathaway not smiling”, which places different demands on the type of factorization in the latent representation. Separating categories is a simpler problem since it is possible to have distinct categories and changing categories may result in more noticeable changes in the image. Changing an attribute requires a specific and targeted change with minimal changes to the rest of the image. Additionally, our model simultaneously learns a classifier for input images, unlike the work by Bao et al. (2017).

In a similar vein to our work, Antipov et al. (2017) acknowledge the need for “identity preservation” in the latent space. They achieve this by introducing an identity classification loss between an input data sample and a reconstructed data sample, rather than trying to separate information in the encoding itself. Similar to our work, Larsen et al. (2015) use a VAE-GAN architecture. However, they do not condition on label information and their image “editing” process is not done in an end-to-end fashion. (Larsen et al. (2015) traverse the latent space along an attribute vector found by taking the mean difference between encodings of several samples with the same attribute. Additionally, in Figure 5 of Larsen et al. (2015), changing one attribute results in other attributes changing; for example, in the bottom row, when changing the ‘blonde hair’ attribute, the woman’s make-up changes too.)

Our work highlights an important difference between category conditional image synthesis (Bao et al., 2017) and attribute editing in images: what works for category conditional image synthesis may not work for attribute editing. Furthermore, we have shown (Section 4.1) that for attribute editing to be successful, it is necessary to factor label information out of the latent encoding.

In this paper, we have focused on latent space generative models, where a small change in latent space results in a semantically meaningful change in image space. Our approach is orthogonal to a class of image editing models, called “image-to-image” models, which aim to learn a single latent representation for images in different domains. Recently, there has been progress in image-to-image domain adaptation, whereby an image is translated from one domain (e.g. a photograph of a scene) to another domain (e.g. a painting of a similar scene) (Zhu et al., 2017; Liu et al., 2017; Liu & Tuzel, 2016). Image-to-image methods may be used to translate smiling faces to non-smiling faces (Liu et al., 2017; Liu & Tuzel, 2016); however, these models require significantly more resources than ours. (While our approach requires a single generative model, the approaches of Liu et al. (2017) and Liu & Tuzel (2016) require a pair of generator networks, one for each domain.) By performing factorization in the latent space, we are able to use a single generative model to edit an attribute by simply changing a single unit of the encoding, $\hat{y}$, from $0$ to $1$ or vice versa.

6 Conclusion

We have proposed a novel perspective and approach to learning representations of images which subsequently allows elements, or attributes, of the image to be modified. We have demonstrated our approach on images of the human face; however, the method is generalisable to other objects. We modelled a human face in two parts, with a continuous latent vector that captures the identity of a person and a binary unit vector that captures a facial attribute, such as whether or not a person is smiling. By modelling an image with two separate representations, one for the object and the other for the object’s attribute, we are able to change attributes without affecting the identity of the object. To learn this factored representation we have proposed a novel model, named Information Factorization conditional VAE-GAN. The model encourages the attribute information to be factored out of the identity representation via an adversarial learning process. Crucially, the representation learned by our model both captures identity faithfully and facilitates accurate and easy attribute editing without affecting identity. We have demonstrated that our model performs better than pre-existing models intended for category conditional image synthesis (Section 4.3), and have performed a detailed ablation study (Table 1) which confirms the importance and relevance of our proposed method. Indeed, our model is highly effective as a classifier, achieving state of the art accuracy on facial attribute classification for several attributes (Figure 2). Our approach to learning factored representations for images is both a novel and important contribution to the general field of representation learning.



Ablation Study For Our Model With Residual Layers

For completeness we include a table (Table 3) demonstrating an ablation study for our model with the residual network architecture discussed in Section 4.3. Note that this is the same architecture that was used by Bao et al. (2017). Table 3 and, additionally, Figure 5 demonstrate the need for the $\mathcal{L}_{aux}$ loss and show that increased regularisation reduces reconstruction quality. The table also shows that there is no significant benefit to using the $\hat{\mathcal{L}}_{class}$ loss. These findings are consistent with those of the ablation study in the main body of the text for the IFcVAE-GAN with the GAN architecture of Radford et al. (2015).

Model                                                           MSE     $\hat{y}_{\text{Not-Smiling}}$   $\hat{y}_{\text{Smiling}}$   Acc. ($E_{y,\phi}$)
Ours (with residual layers)                                     0.011   98%      100.0%   92%
Higher levels of regularization                                 0.020   100%     100%     92%
With $\hat{\mathcal{L}}_{class}$                                0.010   96%      100%     91%
Without $\mathcal{L}_{aux}$                                     0.013   28%      91%      91%
Without $\mathcal{L}_{aux}$, with $\hat{\mathcal{L}}_{class}$   0.019   33%      96%      89%
Bao et al. (2017)                                               0.011   22%      85%      n/a*
Table 3: What are the essential parts of the IFcVAE-GAN (with residual layers)? This table shows how novel components of the IFcVAE-GAN loss function affect mean squared (reconstruction) error, MSE, and the ability to edit facial attributes in an image. Ability to edit attributes is quantified by a pair of classification accuracies, $\hat{y}_{\text{Not-Smiling}}$ and $\hat{y}_{\text{Smiling}}$, on samples synthesized with $\hat{y} = 0$ and $\hat{y} = 1$ respectively. These values may be thought of as the proportion of synthesized images that have the desired attribute. We use hyper-parameters: . We also show classification accuracy (Acc.) of the classifier in the encoder, $E_{y,\phi}$.
*Note that the model of Bao et al. (2017) does not incorporate a classifier.
(a) Original Smiling Faces.
(b) Smiling (ours).
(c) Smiling (without $\mathcal{L}_{aux}$).
(d) Not Smiling (ours).
(e) (Failed) Not Smiling (without $\mathcal{L}_{aux}$).
Figure 5: Reconstructions, ‘Smiling’ and ‘Not Smiling’, with and without $\mathcal{L}_{aux}$. The goal here was to reconstruct the face, changing only the desired ‘Smiling’ attribute. This figure demonstrates the need for the $\mathcal{L}_{aux}$ term in our model. Both models are trained with the same optimizers and hyper-parameters.