Code to reproduce the results in the paper "Adversarial Learning of Disentangled and Generalizable Representations for Visual Attributes"
Recently, a multitude of methods for image-to-image translation have demonstrated impressive results on problems such as multi-domain or multi-attribute transfer. The vast majority of such works leverage the strengths of adversarial learning in tandem with deep convolutional autoencoders to achieve realistic results by capturing the target data distribution well. Nevertheless, the most prominent representatives of this class of methods do not facilitate semantic structure in the latent space, and usually rely on domain labels for test-time transfer. This leads to rigid models that are unable to capture the variance of each domain label. In this light, we propose a novel adversarial learning method that (i) facilitates latent structure by disentangling sources of variation based on a novel cost function and (ii) encourages learning generalizable, continuous and transferable latent codes that can be utilized for tasks such as unpaired multi-domain image transfer and synthesis, without requiring labelled test data. The resulting representations can be combined in arbitrary ways to generate novel hybrid imagery, for example mixtures of identities. We demonstrate the merits of the proposed method by a set of qualitative and quantitative experiments on popular databases, where our method clearly outperforms other state-of-the-art methods. Code for reproducing our results can be found at: https://github.com/james-oldfield/adv-attribute-disentanglement
Image-to-image translation methods learn a non-linear mapping of an image in a source domain to its corresponding image in a target domain. The notion of domain varies, depending on the application. For instance, in the context of super-resolution the source domain consists of low-resolution images while the corresponding high-resolution images belong to the target domain. In a visual attribute transfer setting, ‘domain’ denotes face images with the same attribute, where the attribute either describes intrinsic facial characteristics (e.g., identity, facial expressions, age, gender, etc.) or captures external sources of appearance variation related, for example, to different poses or illumination conditions. In the latter setting, the task is to change the attributes for a given face image.
Recently, deep generative models trained through adversarial learning have been shown capable of generating naturalistic images that look authentic to the human observer. Deep generative models for image-to-image translation implement a mapping between two or multiple image domains, in a paired or unpaired [29, 3, 19, 14, 10] fashion. Despite their merits in pushing forward the state of the art in image generation, we posit that widely adopted image-to-image translation models (namely CycleGAN, Pix2Pix and StarGAN) also come with a set of shortcomings. For example, none of these methods provide semantically meaningful latent structure linked to specific attributes. In addition, the generated images do not cover the entire variance of the target domain: in most cases a single image is generated given a discrete attribute value. That value is also required at test time, so these models cannot generate images when the attribute label has not been observed during training. For example, changing the “smile” attribute of a facial image will always lead to a smile of specific intensity (Fig. 1).
In this paper, we propose a method that facilitates learning disentangled, generalizable, and continuous representations of visual data with respect to attributes acting as sources of variation. The proposed method can readily be used to generate varying intensity expressions in images (Fig. 1), while also being equipped with several other features that enable generating novel hybrid imagery on unseen data. Key contributions of this work are summarized below.
Firstly, a novel loss function for learning disentangled representations is proposed. The loss function ensures that latent representations corresponding to an attribute (a) have discriminative power within the attribute class, while (b) being invariant to other sources linked to other attributes. For example, given a facial image with a specific identity and expression, representations of the identity attribute should classify all identities well, but should fail to classify the expressions well: that is, the conditional distribution of the class posteriors should be uniform over the expression labels.
Secondly, we propose a novel method that encourages the disentangled representations to be generalizable. The particular loss function enables the network to generate realistic images that are classified appropriately, even when sampling representations from different samples, and without requiring paired data. This enables the representations to capture the attribute variation well, as shown in Fig. 1, in contrast to other methods that simply consider a target label for transfer. We highlight that the expected value of the latent codes over a single attribute can be considered in our case as equivalent to an attribute label.
Finally, we provide a set of rigorous experiments to demonstrate the capabilities of the proposed method on databases such as MultiPIE, BU-3DFE, and RaFD. Given generalizable and disentangled representations on a multitude of attributes (e.g., expression, identity, illumination, gaze, color), the proposed method can perform arbitrary combinations of the latent codes in order to generate novel imagery. For example, we can swap an arbitrary number of attribute representations amongst test samples to perform intensity-preserving multiple attribute transfer and synthesis, without knowing the test labels. Most interestingly, we can combine an arbitrary number of e.g., identity latent codes, in order to generate novel subjects that preserve a mixture of characteristics, and can be particularly useful for tasks such as data augmentation. Both qualitative and quantitative results corroborate the improved performance of the proposed approach over state-of-the-art image-to-image translation methods.
Generative Adversarial Networks (GANs) approach the training of deep generative models from a game theory perspective by solving a minimax game. That is, GANs learn a distribution that matches the real data distribution and generate new images by sampling from the estimated distribution. Such an adversarial learning approach has been successfully employed in a wide range of computer vision tasks, e.g., [18, 25, 13]. Conditional variants of GANs condition the generator on a particular class label. This allows fine-grained control over the generator in targeting particular modes of interest, facilitating predictable manipulation and generation of imagery [11, 26, 1, 23].
In a vanilla GAN paradigm, however, there is no way to impose a particular semantic structure on the random noise vector from the prior distribution. Consequently, it is hard to drive desired change in generated images via latent space manipulation. Such limitations have been mitigated by imposing structure on the noise prior, and also in the context of Variational Autoencoders (VAE) [21, 4, 15]. Similarly, InfoGAN learns disentangled representations in a completely unsupervised manner. Nevertheless, the aforementioned deep generative models tend to yield blurry results, and having very low-dimensional bottlenecks often means the resolution of the generated imagery is compromised. As opposed to VAE-based models, the proposed method is able to synthesize sharp, realistic images.
Several GAN-based image-to-image translation models have achieved great success in both paired and unpaired settings [11, 29] by combining traditional reconstruction losses (e.g. reconstruction penalties) with adversarial terms to enhance the visual clarity of the model’s outputs. In CycleGAN, DiscoGAN and DualGAN the so-called ‘cycle-consistency’ loss facilitates unpaired image-to-image domain translation. Coupled GAN and its extension assume a shared latent space to learn a joint distribution of different, unpaired domains. More recently, StarGAN enabled image-to-image translation across multiple domains with the use of a single conditional generative model. The latter can flexibly translate between multiple target domains, with the generator being a function of both the input data and the target domain label. FaderNet translates input images to new ones by manipulating attribute values, applying the discriminator to the latent space.
In this section, we provide a detailed description of the methodology proposed in this work, which focuses on learning disentangled and generalizable latent representations of visual attributes. Concretely, in Section 3.1 we describe the generative model employed in this work. In Section 3.2, we introduce the proposed loss functions and method for disentangling these representations in latent space, such that they (i) capture variations that relate to a given attribute well by enriching features with discriminative power (e.g., for identity), and (ii) fail to classify any other attributes well, by encouraging the classifier posterior distribution over values of other attributes (e.g., expression) to be uniform. In Section 3.3, we describe the optimization procedure that is tailored towards encouraging recovered representations to be generalizable, that is, usable for generating novel, realistic images from arbitrary samples and attributes. The full objective function employed is described in Section 3.4, and implementation details regarding the full network, which is trained in an end-to-end fashion, are provided in Section 3.5. An overview of the proposed method is illustrated in Fig. 2.
Given a dataset $\{x^{(i)}\}_{i=1}^{N}$ with $N$ samples, we assume that each sample $x$ is associated with a set of $M$ attributes that act as sources of visual variation (such as identity, expression, illumination). If not omitted, the superscript $(i)$ denotes that $x^{(i)}$ is the $i$-th sample in the dataset or mini-batch. We further assume a set of labels $y_m$ corresponding to each attribute $m$. We aim to recover disentangled, latent representations $z_m$ that capture variation relating to attribute $m$, while being invariant to variations sourced from the remaining attributes. We therefore assume the following generative model

$$x = D(z_0, z_1, \dots, z_M) = D\big(E_0(x), E_1(x), \dots, E_M(x)\big), \tag{1}$$

where $E_m$ is an encoder mapping $x$ to a space that preserves variance for attribute $m$, while $D$ is a decoder mapping back to input space. Note that when $m = 0$, the corresponding representations $z_0$ represent variation present in an image that does not relate to any attribute. For example, assuming a dataset of facial images, if $M = 2$ with $m = 1$ corresponding to identity and $m = 2$ to expression, $z_0$ captures other variations such as background information. To ensure that the set of representations faithfully reconstructs the original image $x$, we impose a standard reconstruction loss between $x$ and the decoder output.
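The encode, concatenate, decode, and reconstruct pipeline described above can be sketched numerically. This is only an illustration under stated assumptions: the encoders and decoder are replaced by random linear maps, the reconstruction penalty is taken to be L1 (the text only calls it "standard"), and every name in the snippet is hypothetical rather than taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, enc_weights):
    """Apply one linear stand-in encoder per attribute; returns the list of codes."""
    return [x @ w for w in enc_weights]

def decode(codes, w_dec):
    """Concatenate all codes along the feature axis and map back to input space."""
    return np.concatenate(codes, axis=1) @ w_dec

def reconstruction_loss(x, x_hat):
    """Mean absolute error between input and reconstruction (assumed L1 penalty)."""
    return np.mean(np.abs(x - x_hat))

# Two modeled attributes plus one residual encoder gives three encoders in total.
batch, dim, code_dim, num_encoders = 4, 12, 3, 3
x = rng.normal(size=(batch, dim))
enc_weights = [rng.normal(size=(dim, code_dim)) for _ in range(num_encoders)]
w_dec = rng.normal(size=(num_encoders * code_dim, dim))

codes = encode(x, enc_weights)
x_hat = decode(codes, w_dec)
loss_rec = reconstruction_loss(x, x_hat)
```

In a real model the linear maps would be convolutional networks trained to drive `loss_rec` down; the sketch only shows how the separate codes meet in a single decoder.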
Our aim is to define a transformation that is able to generate disentangled representations for specific attributes that act as sources of variation, while being invariant with respect to other variations present in our data. To this end, we introduce a method for training the encoders $E_m$ arising in our generative model (Eq. 1), where the resulting representations $z_m$ have discriminative power over attribute $m$, while yielding maximum entropy class posteriors for each of the other attributes. In effect, this prevents any content relating to attributes other than $m$ from arising in the resulting representations. To tackle this problem, we propose a composite loss function as discussed below.
We first employ a loss function that is reminiscent of a typical classification loss, to ensure that the obtained representations classify variation related to attribute $m$ well. This is done by feeding the representations directly into the fully connected layers of a classifier $C_m$, and minimizing the negative log-likelihood of the ground truth labels given an input sample.
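A negative log-likelihood classification loss of this kind can be written in a few lines. The sketch below assumes the classifier has already produced logits for a batch; the function names and the toy numbers are illustrative and not drawn from the paper's implementation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def classification_loss(logits, labels):
    """Mean negative log-likelihood of the ground-truth labels."""
    log_probs = log_softmax(logits)
    return -np.mean(log_probs[np.arange(len(labels)), labels])

# Toy logits for two samples and three classes.
logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 0.0, 4.0]])
loss_correct = classification_loss(logits, np.array([0, 2]))
loss_wrong = classification_loss(logits, np.array([1, 1]))
```

The loss is small when the logits favour the true class (`loss_correct`) and large otherwise (`loss_wrong`), which is the pressure that gives each representation discriminative power for its own attribute.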
Classification losses, as defined above, ensure that the learned transformations lead to representations that are enriched with information related to the particular ground-truth label, and have been employed in different forms in other works. However, as we demonstrate experimentally in Section 4.3, it is not reasonable to expect that the classification loss alone is sufficient to disentangle the latent representations from other sources of variation arising from the other attributes. Hence, to further encourage disentanglement, we impose an additional “disentanglement” loss on the conditional label distributions of the classifiers, given the corresponding representations. In more detail, we posit that the class posterior for each attribute, given the latent representations induced by the encoder for every other distinct attribute, should be a uniform distribution. We impose this soft constraint by minimizing the cross-entropy loss between a uniform distribution and the classifier class posteriors.
Here $m'$ iterates over all other attributes and $K_{m'}$ is the number of classes for attribute $m'$. In other words, we impose that each encoder $E_m$ must map to a representation that is correctly classified with respect to only the relevant attribute $m$, and that the representation is such that the conditional label distribution given its mapping, for every other attribute, has maximum entropy. Put differently, this loss function filters out information related to variation arising from other attributes. We note that the final loss function averages over all attributes. In particular, for $m = 0$ we ensure that the representations obtained via $E_0$ are invariant to all attribute variations, while capturing only variations that are not related to any of the attributes. This is an important distinction, as we cannot always assume that all image variation is related to the given attributes.
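The cross-entropy between a uniform target and a classifier posterior has a simple closed form: the negative mean of the log-probabilities over the classes. The following sketch, with hypothetical function names and toy logits, shows that this quantity is minimized (at the log of the number of classes) exactly when the posterior is uniform.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def disentanglement_loss(logits):
    """Cross-entropy between a uniform distribution and the class posterior.

    For K classes this is -(1/K) * sum_k log p_k, averaged over the batch.
    """
    return -np.mean(log_softmax(logits).mean(axis=1))

num_classes = 4
uniform_logits = np.zeros((2, num_classes))        # posterior is exactly uniform
peaked_logits = np.array([[10.0, 0.0, 0.0, 0.0]])  # posterior is confident

loss_uniform = disentanglement_loss(uniform_logits)
loss_peaked = disentanglement_loss(peaked_logits)
```

Applied to the logits that the expression classifier produces from an identity code, minimizing this term pushes the identity code to carry no usable expression information.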
The loss function described in Section 3.2 encourages the representations generated by the corresponding encoders to capture the variation induced by attribute $m$, while being invariant to variation arising from other sources. In this section, we provide a simple, effective method for ensuring that the derived representations are generalizable over unseen data (e.g. new identities in facial images), while at the same time yielding the expected semantics in the generated images. The method utilizes the classifiers’ distributions to learn generalizable representations without requiring the ground-truth pair for any combination of labels.
We assume a mini-batch of $B$ samples. During each forward pass, we randomly shuffle the representations for each attribute along the batch dimension, which when passed through the decoder provides a new synthesized sample $\tilde{x}$,

$$\tilde{x} = D\big(z_0^{(i_0)}, z_1^{(i_1)}, \dots, z_M^{(i_M)}\big), \tag{5}$$

where $i_0, \dots, i_M$ are random integers indexing the mini-batch data. In essence, this leads to a synthesized sample $\tilde{x}$ that takes each attribute from a possibly different sample in the batch. Since we know what values the ground-truth labels for the attributes should take in the synthesized sample $\tilde{x}$, we can enforce a classification loss on $\tilde{x}$ by minimizing the negative log-likelihood of the expected classes for each attribute.
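The shuffling step and the bookkeeping of expected labels can be illustrated as follows. This sketch assumes one independent permutation of the batch per attribute, which is one way to realize the random indexing described above; the code arrays are filled with their row index so that the origin of each code stays visible, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, code_dim, num_encoders = 5, 2, 3

# Hypothetical codes: row i of every array holds the value i, marking its source sample.
codes = [np.repeat(np.arange(batch, dtype=float)[:, None], code_dim, axis=1)
         for _ in range(num_encoders)]

# Labels for the two modeled attributes (index 0 is the unlabeled residual code).
labels = {1: np.array([0, 1, 2, 3, 4]),   # e.g. identity
          2: np.array([0, 0, 1, 1, 2])}   # e.g. expression

# Shuffle each attribute's codes independently along the batch dimension.
perms = [rng.permutation(batch) for _ in range(num_encoders)]
shuffled = [z[p] for z, p in zip(codes, perms)]

# The decoder would receive the depth-wise concatenation of the shuffled codes.
decoder_input = np.concatenate(shuffled, axis=1)

# The labels the synthesized samples are expected to carry follow the same permutations.
expected_labels = {m: labels[m][perms[m]] for m in labels}
```

Because `expected_labels` is derived purely from the permutations, a classification loss can be applied to the synthesized images without ever observing a real image that has that combination of attributes.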
We highlight that the above loss has an advantage over paired methods in that we do not require direct access to the corresponding target image.
In order to induce adversarial learning in the proposed model, and encourage generated images to match the data distribution, we further impose an adversarial loss in tandem with Eq. 6,
This ensures that even when representations are shuffled across data points, the synthesized sample will both (i) be classified according to the sample/embedding combination (Eq. 7), and (ii) constitute a realistic image.
The proposed method is trained end-to-end, using the full objective grouped by the set of variables being optimized: a combined loss for the discriminator, a loss for the classifiers, and a loss for the encoders and decoder. We control the relative importance of each loss term with a corresponding hyperparameter.
At train-time we sample mini-batches at random. At each iteration we shuffle the attributes’ encodings along the batch dimension before concatenating depth-wise and feeding into the decoder (i.e. Eq. 5), to train the network to flexibly pair any combination of values of the attributes.

Network Architecture. We define $M + 1$ encoder instances (one for each explicitly modeled attribute, and an additional encoder to capture the remaining sources of variation). Each encoder is a separate convolutional encoder based on the first half, up to the bottleneck, of an existing encoder-decoder architecture. Our decoder depth-concatenates all latent encodings and then upsamples to reconstruct the input image. We adopt a deeper PatchGAN variant for our discriminator. The classifiers are simple shallow CNNs, trained on the images in the training set to correctly classify the labels of attribute $m$, with a final dense layer that outputs the logits for the classes of the appropriate attribute.
In this section, we present a set of rigorous qualitative and quantitative experiments on multiple datasets to validate the proposed method, and verify that the derived representations are disentangled, generalizable, and continuous. We experiment on databases such as MultiPIE, BU-3DFE, and RaFD, and utilize the proposed method to learn disentangled and generalizable representations on various categorical attributes, including identity, expression, illumination, gaze, and color. We explore the properties of the proposed model in Section 4.3. In Section 4.4, we detail experiments related to expression synthesis in comparison to SOTA image-to-image translation models on the test set of each database. Subsequently, in Section 4.5 results on arbitrary multi-attribute transfer are presented, where the proposed representations can be seen to capture attribute variation well. Finally, in Section 4.6, we further demonstrate the generalizable nature of the latent representations for each attribute. By performing weighted combinations of the representations over multiple samples, we can generate novel unseen data, for example novel identities. In the same section, we also perform latent space interpolations to demonstrate the continuous properties of the representations. We note that for all experiments, we adopt the Wasserstein-GP GAN objective.
MultiPIE. The MultiPIE dataset consists of over 750,000 images, including a challenging range of variation. We use the forward-facing subset of MultiPIE, jointly modelling attributes ‘identity’, ‘expression’ and ‘illumination’. We use 686 images for each of the emotions ‘neutral’, ‘scream’, ‘squint’, and ‘surprise’ for our training set, and hold out the first 10 identities for the test set. We utilise the same dataset splits across all models to ensure a fair comparison.

BU-3DFE (BU). The BU dataset comprises 100 identities, and a wide range of age, gender, race, and expression intensities. We utilize the 2D frontal projections for our training set, reserving 10 identities for the test set.

RaFD. The RaFD dataset contains 67 individuals, each in 8 expressions, each with 3 gazes, and in 3 poses. We hold out the first 8 images for the test set, and use the remainder for training. We use only the front-facing poses.
StarGAN. We consider StarGAN to be a state-of-the-art method for unpaired image-to-image translation between multiple domains, and consequently benchmark our model’s performance against it.

CycleGAN. To tackle unpaired image-to-image translation between two domains, CycleGAN employs a cycle constraint to ensure the generator’s inverse function approximates an identity function. We benchmark against CycleGAN by training it pairwise between each expression and ‘neutral’.

Pix2Pix. Pix2Pix utilises a conditional GAN for learning the mapping between paired tuples from two domains.
In this section, we present a set of exploratory experiments that verify the properties of the proposed model. In more detail, we train our model on the MultiPIE database, and plot the conditional PMFs of classifiers for expression and identity in Fig. 3. As can be seen, the latent representations bear discriminative power for the desired attributes, whilst producing uniformly distributed class assignments for other attributes. This is not the case when we do not include the proposed loss, where variance from other attributes is contained. This is further evidenced by an ablation study, showing that without the disentanglement losses, identity components and ghosting artefacts from clothing are prone to mistakenly fall into the expression representations. Finally, in Fig. 4 we train our model on the MultiPIE database and visualize the recovered embeddings with and without the disentanglement loss. As can be seen, without the disentanglement loss, clusters with the same expression are split according to illumination, while our representations appear invariant to such variations and find the correct expression clusters.
In this section we present a set of expression synthesis results across several databases. Most GAN-based methods are able to synthesize facial images given a specific target expression (e.g., “neutral” to “smile”). Our method is able to capture the variability associated with each attribute (expression in this case), and we can therefore generate varying intensity images of the same expression. We can also obtain a representation equivalent to an expression label as used in other models by simply taking the expected value of the embeddings, and we can generate combinations of embeddings to obtain novel data. We compare our method against SOTA image-to-image translation models, applied on the test sets of BU, MultiPIE and RaFD in Fig. 5 (a), (b), and (c) respectively. As can be seen, the proposed method can generate sharp, realistic images of target expressions, and can capture expression intensity, as is particularly visible in Fig. 5 (a). Quantitative results on classification accuracy and FID are shown in Table 1. We note that since CycleGAN and Pix2Pix can only be trained between pairs of domains, a separate instance is trained for each pair of expressions. Our method outperforms compared techniques on nearly all databases and metrics.
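Taking the expected value of the embeddings to stand in for a label amounts to averaging the codes of all samples that share a class. A minimal sketch, assuming expression codes are stored as rows of an array with integer class labels (the data and names here are made up for illustration):

```python
import numpy as np

def class_mean_embeddings(z, labels):
    """Return a dict mapping each class label to the mean of its embeddings."""
    return {int(c): z[labels == c].mean(axis=0) for c in np.unique(labels)}

# Toy expression codes: two samples of class 0 and two of class 1.
z_expr = np.array([[0.0, 1.0],
                   [0.2, 0.8],
                   [1.0, 0.0],
                   [0.8, 0.2]])
expr_labels = np.array([0, 0, 1, 1])

mean_codes = class_mean_embeddings(z_expr, expr_labels)
```

Feeding `mean_codes[c]` to the decoder in place of a sample-specific expression code would then play the role that a discrete target label plays in label-conditioned models, while individual codes retain the per-sample intensity.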
In this section, we present experiments that involve arbitrary, intensity-preserving transfer of attributes. While most other methods require a target domain or a label, in our case we can simply swap the obtained representations arbitrarily from sample to sample, while also being able to perform operations on them. In more detail, in Fig. 6 we demonstrate that the proposed method is able to map to disentangled and generalizable attributes, by successfully transferring intrinsic facial attributes such as identity, expression, and gaze, as well as appearance based attributes such as illumination and image color. Note that this demonstrates the generality of our method, handling arbitrary sources of variation. Finally, we also show that it is entirely possible to transfer several attributes jointly, by using the corresponding representations. We show examples where we simultaneously transfer expression and illumination, as well as expression and gaze.
Our method produces generalizable representations that can be combined in many ways to synthesize novel samples. This is an important feature that can be used for tasks such as data augmentation, as well as for enhancing the robustness of classifiers for face recognition. In this section, we present experiments to demonstrate that the learned representations are both generalizable and continuous in the latent domain for all attributes. Consequently, latent contents can be manipulated accordingly in order to synthesize entirely novel imagery. Compared to dedicated methods for interpolation employing GANs, our method immediately allows for categorical attributes and also preserves intensity when interpolating between specific values. In Fig. 7 we decode the convex combination of 3 identity embeddings from distinct samples, and synthesize a realistic mixture of the given identities rendered as a new person. Finally, in Fig. 8, we show that we can smoothly interpolate the representations for attributes such as identity, illumination, and expression.
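Both operations used in these experiments, the convex combination of several codes and the linear interpolation between two codes, are straightforward to express. The sketch below uses hypothetical three-dimensional identity codes and illustrative function names; in practice the resulting vectors would be passed to the trained decoder together with the remaining attribute codes.

```python
import numpy as np

def convex_combination(codes, weights):
    """Weighted sum of codes with non-negative weights that sum to one."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return np.tensordot(weights, np.stack(codes), axes=1)

def interpolate(z_a, z_b, steps):
    """Evenly spaced points on the line segment from z_a to z_b (inclusive)."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - a) * z_a + a * z_b for a in alphas])

# Three toy identity codes.
z1 = np.array([1.0, 0.0, 0.0])
z2 = np.array([0.0, 1.0, 0.0])
z3 = np.array([0.0, 0.0, 1.0])

mixed_identity = convex_combination([z1, z2, z3], [0.5, 0.3, 0.2])
path = interpolate(z1, z2, steps=5)
```

Restricting the weights to a convex combination keeps the mixed code inside the region spanned by observed codes, which is a reasonable precaution when the decoder has only seen codes from that region.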
In this paper, we presented a novel method for learning disentangled and generalizable representations in an adversarial manner. The proposed method offers many benefits over other methods, while learning a meaningful latent structure that corresponds semantically to image characteristics. We provided experimental evidence to showcase some of the possibilities that arise with such a rich latent structure, including being able to interchange and manipulate latent contents to generate a large, rich gamut of new hybrid imagery.
Image-to-image translation with conditional adversarial networks. 2016.