PyTorch code for the paper: Toward a Controllable Disentanglement Network
This paper addresses two crucial problems of learning disentangled image representations, namely controlling the degree of disentanglement during image editing, and balancing the disentanglement strength and the reconstruction quality. To encourage disentanglement, we devise a distance covariance based decorrelation regularization. Further, for the reconstruction step, our model leverages a soft target representation combined with the latent image code. By exploring the real-valued space of the soft target representation, we are able to synthesize novel images with the designated properties. To improve the perceptual quality of images generated by autoencoder (AE)-based models, we extend the encoder-decoder architecture with the generative adversarial network (GAN) by collapsing the AE decoder and the GAN generator into one. We also design a classification based protocol to quantitatively evaluate the disentanglement strength of our model. Experimental results showcase the benefits of the proposed model.
One of the long-standing challenges in the machine learning community is to learn interpretable and robust representations of sensory data. Disentangling the hidden factors of variation offers a way to overcome this challenge [2, 39]. In a disentangled (or factorial) image representation, the generative factors of images correspond to independent subsets of the latent dimensions, such that changing a single factor causes a change in a single latent unit while leaving the others invariant. For example, a disentangled representation of face images could contain a set of latent units, each of which is sensitive to a specific facial attribute (i.e., generative factor) such as gender, age, or wearing eyeglasses [26, 10]. According to Lake et al., disentangled representations have the potential to boost the performance of state-of-the-art machine learning approaches in several situations, including transfer learning and zero-shot learning. Besides, it is claimed that such representations are more robust against adversarial attacks [1, 3], and are also beneficial for designing more robust multi-stage reinforcement learning agents. Other scenarios where disentangled representations could play a role, such as novelty detection and information compression, can be found in the work of Ridgeway.
There is substantial literature on learning disentangled image representations with deep neural networks. Most of these models feature an explicit or implicit regularization to induce non-correlation between hidden representations, and then implement image editing by tweaking the representation components of interest accordingly. Example models are FadNet, which resorts to an adversarial-like training strategy, and those based on the cross covariance regularization [5, 19]. Recently, several attempts have been made to investigate the disentanglement ability of the variational autoencoder (VAE) [22, 38]. To penalize the correlation between different dimensions in the representation, VAE-based models, such as β-VAE, DIP-VAE, JointVAE, and β-TCVAE, emphasize the importance of minimizing the Kullback–Leibler divergence between the well-designed approximate posterior distribution and the disentangled prior distribution (e.g., an isotropic unit Gaussian) over the latent variable. However, all of these VAE-based models suffer from blurring effects in their output images. By contrast, another set of works extending the generative adversarial network (GAN) is able to generate realistic-looking images when handling disentanglement-related tasks, including IcGAN for face image editing, DR-GAN for pose-invariant face recognition, StarGAN for image-to-image translation, and Soft-Gated Warping-GAN for pose-guided person image synthesis. The key point underlying these models is that the GAN generator learns to map the input image (or the representation of the input) to the target image as accurately as possible, provided that the label or the domain information is given. In that case, the adversarial training process implicitly makes the inferred representation uncorrelated with the label [36, 45], or makes the input image and the domain information interdependent for synthesizing target images [6, 8].
While the models mentioned above have shown promise in specific disentanglement tasks, they pay limited attention to simultaneously tackling two problems that are important but challenging when learning disentangled representations, namely controlling the degree of disentanglement and preserving image quality.
The first problem arises from controlling the degree of disentanglement during image editing, which means generating novel images with the designated attributes and, at the same time, with specific attribute intensities. For instance, given a face image, one may desire not only to synthesize a new smiling face, but also to synthesize a sequence of faces with expressions varying from no smile to a toothy smile. However, most existing disentanglement models [33, 27, 36, 9, 20, 7, 11, 6] focus on whether the model can generate images with or without the attributes of interest, rather than on controlling the attribute intensities in the synthesized images. In practice, a more subtle manipulation of images is often most useful. This property could enable several applications, such as automatic face image editing and image color rendering.
The second problem is how to generate images with the target attributes while preserving the core object identity and the image quality. Generally speaking, if the learned target representation is not sufficiently independent of the other parts, changing one generative factor could induce changes in other factors, even falsifying the object identity. This phenomenon has been observed in many existing works [5, 23, 4, 15, 32, 24, 3]. For instance, heightening the baldness intensity of a face has been shown to concurrently produce a visually perceptible change in the object identity. Additionally, many disentanglement models based on the autoencoder (AE) [5, 23, 15, 24, 10, 3] usually produce blurry output images, which degrades image quality.
In this paper, we present a simple yet effective model named Controllable Disentanglement Network, or CDNet, to address the aforementioned two problems. The overall network architecture of our model for training is illustrated in Fig. 1. Specifically, CDNet combines the AE with the GAN to construct a new deep neural network, which is further divided into four parts: two encoders, a decoder (Dec), and a discriminator (Dis). The two encoders learn the soft target representation $\hat{y}$ and the latent representation $z$, respectively. The representation $\hat{y}$ captures class- or attribute-related information by solving a supervised classification subtask, while the representation $z$ extracts information different from that in $\hat{y}$ via the proposed decorrelation regularization. The decoder of the AE (labeled Dec) reconstructs the input image given the two groups of representations $\hat{y}$ and $z$. By viewing the AE decoder as the GAN generator (labeled Gen), the reconstructed image is also treated as the fake image, with the purpose of fooling the discriminator (labeled Dis) so that it cannot distinguish the fake image from the real image. To achieve image editing, only the two well-trained encoders and the decoder are used, and the values of $\hat{y}$ are modified with regard to the designated classes or attributes.
In summary, we highlight our contributions as follows.
We use a soft target representation, together with the latent representation, to reconstruct original images and synthesize new ones. This soft target representation is a probability representation at training time, and the scale of each element implicitly indicates how much class or attribute information is included in the input image. Under this setting, one can decrease or increase specific element scales to modify the attribute intensities of the synthesized image at the testing phase (see Section III-D for details).
To improve the perceptual quality of the generated images, we extend the AE architecture with a GAN, where the GAN generator and the AE decoder are tied into a single network by parameter sharing and joint training. This model combination is inspired by the VAE/GAN, with the difference that we build the CDNet on the deterministic AE and introduce a parallel encoder to learn the soft target representation. The new integrated model shows improved ability to reconstruct images and learn disentangled representations (see Sections IV-B, IV-C, and IV-D for empirical comparisons).
To quantitatively compare the disentanglement strength of our model, an evaluation protocol is designed. To the best of our knowledge, this is the first work that leverages classification to analyze how the effect of representation scales with disentanglement performance (see Section IV-E for details).
We will provide a public PyTorch implementation of our model after the release of this paper.
Owing to their interpretability and robustness, disentangled representations have attracted increasing attention in recent years. Here, we divide the related disentanglement models into three prominent groups: AE-based models [5, 23, 15, 17, 26, 24], GAN-based models [4, 20, 7, 32, 6], and integrations of AEs and GANs [33, 27, 36, 9, 11]. In the following, we provide a detailed survey of this topic from these three perspectives.
By constraining the latent variables to be invariant to image attributes, the basic AE model can be extended to handle the disentanglement task. For instance, the FadNet learns a classifier to predict the attribute given the latent representation, while the latent representation inferred by the encoder tries to prevent the classifier from predicting the correct attribute values. This adversarial-like process enables the model, in an implicit manner, to learn a latent representation containing information different from the attributes. By contrast, Cheung et al. employ an explicit decorrelation regularization, based on the cross covariance (XCov), to approach the same goal. Our CDNet is also built on the basic AE, but with three notable differences from the aforementioned models. First, in contrast to the AE-based models that take advantage of the class or attribute label during image editing, the CDNet implements image editing by using the inferred soft target representation, and thus our model is applicable to scenarios where label information is unavailable. Second, compared with the XCov regularization, the distance covariance (dCov) we use for disentanglement encourages statistical independence rather than non-correlation between variables, leading to a stronger disentanglement ability (see Sections IV-E and IV-F for details). Third, the use of a GAN in our model also markedly improves the perceptual quality of the output images.
The vanilla VAE [22, 38] has been shown to learn disentangled representations, but with limited disentanglement ability on simple datasets such as FreyFaces or MNIST. Higgins et al. and Kumar et al. refine the VAE to learn controllable disentangled factors by putting implicit independence constraints on the approximate posterior over latent variables. Kulkarni et al. achieve disentanglement based on a special training scheme, where pairs of rendered images that differ only in one factor of variation are provided. Another VAE-like model utilizes the vector arithmetic technique to control attribute intensities. We note that these models can generally be trained stably; however, they are prone to producing blurry images. Besides, the unsupervised learning strategy adopted in many of these models cannot guarantee non-correlation between the learned representations, and thus the change of one attribute (e.g., smiling) may induce changes of other attributes (e.g., hairstyle or azimuth). In fact, Locatello et al. have theoretically shown that unsupervised disentanglement learning is fundamentally impossible without inductive biases on both models and datasets. This conclusion indicates that the role of supervision is crucial, which is consistent with the idea of using labeled training data in our model.
The plain GAN does not show any apparent disentanglement properties; nevertheless, subsequent works have enhanced GANs in this respect. Donahue et al. propose the semantically decomposed GANs that learn to decompose the latent code into an identity-related portion and an observation-related portion, thus modifying face images by varying the observation vector. Focusing on person image generation, Ma et al. use an adversarial network to learn mappings from Gaussian noise to the embedding feature space, which provides more control over the foreground, background, and pose information of the input image. In the InfoGAN, a subset of facial attributes is changed by manipulating the learned categorical codes, but with no conspicuous visual difference among generated images (such as the “Hair style” variation shown therein). By coupling two GANs together, the DiscoGAN leverages cross-domain relations to perform the facial attribute conversion task. The StarGAN, one of the state-of-the-art multi-domain image translation models, achieves facial attribute manipulation with a single generator and shows remarkable ability to synthesize high-quality images. However, this superiority of StarGAN is obtained only when the number of operated attribute domains is small, and thus it is less effective for balancing the disentanglement strength and the reconstruction quality across a large number of different attributes (e.g., all 40 facial attributes in CelebA face images). Moreover, it is worth noting that many GAN-based models still suffer from training instability and mode collapse, which also make disentangled representation learning more challenging.
A natural way to alleviate the aforementioned problems is to combine the AE with the GAN, thereby leveraging both models’ strengths in a complementary manner. To approach this goal, several existing works explore the adversarial training strategy in the latent space of AEs. The main point of these models is to make the GAN discriminator unable to distinguish 1) between the aggregated posterior of the latent variable and an arbitrary prior; or 2) between samples in latent space and encoded data (rather than prior samples); or 3) between joint samples of the data and the corresponding latent variable from the encoder and joint samples from the decoder.
The IcGAN and the VAE/GAN are two other works falling into this line of literature. In IcGAN, an encoder is added to the conditional GAN to learn a mapping from the image space to the representation space, thereby implementing image editing by changing the conditional information inferred from the real image. In VAE/GAN, the VAE decoder and the GAN generator are viewed as the same mapping by parameter sharing and joint training, and the GAN discriminator acts to measure sample similarity in the feature space. Although the network architecture is similar, the representation learning method of our CDNet differs substantially from these models. First, the attribute representations learned by IcGAN and VAE/GAN are still correlated with each other, degrading the ability to manipulate images subtly. By contrast, the decorrelation regularization helps the CDNet learn independent representations, which enables the model to control disentanglement at the testing phase. Second, both the pixel reconstruction error and the feature reconstruction error are used in CDNet, and thus the stability of model training and the perceptual quality of output images are both improved. We conducted a series of experiments to compare these two models with the proposed CDNet in Sections IV-B, IV-C, and IV-D.
We propose the CDNet, a novel model combining the AE with the GAN, that advances the state of the art towards jointly solving the problems of controlling image disentanglement and balancing image quality between disentanglement and reconstruction. The CDNet architecture is shown in Fig. 1, and it consists of four components: two encoders, a decoder (Dec), and a discriminator (Dis). Specifically, the first encoder aims to learn the soft target representation $\hat{y}$, extracting class or attribute information from the discrete label $y$. This goal is approached by training this encoder to solve a supervised classification task. The second encoder serves to learn the latent representation $z$ under two constraints: being informative enough to reconstruct the input image and being uncorrelated with (even independent of) $\hat{y}$. Here the independence between $\hat{y}$ and $z$ is induced by the proposed decorrelation regularization. The Dec takes as input the representations $\hat{y}$ and $z$ to reconstruct the input image. Because CDNet collapses the AE decoder and the GAN generator into one, the reconstructed image is also treated as the fake image generated by the Gen. With this setting, we train the Dis to distinguish the fake image $\tilde{x}$ from the real image $x$, and also use the middle-layer representations of the Dis to compute the feature reconstruction error.
In the following, we first formulate all local losses used to train different network components. Then we derive the integrated loss function and elaborate the associated training algorithm of our CDNet. After that, several practical considerations for implementation are provided. Finally, the method to manipulate images with controllable disentanglement is illustrated in two applications.
In addition to the aforementioned symbols, let $X$ denote the mini-batch version of the input image $x$ and $N$ the mini-batch size, and similarly define $\tilde{X}$, $Y$, $\hat{Y}$, and $Z$ for the reconstructed/fake image $\tilde{x}$, label $y$, soft target representation $\hat{y}$, and latent representation $z$, respectively.
Minimizing the classification loss enables the first encoder to inject the class or attribute information into the soft target representation $\hat{y}$. We select two classification loss functions to fit the following two application cases, respectively.
Case 1: For the multiclass scenario (e.g., a handwritten digit belongs to only one of the 10 classes), we first use the softmax nonlinearity to scale each element of $\hat{y}$. Then the cross entropy between the discrete labels $Y$ and the soft target representations $\hat{Y}$ is computed as the classification loss.
Case 2: For the multilabel scenario (e.g., face images with or without smiling, eyeglasses, and blond hair attributes), we first use the sigmoid nonlinearity to scale each element of $\hat{y}$. Then the binary cross entropy between $Y$ and $\hat{Y}$ is computed as the classification loss.
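The two cases can be illustrated with a minimal NumPy sketch. The function names and the choice to pass raw pre-activation scores (`logits`) are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def softmax_cross_entropy(logits, onehot):
    """Case 1 (multiclass): row-wise softmax, then cross entropy with one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(onehot * log_probs).sum(axis=1).mean())

def sigmoid_binary_cross_entropy(logits, targets):
    """Case 2 (multilabel): element-wise sigmoid, then binary cross entropy."""
    # log(sigmoid(v)) = -log(1 + exp(-v)); np.logaddexp evaluates this stably.
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    return float(-(targets * log_p + (1.0 - targets) * log_not_p).mean())
```

With all-zero scores the softmax output is uniform, so the multiclass loss equals the logarithm of the number of classes, which is a convenient sanity check.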
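We propose to leverage a distance covariance (dCov) based regularization to learn the latent representation $z$, which is expected to be independent of the soft target representation $\hat{y}$. Given a mini-batch of $N$ sample pairs $(\hat{y}^{(j)}, z^{(j)})$, $j = 1, \dots, N$, we first compute the $N \times N$ distance matrices $(a_{jk})$ and $(b_{jk})$ containing all pairwise distances:

$$a_{jk} = \lVert \hat{y}^{(j)} - \hat{y}^{(k)} \rVert_2, \qquad b_{jk} = \lVert z^{(j)} - z^{(k)} \rVert_2, \qquad j, k = 1, \dots, N,$$

where $\lVert \cdot \rVert_2$ is the $\ell_2$-norm. Then take all doubly centered distances:

$$A_{jk} = a_{jk} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}, \qquad B_{jk} = b_{jk} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot},$$

where $\bar{a}_{j\cdot}$ is the $j$th row mean, $\bar{a}_{\cdot k}$ is the $k$th column mean, and $\bar{a}_{\cdot\cdot}$ is the grand mean of the distance matrix $(a_{jk})$. The notation is similar for the $b$ values. Finally, the squared sample distance covariance, treated as our decorrelation loss, is simply the arithmetic average of the products $A_{jk} B_{jk}$:

$$\mathrm{dCov}_N^2(\hat{y}, z) = \frac{1}{N^2} \sum_{j=1}^{N} \sum_{k=1}^{N} A_{jk} B_{jk}. \tag{1}$$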
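The three steps above (pairwise distances, double centering, averaging the products) can be sketched in NumPy as follows. This is an illustrative sketch of the standard squared sample distance covariance, not the authors' code; the function names are hypothetical.

```python
import numpy as np

def pairwise_distances(batch):
    """N x N matrix of Euclidean distances between the rows of an (N, d) array."""
    diff = batch[:, None, :] - batch[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def double_center(dist):
    """Subtract row means and column means, then add back the grand mean."""
    row_mean = dist.mean(axis=1, keepdims=True)
    col_mean = dist.mean(axis=0, keepdims=True)
    return dist - row_mean - col_mean + dist.mean()

def squared_distance_covariance(y_hat, z):
    """Squared sample distance covariance of two mini-batches, shapes (N, d1) and (N, d2)."""
    a = double_center(pairwise_distances(y_hat))
    b = double_center(pairwise_distances(z))
    return float((a * b).mean())  # average of the N * N element-wise products
```

When this quantity is used as a differentiable loss in an automatic differentiation framework, a small constant under the square root may be needed, because the gradient of the square root is unbounded at zero distance.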
By minimizing the decorrelation loss in Eq. (1) toward zero, the soft target representation and the latent representation tend to become independent. This conclusion is supported by the following Theorem 1. Let $X$ and $Y$ denote two random vectors, $\mathrm{dCov}_n^2(X, Y)$ the squared sample distance covariance between $X$ and $Y$, and $n$ the number of pairwise sample points.
Theorem 1. Suppose two random vectors $X$ and $Y$ satisfy $\mathbb{E}\lVert X \rVert < \infty$ and $\mathbb{E}\lVert Y \rVert < \infty$. If $\lim_{n \to \infty} \mathrm{dCov}_n^2(X, Y) = 0$, then almost surely $X$ and $Y$ are independent.
The proof can be established by using Definition 3, Theorem 2, and Theorem 3 in the cited work on distance covariance. See the Appendix for completeness.
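By comparison, Cheung et al. use the following cross covariance (XCov) to facilitate disentanglement:

$$\mathrm{XCov}(\hat{y}, z) = \frac{1}{2} \sum_{i, j} \left[ \frac{1}{N} \sum_{n=1}^{N} \big( \hat{y}_i^{(n)} - \bar{y}_i \big) \big( z_j^{(n)} - \bar{z}_j \big) \right]^2, \tag{2}$$

where $\bar{y}_i = \frac{1}{N} \sum_{n=1}^{N} \hat{y}_i^{(n)}$ and $\bar{z}_j = \frac{1}{N} \sum_{n=1}^{N} z_j^{(n)}$. It is worth emphasizing that one of the most important differences between dCov and XCov is that minimizing dCov encourages independence between two random variables, while minimizing XCov encourages only non-correlation. In this regard, dCov should induce stronger disentanglement than XCov. Additionally, our model is also compatible with XCov, and replacing dCov with XCov in CDNet still achieves improved disentanglement performance over the model of Cheung et al. (see Sections IV-C and IV-D for empirical comparisons).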
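The difference between non-correlation and independence can be made concrete with a toy example that is not from the paper: take one variable to be the square of another on a symmetric grid. The second is fully determined by the first, yet their cross covariance vanishes. The sketch below assumes the standard definitions of both quantities; the helper names are hypothetical.

```python
import numpy as np

def xcov(y_hat, z):
    """Cross covariance penalty: half the sum of squared cross-covariance entries."""
    yc = y_hat - y_hat.mean(axis=0)
    zc = z - z.mean(axis=0)
    cross = yc.T @ zc / y_hat.shape[0]
    return 0.5 * float((cross ** 2).sum())

def dcov2(y_hat, z):
    """Squared sample distance covariance (average product of centered distances)."""
    def centered(batch):
        d = np.sqrt(((batch[:, None, :] - batch[None, :, :]) ** 2).sum(axis=-1))
        return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()
    return float((centered(y_hat) * centered(z)).mean())

# Toy data: z is a deterministic function of y, but the two are uncorrelated.
y = np.linspace(-1.0, 1.0, 201).reshape(-1, 1)
z = y ** 2
print("XCov :", xcov(y, z))    # numerically zero
print("dCov2:", dcov2(y, z))   # clearly positive
```

A regularizer based on XCov would see nothing to penalize here, whereas the distance covariance still registers the dependence.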
The reconstruction principle of CDNet is similar to that of the basic AE: the soft target representations $\hat{Y}$ and the latent representations $Z$ are first computed by the two encoders, respectively, and then fed to the decoder to reconstruct the original images:

$$\tilde{X} = \mathrm{Dec}(\hat{Y}, Z). \tag{3}$$

We use the reconstruction loss to measure the difference between the original images $X$ and the reconstructions $\tilde{X}$. A common choice is the mean squared error (MSE) computed in pixel space:

$$\mathcal{L}_{pix} = \frac{1}{N D} \sum_{n=1}^{N} \big\lVert x^{(n)} - \tilde{x}^{(n)} \big\rVert_2^2, \tag{4}$$

where $N$ is the mini-batch size and $D$ indicates the image dimensionality.
As noted earlier for AE-based models, a pixel-wise loss alone tends to yield blurry reconstructions. We conjecture that the lack of the meaningful spatial correlation properties of the original images causes this quality degradation. The hidden representation of a deep convolutional neural network, however, can extract such spatial correlation properties from the input image [27, 17]. Inspired by this observation, we compute the feature-matching difference as an additional reconstruction loss:

$$\mathcal{L}_{feat} = \frac{1}{N D_l} \sum_{n=1}^{N} \big\lVert \mathrm{Dis}_l(x^{(n)}) - \mathrm{Dis}_l(\tilde{x}^{(n)}) \big\rVert_2^2, \tag{5}$$

where $D_l$ is the dimensionality of the $l$th layer of the GAN discriminator and $\mathrm{Dis}_l(x)$ denotes the hidden representation of $x$ at that layer. It is worth noting that the perceptual loss [12, 18], computed by a pretrained high-performing CNN such as VGG, can also measure the high-level feature difference between two images. However, the perceptual loss is not widely applicable when the training data are not in image format (e.g., audio or text data), or when the training images are smaller than the input size accepted by the pretrained model. By comparison, the feature-matching difference loss (5) is customized and learned on the fly at each iteration step, which makes our model flexible and extensible for data disentanglement tasks.
The final reconstruction loss is a weighted combination of the two parts, the pixel-wise loss and the feature-matching loss, where a trade-off weight controls the balance between reconstructing global features and local details. By minimizing these two local reconstruction losses together, the CDNet is encouraged to restore identity-preserving images with high-level structures. In practice, we have observed that using the pixel-wise reconstruction loss also makes the adversarial training more stable.
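A minimal sketch of how such a two-part reconstruction loss might be combined is shown below. The function names, the `trade_off` argument, and in particular the choice to place the weight on the feature term are assumptions for illustration; the discriminator features are taken as precomputed arrays.

```python
import numpy as np

def mse(a, b):
    """Mean squared error, averaged over the mini-batch and over all dimensions."""
    return float(((a - b) ** 2).mean())

def reconstruction_loss(x, x_rec, feat_real, feat_rec, trade_off=1.0):
    """Pixel-wise MSE plus a weighted feature-matching MSE.

    feat_real and feat_rec stand for the hidden activations of one discriminator
    layer on the real and reconstructed images, computed elsewhere.
    Where the weight is applied is an assumption of this sketch.
    """
    pixel = mse(x, x_rec)
    feature = mse(feat_real, feat_rec)
    return pixel + trade_off * feature, pixel, feature
```

Returning the two parts alongside the total makes it easy to monitor them separately during training.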
The goal of incorporating a GAN into CDNet is to improve the perceptual quality of the output images. In the GAN part of CDNet, the generator Gen maps the inferred soft target representation $\hat{y}$ and the latent representation $z$ to image space, while the discriminator Dis estimates the probability that a sample belongs to the data distribution. The GAN is trained such that the Dis can tell real images apart from fake ones, while the Gen generates images that “fool” the Dis. To this end, we maximize/minimize the following adversarial loss

$$\mathcal{L}_{adv} = \frac{1}{N} \sum_{n=1}^{N} \Big[ \log \mathrm{Dis}\big(x^{(n)}\big) + \log \Big( 1 - \mathrm{Dis}\big( \mathrm{Gen}(\hat{y}^{(n)}, z^{(n)}) \big) \Big) \Big] \tag{7}$$

with respect to the Dis/Gen.
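For reference, the plain GAN objective can be evaluated from discriminator outputs as in the sketch below. It assumes the standard two-term value function averaged over the mini-batch; `d_real` and `d_fake` are hypothetical names for the discriminator's probabilities on real and generated images, and the clipping constant is an implementation convenience.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Plain GAN value: mean log Dis(x) + mean log(1 - Dis(Gen(...))).

    The discriminator is trained to increase this value and the
    generator to decrease it.
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)  # avoid log(0)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(np.log(d_real).mean() + np.log(1.0 - d_fake).mean())
```

A discriminator that outputs 0.5 everywhere gives a value of 2 log 0.5, and the value approaches zero as the discriminator separates real from fake images more confidently.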
Note that the exact choice of the GAN model (and so the adversarial loss) is not fundamental in our CDNet, since the plain GAN described here is adequate to improve the image perceptual quality. To obtain images with better visual fidelity, several advanced GAN models such as PatchGAN  and WGAN-GP  could be employed.
We train our combined model with an integrated loss, Eq. (8), that combines the local losses, where the exact expression of each local loss is presented above, and a regularization weight balances the quality of the reconstruction and the strength of the disentanglement. All of the local losses included in Eq. (8) are complementary to each other, enabling the CDNet to address the two crucial disentanglement problems mentioned in Section I.
We train each model component of CDNet with the associated local losses. More specifically, we first train the soft-target encoder to solve a supervised classification problem, which is implemented by minimizing the classification loss with respect to its parameters. After that, we fix this encoder and use it only to infer the soft target representation of the input. For the other encoder, the decorrelation loss and the reconstruction loss are minimized to update its parameters. The parameters of the decoder (and so, the generator) are updated by minimizing the reconstruction loss and the adversarial loss. The adversarial loss also involves the discriminator, and is used to learn the discriminator's parameters. For clarity, we summarize the training algorithm for CDNet in Algorithm 1. More details about the parameter setting can be found in Section IV-A3.
In addition to the aforementioned recipe for training the CDNet, we also adopt the following previously demonstrated techniques to stabilize the training process in practice.
Appending the soft target representation to each layer of the decoder: The soft target representation inferred by the encoder contains class- or attribute-related information. This discriminative information is utilized by the decoder in two different ways. For each of the fully-connected layers of the decoder, we concatenate the soft target representation and the hidden layer representation together as a whole input to the next layer. For all the convolutions of the decoder, we append the soft target representation as additional constant input channels.
Decorrelation loss scheduling: To prevent the decorrelation regularization from dominating the parameter updates of the latent encoder, which would destroy the model’s reconstruction ability, we use a variable weight for the regularization parameter. That is, we linearly increase the weight to a target value over the early training process, and then clamp it for the remaining training process. By doing so, the effect of the decorrelation regularization is gradually imposed on the learning of the latent representation.
Dropout: We use dropout in all fully-connected layers, except the final layer, of the two encoders and the discriminator. In our experiments, we found that dropout helps prevent the encoder and the discriminator Dis from overfitting, and also helps the latent encoder learn a latent representation that is as independent as possible of the soft target representation.
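Two of the techniques above, appending the soft target representation to decoder layers and ramping up the decorrelation weight, can be sketched as follows. The function names and tensor layouts (`(N, C, H, W)` for feature maps) are assumptions for illustration, as is the choice to start the ramp at zero, which the text does not state.

```python
import numpy as np

def append_to_fc(hidden, y_hat):
    """Concatenate (N, H) hidden activations with (N, K) soft targets -> (N, H + K)."""
    return np.concatenate([hidden, y_hat], axis=1)

def append_to_conv(feature_map, y_hat):
    """Append K constant channels to an (N, C, H, W) feature map -> (N, C + K, H, W)."""
    n, _, h, w = feature_map.shape
    # Each soft target value is broadcast over the full spatial grid.
    planes = np.broadcast_to(y_hat[:, :, None, None], (n, y_hat.shape[1], h, w))
    return np.concatenate([feature_map, planes], axis=1)

def decorrelation_weight(step, target, warmup_steps):
    """Ramp linearly from 0 (assumed start) to `target`, then hold it fixed."""
    if warmup_steps <= 0:
        return target
    return target * min(step / warmup_steps, 1.0)
```

For example, `decorrelation_weight(25_000, 1.0, 50_000)` is halfway up the ramp, and any step beyond `warmup_steps` returns the target value unchanged.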
For image editing, the key operation is to modify the value of the soft target representation accordingly. Based on the two application cases described in Section III-A1, we give two corresponding methods to manipulate images.
In Case 1, taking the handwritten digit as an example, we aim to generate a new digit with a handwriting style similar to that of the given digit. As shown in Fig. 2, we first employ the two encoders to infer the soft target representation and the latent representation of the given digit “1” in boldface. Then we modify the soft target representation by exchanging the third element (corresponding to the digit 2 class) and the maximum element (ideally corresponding to the digit 1 class), while keeping the remaining elements fixed. In this way, at most two elements of the soft target representation are exchanged, and thus its structure, with components summing to 1, is preserved. Finally, we feed the modified soft target representation and the unchanged latent representation to the decoder to generate the new digit “2”, which is also in boldface.
In Case 2, with the face image as an example, the goal is to synthesize a new face with the desired attribute and intensity while preserving the core identity. As we can see from Fig. 2, the overall procedure is similar to the first case, differing only in how the soft target representation is modified. Specifically, to generate a new face with eyeglasses, we simply replace the original (near) zero value corresponding to the “Eyeglasses” attribute with a new value (e.g., 3.5) in the soft target representation. Note that during image editing the modified attribute value is not necessarily restricted to $[0, 1]$, meaning it can also take other real values outside this interval (footnote 1: we empirically found that the extended interval is large enough for our model to generate new face images with various facial attributes). A small value (near 0 or less than 0) indicates that the synthesized image tends to exclude the target attribute, while a large value (near 1 or greater than 1) implies that the synthesized image tends to contain that attribute. In this regard, the continuous attribute value can be viewed as a sliding knob whose magnitude controls how perceivable a specific attribute is in the final image. By modifying attribute values in such an exaggerated way, the soft target representation is able to cover a wide range that the network was never trained on, and we still obtain meaningful generalization (see Section IV-D for empirical evidence).
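Both editing operations act only on the soft target representation and can be sketched in a few lines of NumPy. The helper names, the example probability vector, and the attribute index used below are hypothetical; feeding the edited vectors to the decoder is left out.

```python
import numpy as np

def swap_to_class(y_hat, target_class):
    """Case 1: exchange the largest entry of a probability vector with the target entry."""
    edited = np.array(y_hat, dtype=float)  # work on a copy
    top = int(edited.argmax())
    edited[[top, target_class]] = edited[[target_class, top]]
    return edited

def intensity_sweep(y_hat, attribute_index, values):
    """Case 2: one edited copy of `y_hat` per requested attribute value."""
    sweep = np.tile(np.asarray(y_hat, dtype=float), (len(values), 1))
    sweep[:, attribute_index] = values
    return sweep

# Case 1: an inferred vector for a digit "1", edited toward class 2 (third element).
digit = np.array([0.01, 0.90, 0.03, 0.01, 0.01, 0.01, 0.01, 0.01, 0.005, 0.005])
edited_digit = swap_to_class(digit, target_class=2)

# Case 2: sweep one attribute (index 15 is a hypothetical position) over several intensities.
face = np.zeros(40)
edited_faces = intensity_sweep(face, attribute_index=15, values=[0.0, 1.0, 2.0, 3.5])
```

Because the swap only permutes two entries, the edited vector still sums to 1, while the sweep deliberately leaves the unit interval to act as the intensity knob described above.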
We conduct five groups of experiments to substantiate the benefits of our CDNet model. First, we show that the combination of AE and GAN is able to improve the quality of reconstructed images. Second, we verify that the proposed decorrelation regularization and the image manipulation methods are able to disentangle factors of variation. Third, by synthesizing images with various attributes and attribute intensities, we qualitatively illustrate the CDNet’s ability to control the degree of disentanglement. Fourth, we propose a classification based protocol to quantitatively compare the disentanglement strength of the CDNet. Finally, we perform an ablation study to investigate the effectiveness of different loss terms. Additional results are provided in the supplementary material. Before presenting our experimental results, we introduce the experimental setup in detail.
We use two representative datasets for the two application cases described in Section III-A1. The first one is MNIST, which contains 70,000 grayscale handwritten digit images of $28 \times 28$ pixels each, with rescaled pixel intensities. We randomly split the dataset into 50,000 training, 10,000 validation, and 10,000 test samples. The discrete label takes the one-hot vector form. The second dataset is CelebA, which consists of 202,599 RGB face images of celebrities. For pre-processing, all face images are first center-cropped, then downsampled and rescaled in intensity. We use the training, validation, and test partition adopted in several earlier works. Additionally, the discrete label is represented by a binary vector of dimensionality 40, where each dimension corresponds to one attribute, with value 1 indicating the presence of that attribute and 0 its absence.
We evaluate our CDNet against three baselines that have network architectures similar to ours. All models are listed as follows.
AE-XCov: a pure AE-based model that leverages the cross covariance (XCov) to learn uncorrelated middle-layer representations.
IcGAN: a model that adds an encoder to the GAN to implement the inference mechanism, and achieves disentanglement by modifying the discrete labels inferred by the encoder.
VAE/GAN: a model that ties the VAE decoder and the GAN generator together by parameter sharing and joint training, and uses the GAN discriminator to measure sample similarity in the feature space.
CDNet-XCov (ours): an instantiation of our CDNet model, where the XCov is used as the decorrelation loss.
CDNet-dCov (ours): another instantiation of the CDNet model, which utilizes the distance covariance (dCov)-based regularization proposed in this paper to learn independent representations.
The AE-XCov serves to illustrate the blurring effects produced by pure AE-based models. The IcGAN and VAE/GAN are selected to confirm that the decorrelation regularization employed in our model helps learn disentangled representations, and that integrating both the pixel-wise and the feature-wise reconstruction errors helps improve the reconstruction quality. The CDNet-dCov is compared with the CDNet-XCov to empirically verify that, under the same settings, the dCov regularization helps models generate more easily perceptible attributes than the XCov.
For the CDNet, the two encoders share the same architecture, which consists of convolution layers followed by fully-connected layers. The main part of the decoder is symmetric to the encoder, but uses deconvolution (a.k.a. transposed convolution or fractionally strided convolution) for up-sampling. The AE-XCov is built in a similar way to the CDNet, except that we append the soft target representation to each layer of the CDNet decoder to strengthen the influence of class or attribute information on the output images. The architecture details can be found in the supplementary material.
We select the hyperparameters according to the validation-set performance. Specifically, for both datasets, takes the value of 1 to balance the two reconstruction errors. The in Algorithm 1, which weights the reconstruction ability of the decoder versus fooling the discriminator, is set to 0.01. For the MNIST dataset, is linearly increased to 1 over the first 50,000 iterations, while for the CelebA dataset, is gradually increased until it reaches 0.05 across the first 50,000 iterations. The CDNet models are trained with the RMSProp optimizer, where we set the learning rate and a batch size of 100 for MNIST, and the learning rate and a batch size of 128 for CelebA. With a single NVIDIA GeForce GTX 1080 GPU, training CDNet-XCov takes about 1.33 hours on MNIST and 11.17 hours on CelebA; for CDNet-dCov, the training time is about 1.34 hours on MNIST and 11.23 hours on CelebA.
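The ramp-up of the loss weight over the first 50,000 iterations can be sketched as a simple schedule. This is a minimal sketch: the function name `linear_warmup` is hypothetical, and starting the ramp from 0 is an assumption, since the text only states the final values (1 for MNIST, 0.05 for CelebA) and the ramp length.

```python
def linear_warmup(iteration, final_value, warmup_iterations=50_000):
    """Weight that grows linearly from 0 to final_value, then stays constant."""
    if iteration <= 0:
        return 0.0
    fraction = min(iteration / warmup_iterations, 1.0)
    return final_value * fraction

# Final values taken from the text; the starting value of 0 is assumed.
mnist_weight_midway = linear_warmup(25_000, final_value=1.0)
celeba_weight_after = linear_warmup(80_000, final_value=0.05)
```

Calling the schedule once per training iteration and multiplying the corresponding loss term by the returned value is enough to reproduce this kind of ramp.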
We first visualize the reconstruction results to compare the models qualitatively. As shown in Fig. 3 (a), when reconstructing handwritten digit images from MNIST, all models perform consistently well in terms of perceptual quality. However, the performance gap widens when reconstructing face images from CelebA. As Fig. 3 (b) shows, reconstructions of the AE-XCov retain the main structure of the input face image but lose details due to the blurring effect. The IcGAN, on the contrary, recovers texture features well but visibly alters the core object identity in the reconstructed images. Although the VAE/GAN strikes a balance between reconstruction accuracy and visual fidelity, it cannot recover a few specific local features, such as hair texture and eyeglasses. By contrast, our two CDNet models reconstruct images with higher visual fidelity. The results demonstrate that, as a combination of AE and GAN, the CDNet enjoys two advantages for reconstruction tasks: preserving the core object identity while recovering local detail features.
To quantitatively evaluate the reconstruction ability of the proposed model, we use three well-known image quality assessment indexes, namely root-mean-square error (RMSE), peak signal-to-noise ratio (PSNR), and multi-scale structural similarity (SSIM), to assess the quality of the reconstructed images. As can be seen from Table I, the two instantiations of the CDNet significantly outperform the other models across all three evaluation metrics. We conclude that in the CDNet, integrating the pixel-wise reconstruction error (from AE) and the feature-wise reconstruction error (from GAN) provides an effective way to improve the reconstruction quality.
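Two of the three metrics are simple enough to write down directly. The sketch below computes RMSE and PSNR with NumPy using their standard definitions. The `data_range` default of 1.0 assumes images scaled to the unit interval, which is an assumption rather than something stated here. The structural similarity index is omitted because it needs a windowed implementation.

```python
import numpy as np

def rmse(reference, reconstruction):
    """Root-mean-square error between two images of the same shape."""
    diff = (np.asarray(reference, dtype=np.float64)
            - np.asarray(reconstruction, dtype=np.float64))
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(reference, reconstruction, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the maximum pixel span."""
    error = rmse(reference, reconstruction)
    if error == 0.0:
        return float("inf")  # identical images
    return float(20.0 * np.log10(data_range / error))

# Toy check: a uniform offset of 0.1 on a unit-range image.
image = np.zeros((4, 4))
shifted = np.full((4, 4), 0.1)
toy_rmse = rmse(image, shifted)
toy_psnr = psnr(image, shifted)
```

Lower RMSE and higher PSNR both indicate a reconstruction closer to the input in the pixel-wise sense, so neither captures perceptual sharpness on its own.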
In this group of experiments, we first use all models to generate new handwritten digits with designated handwriting styles, such as boldface, italic, and broad shape. The image manipulation method is described as Case 1 in Section III-D. As shown in Fig. 4 (a), the AE-XCov exhibits limited disentanglement ability, for example synthesizing new digits that lean slightly to the left (corresponding to the test digit “2”). The disentanglement performance of the VAE/GAN is not stable, as many synthesized digits do not reflect their category (see Fig. 4 (c)). We attribute this drawback to the strong correlations that remain in VAE/GAN between the learned representations of different digit classes. Both CDNets, as well as the IcGAN, are able to generate new digits with the same style as the originals, demonstrating the disentanglement of style from class.
Second, we aim to synthesize new faces with specific facial attributes while preserving the core identity. The manipulation method is described as Case 2 in Section III-D. The disentanglement performance of the different models on CelebA face images is illustrated in Fig. 5. We observe that the AE-XCov can produce new faces with the desired attributes, but the blurring effects clearly degrade the visual quality of the output images. For the IcGAN model, although the target attributes are clearly present in the resulting images, the core object identities are changed, as in the reconstruction task. The synthetic results of VAE/GAN turn out to be highly sensitive to attribute modifications. In particular, when using the VAE/GAN to add the “Eyeglasses” or “Mustache” attribute to the original female face, the hairstyle and even the gender of the subject are visibly altered. When it comes to multi-attribute manipulation, all three baseline models fail to mix different attributes at once, since the multi-attribute changes cause visually perceptible image distortion. By contrast, the two CDNets exhibit a remarkable ability to disentangle facial attributes from identity; that is, all designated attributes are incorporated into the final images in a more natural manner.
It is worth noting that on the multi-attribute manipulation task, especially for male face image editing, the CDNet-dCov visually outperforms the CDNet-XCov in preserving the core identity information (e.g., see the deformations of the mouth area shown in the last three columns of Fig. 5). We give a quantitative comparison between them in Section IV-E.
In this experiment, we qualitatively compare the disentanglement strength of all baselines and the CDNet by synthesizing face images with various attribute intensities. The image manipulation method is similar to Case 2 in Section III-D; here we take a list of values and apply them in order to modify the corresponding attribute representations. The attribute value range is set to for AE-XCov, for IcGAN, for VAE/GAN, and for CDNet. We conduct this group of experiments on six representative facial attributes, i.e., “Brown hair”, “Pale skin”, “Eyeglasses”, “Smiling”, “Mustache”, and “w/o Eyeglasses”. From Fig. 6, we observe that the blurring effect persists across all synthetic images generated by AE-XCov. The IcGAN performs well at synthesizing a set of new faces with the target attributes and attribute intensities; however, none of the resulting images preserves the core object identity very well. For the VAE/GAN, changing one attribute usually deforms other attributes. For example, modifying the “Pale skin” attribute also deforms the hairstyle. Our two CDNet models, by contrast, are able to generate distinguishable face fantasies across all compared attributes and variation degrees while preserving the core identity. These results illustrate that the learning strategy of representations, together with the image manipulation method, enables CDNet to control the degree of disentanglement during image editing.
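The intensity sweep described above amounts to overwriting one entry of the attribute representation with a list of ordered values while leaving the rest unchanged. The sketch below shows only that step, assuming NumPy; the function name `sweep_attribute`, the attribute index, and the value range are hypothetical placeholders, and the resulting rows would then be passed, together with the latent image code, to the decoder.

```python
import numpy as np

def sweep_attribute(attribute_vector, index, low, high, steps):
    """Return `steps` copies of attribute_vector in which entry `index` takes
    evenly spaced values from low to high; all other entries are unchanged."""
    base = np.asarray(attribute_vector, dtype=np.float32)
    values = np.linspace(low, high, steps, dtype=np.float32)
    edited = np.tile(base, (steps, 1))
    edited[:, index] = values
    return edited

# Hypothetical example: sweep entry 15 of a 40-dimensional attribute vector.
original = np.zeros(40, dtype=np.float32)
sweep = sweep_attribute(original, index=15, low=-1.0, high=3.0, steps=5)
```

Each row of `sweep` corresponds to one attribute intensity, so decoding the rows in order yields a sequence of images in which only the selected attribute should vary.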
To further analyze the difference between XCov and dCov for disentanglement, we design an evaluation protocol to quantitatively compare the disentanglement strength of the CDNet-XCov and the CDNet-dCov. The evaluation procedure consists of the following four steps.
First, divide the training set into two subsets: the first subset consists of images with the designated attribute, the second one not.
Second, train a two-class classifier on the two subsets.
Third, in the test set, select all images that do not contain the designated attribute, then feed them to the disentanglement model to generate their counterparts with the designated attribute and intensity.
Finally, employ the classifier trained in the second step to classify those images synthesized in the third step, which produces a classification error rate as the evaluation index.
The evaluation protocol rests on the hypothesis that the classifier is well trained, so that a lower error rate means the classifier perceives the designated attribute in the synthesized images more easily. For each attribute, we train a linear SVM as the two-class classifier to perform the attribute classification task. Each attribute is assigned ten different intensities, arranged from the lowest level to the highest, and the two disentanglement models synthesize images with the designated attribute at each of these intensities. The main evaluation results are illustrated in Fig. 7.
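The four steps of the protocol can be sketched as a small function that returns one error rate per attribute intensity. This is a minimal sketch assuming NumPy, and all names are hypothetical. To keep it dependency-free, the classifier is a nearest-class-mean stand-in rather than the linear SVM used in the text, the "images" are toy feature vectors, and `edit_fn` replaces the disentanglement model with a simple shift.

```python
import numpy as np

def attribute_error_rates(fit_classifier, train_with, train_without,
                          test_without, edit_fn, intensities):
    """Classification error rate on edited test samples, one value per intensity."""
    # Steps 1-2: label the two training subsets and fit a two-class classifier.
    features = np.concatenate([train_with, train_without])
    labels = np.concatenate([np.ones(len(train_with)), np.zeros(len(train_without))])
    predict = fit_classifier(features, labels)
    error_rates = []
    for intensity in intensities:
        edited = edit_fn(test_without, intensity)        # step 3: add the attribute
        predicted = predict(edited)                      # step 4: classify the result
        error_rates.append(float(np.mean(predicted != 1)))
    return error_rates

def fit_nearest_mean(features, labels):
    """Stand-in classifier (not a linear SVM): assign to the closer class mean."""
    mean_pos = features[labels == 1].mean(axis=0)
    mean_neg = features[labels == 0].mean(axis=0)
    def predict(x):
        d_pos = np.linalg.norm(x - mean_pos, axis=1)
        d_neg = np.linalg.norm(x - mean_neg, axis=1)
        return (d_pos < d_neg).astype(int)
    return predict

# Toy data: samples "with" the attribute are shifted by 2 in every feature.
rng = np.random.default_rng(0)
train_with = rng.normal(loc=2.0, size=(200, 5))
train_without = rng.normal(loc=0.0, size=(200, 5))
test_without = rng.normal(loc=0.0, size=(100, 5))
rates = attribute_error_rates(fit_nearest_mean, train_with, train_without,
                              test_without, lambda x, s: x + s,
                              intensities=[0.0, 1.0, 2.0, 3.0])
```

In this toy setting the error rate falls as the intensity grows, which mirrors the reading of the protocol: a lower error rate means the designated attribute is easier for the classifier to perceive.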
As shown in Fig. 7, for each facial attribute, the classification error rates consistently decrease as the attribute intensity increases. This result implies that with higher attribute intensities, the two CDNets can synthesize images with more distinct attributes. In addition, under the same network architecture and parameter settings, the classification performance of CDNet-dCov is comparable and even superior to that of CDNet-XCov, especially when strengthening the influence of the decorrelation regularization on learning latent representations (i.e., setting ). We attribute this performance gap to the strong disentanglement ability of dCov, as minimizing dCov induces independence between two random variables, rather than the mere non-correlation approximated by minimizing XCov. For this reason, in the CDNet-dCov, modifications of the target attributes have less effect on the core identity information contained in the latent representation. This property enables the CDNet-dCov to synthesize images with more concrete and distinguishable attributes than the CDNet-XCov.
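The difference between non-correlation and independence can be seen on a toy example. The sketch below, assuming NumPy, computes the squared sample distance covariance from double-centered pairwise distance matrices (the standard sample estimator) and compares it with the Pearson correlation for a variable and its square. The function names and the toy data are hypothetical and are not part of the CDNet implementation.

```python
import numpy as np

def _centered_distances(samples):
    """Pairwise Euclidean distance matrix with row, column, and grand means removed."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.ndim == 1:
        samples = samples[:, None]
    diffs = samples[:, None, :] - samples[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    return (dist - dist.mean(axis=0, keepdims=True)
            - dist.mean(axis=1, keepdims=True) + dist.mean())

def sample_dcov_squared(x, y):
    """Squared sample distance covariance: the mean of A_kl * B_kl."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    return float((a * b).mean())

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
dependent = x ** 2                          # determined by x, yet nearly uncorrelated with it
independent = rng.uniform(-1.0, 1.0, size=500)

pearson = float(np.corrcoef(x, dependent)[0, 1])
dcov_dependent = sample_dcov_squared(x, dependent)
dcov_independent = sample_dcov_squared(x, independent)
```

Here the Pearson correlation between `x` and `x ** 2` is close to zero, so a covariance-based penalty would see little to remove, while the distance covariance stays clearly above its value for the independent pair.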
To verify the impact of the different loss terms of the proposed model, we conduct an ablation study on the reconstruction task and the disentanglement task, respectively. From the reconstruction results in Fig. 8(a), we observe that images generated by the model M1 are quite blurry and capture only the coarse shape of faces, whereas images produced by the model M2 show more texture features but suffer from content deformation in some regions (e.g., hair and mouth). The results demonstrate that minimizing the pixel-wise reconstruction loss encourages the model to preserve global identity information, while minimizing the feature-wise reconstruction loss helps restore local detail features. Consequently, models trained with both reconstruction losses generate more plausible face images, as indicated by the reconstruction results of the three M3 variants.
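The interplay of the two reconstruction terms can be illustrated with a toy sketch, assuming NumPy. The function name, the use of mean squared error for both terms, and the horizontal-gradient feature map are all assumptions made for illustration; in the model itself the feature-wise term comes from the GAN side rather than from a hand-written feature extractor.

```python
import numpy as np

def combined_reconstruction_loss(x, x_rec, feature_fn, feature_weight=1.0):
    """Pixel-wise MSE plus a weighted feature-wise MSE; returns (total, pixel, feature)."""
    pixel_loss = float(np.mean((x - x_rec) ** 2))
    feature_loss = float(np.mean((feature_fn(x) - feature_fn(x_rec)) ** 2))
    return pixel_loss + feature_weight * feature_loss, pixel_loss, feature_loss

def horizontal_gradients(images):
    """Hypothetical stand-in feature map: differences between neighboring columns."""
    return np.diff(images, axis=-1)

rng = np.random.default_rng(0)
x = rng.random((2, 8, 8))
x_brighter = x + 0.1   # same local structure, different global intensity

total, pixel, feature = combined_reconstruction_loss(x, x_brighter, horizontal_gradients)
```

In this toy case the pixel term penalizes the global intensity offset while the feature term is essentially zero, since the local structure is unchanged; the two terms respond to different kinds of error, which is why combining them helps.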
For the disentanglement comparison shown in Fig. 8(b), the plain model M3 (trained without any decorrelation term) is limited in its ability to synthesize target facial attributes that are consistent with the context. For example, changing one attribute, such as “Smiling” or “Mustache”, also causes considerable image quality degradation near the nose area. Moreover, the plain M3 cannot preserve the hairstyle during image editing, as can be seen in the synthesized faces corresponding to the attributes “Brown hair”, “Black hair”, “Chubby”, and so on. By contrast, when the same neural network is trained with the additional decorrelation regularization (i.e., the two regularized M3 variants), the designated attributes are blended into new faces properly and with less effect on other attributes. The results suggest that the decorrelation term indeed helps models learn uncorrelated representations during training, thus approaching controllable image editing at test time. Furthermore, we find that the synthesized images of the dCov-regularized variant are more visually realistic and coherent than those of the XCov-regularized variant. This difference in performance is observed especially when comparing the hairstyle of subjects under the “Black hair” and “Chubby” attributes, or the mouth area of subjects under the “Brown hair” attribute. Our conclusion is that, compared with the XCov, the dCov is more helpful for models to separate the information of interest from other portions, resulting in the ability to synthesize pleasant images in a more easily controllable manner.
In this paper, we proposed a simple yet effective model that aims to address two disentanglement-related problems: controlling the disentanglement degrees at image editing time, and balancing the disentanglement strength and the reconstruction quality. A method of combining AE with GAN was designed to improve the visual quality of reconstructed and synthetic images. Besides, a distance covariance based decorrelation regularization was devised to encourage disentanglement, and the soft target representation was explored to control how much a specific attribute is perceivable in the generated image. In addition, we also developed a classification protocol to quantitatively evaluate the disentanglement strength of our model. Experimental results demonstrate that our model is able to generate new digits with various handwriting styles, and also to synthesize novel faces with the desired attributes and the attribute intensities.
The supervised disentanglement learning, as well as the decorrelation regularization used in this work, enables the model to learn target representations effectively. However, acquiring a large amount of labeled training data is usually costly and time consuming. To alleviate this problem, we can consider extending the current model to the semi-/weakly-supervised learning scenario. One possible way to approach this goal is, as discussed in , devising probabilistic models for inductive and transductive semi-supervised learning, which can be further implemented by using the approximate Bayesian inference method. Moreover, exploring the concrete benefits of disentangled representations for downstream tasks is another promising direction in this research field.
We leverage the theoretical results in  to prove Theorem 1. Overall, we use properties of the distance correlation to connect the sample distance covariance (dCov) to the independence between two random variables. The following one definition and two lemmas correspond to the Definition 3, Theorem 2, and Theorem 3 in , respectively. One can find the complete proofs to the two lemmas therein.
If and , then almost surely
If , then , and if and only if and are independent.
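For reference, the quantities involved in these statements can be written out in the standard notation of the distance correlation literature. The symbols below ($\mathcal{V}_n$, $\mathcal{V}$, $\mathcal{R}$, $A_{kl}$, $B_{kl}$) are assumed here and may differ from the notation used elsewhere in this paper. Given samples $(x_k, y_k)$, $k = 1, \dots, n$:

```latex
% Double-centered pairwise distances (B_{kl} is built from y in the same way)
a_{kl} = \lVert x_k - x_l \rVert_2, \qquad
A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}

% Sample distance covariance
\mathcal{V}_n^2(X, Y) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl}

% Consistency: if E\lVert X \rVert < \infty and E\lVert Y \rVert < \infty, then almost surely
\lim_{n \to \infty} \mathcal{V}_n(X, Y) = \mathcal{V}(X, Y)

% Distance correlation (defined as 0 when \mathcal{V}^2(X)\,\mathcal{V}^2(Y) = 0)
\mathcal{R}^2(X, Y) = \frac{\mathcal{V}^2(X, Y)}{\sqrt{\mathcal{V}^2(X)\,\mathcal{V}^2(Y)}}

% Characterization of independence: if E(\lVert X \rVert + \lVert Y \rVert) < \infty, then
0 \le \mathcal{R}(X, Y) \le 1, \qquad
\mathcal{R}(X, Y) = 0 \iff X \text{ and } Y \text{ are independent}
```

Read together, the consistency statement ties the sample quantity minimized during training to its population counterpart, and the characterization ties a vanishing population value to independence.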
According to the given condition that and , Lemma 1 holds. Then we have
Based on Definition 1, we get
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.
IEEE Winter Conference on Applications of Computer Vision (WACV).
Perceptual losses for real-time style transfer and super-resolution. In ECCV.
Stochastic backpropagation and approximate inference in deep generative models. In ICML.