Guided Variational Autoencoder for Disentanglement Learning

04/02/2020 ∙ by Zheng Ding, et al. ∙ 9

We propose an algorithm, guided variational autoencoder (Guided-VAE), that is able to learn a controllable generative model by performing latent representation disentanglement learning. The learning objective is achieved by providing signals to the latent encoding/embedding in VAE without changing its main backbone architecture, hence retaining the desirable properties of the VAE. We design an unsupervised strategy and a supervised strategy in Guided-VAE and observe enhanced modeling and controlling capability over the vanilla VAE. In the unsupervised strategy, we guide the VAE learning by introducing a lightweight decoder that learns latent geometric transformation and principal components; in the supervised strategy, we use an adversarial excitation and inhibition mechanism to encourage the disentanglement of the latent variables. Guided-VAE enjoys its transparency and simplicity for the general representation learning task, as well as disentanglement learning. On a number of experiments for representation learning, improved synthesis/sampling, better disentanglement for classification, and reduced classification errors in meta-learning have been observed.




1 Introduction

The resurgence of autoencoders (AE) [34, 6, 21] is an important component in the rapid development of modern deep learning [17]. Autoencoders have been widely adopted for modeling signals and images [46, 50]. Their statistical counterpart, the variational autoencoder (VAE) [29], has led to a recent wave of development in generative modeling due to its two-in-one capability: both representation learning and statistical learning in a single framework. Another rapidly growing direction in generative modeling is generative adversarial networks (GAN) [18], but GANs focus on the generation process and are not aimed at representation learning (they lack an encoder, at least in the vanilla version).

Compared with classical dimensionality reduction methods like principal component analysis (PCA) [22, 27] and Laplacian eigenmaps [4], VAEs have demonstrated their unprecedented power in modeling high dimensional data of real-world complexity. However, there is still considerable room for VAEs to improve before they achieve high-quality reconstruction/synthesis. Additionally, it is desirable to make the VAE representation learning more transparent, interpretable, and controllable.

In this paper, we attempt to learn a transparent representation by introducing guidance to the latent variables in a VAE. We design two strategies for our Guided-VAE, an unsupervised version (Fig. 1.a) and a supervised version (Fig. 1.b). The main motivation behind Guided-VAE is to encourage the latent representation to be semantically interpretable, while maintaining the integrity of the basic VAE architecture. Guided-VAE is learned in a multi-task learning fashion. The objective is achieved by taking advantage of the modeling flexibility and the large solution space of the VAE under a lightweight target. Thus the two tasks, learning a good VAE and making the latent variables controllable, become companions rather than conflicts.

In unsupervised Guided-VAE, in addition to the standard VAE backbone, we also explicitly force the latent variables to go through a lightweight decoder that learns a deformable PCA. As seen in Fig. 1.a, two decoders exist, both trying to reconstruct the input data $x$: the main decoder, denoted as $Dec_{main}$, functions regularly as in the standard VAE [29]; the secondary decoder, denoted as $Dec_{sub}$, explicitly learns a geometric deformation together with a linear subspace. In supervised Guided-VAE, we introduce a subtask for the VAE by forcing one latent variable to be discriminative (minimizing the classification error) while making the rest of the latent variables adversarially discriminative (maximizing the minimal classification error). This subtask is achieved using an adversarial excitation and inhibition formulation. Similar to the unsupervised Guided-VAE, the training process is carried out in an end-to-end multi-task learning manner. The result is a regular generative model that keeps the original VAE properties intact, while making the specified latent variable semantically meaningful and capable of controlling/synthesizing a specific attribute. We apply Guided-VAE to data modeling and few-shot learning problems and show favorable results on the MNIST, CelebA, CIFAR10 and Omniglot datasets.

The contributions of our work can be summarized as follows:

  • We propose a new generative model disentanglement learning method by introducing latent variable guidance to variational autoencoders (VAE). Both unsupervised and supervised versions of Guided-VAE have been developed.

  • In unsupervised Guided-VAE, we introduce deformable PCA as a subtask to guide the general VAE learning process, making the latent variables interpretable and controllable.

  • In supervised Guided-VAE, we use an adversarial excitation and inhibition mechanism to encourage the disentanglement, informativeness, and controllability of the latent variables.

Guided-VAE can be trained in an end-to-end fashion. It is able to keep the attractive properties of the VAE while significantly improving the controllability of the vanilla VAE. It is applicable to a range of problems for generative modeling and representation learning.

2 Related Work

Related work can be discussed along several directions.

Generative model families such as generative adversarial networks (GAN) [18, 2] and variational autoencoder (VAE) [29] have received a tremendous amount of attention lately. Although GAN produces higher quality synthesis than VAE, GAN is missing the encoder part and hence is not directly suited for representation learning. Here, we focus on disentanglement learning by making VAE more controllable and transparent.

Disentanglement learning [41, 48, 23, 1, 16, 26] has recently become a popular topic in representation learning. Adversarial training has been adopted in approaches such as [41, 48]. Various methods [44, 28, 37] have imposed constraints/regularizations/supervisions on the latent variables, but these existing approaches often involve an architectural change to the VAE backbone, and their additional components are not provided as a secondary decoder for guiding the main encoder. A closely related work is the β-VAE [20] approach, in which a balancing term β is introduced to control the capacity and the independence prior. β-TCVAE [8] further extends β-VAE by introducing a total correlation term.

From a different angle, the principal component analysis (PCA) family [22, 27, 7] can also be viewed as representation learning. Connections between robust PCA [7] and VAE [29] have been observed [10]. Although it is a widely adopted method, PCA has limited modeling capability due to its linear subspace assumption. To alleviate the strong requirement that the input data be pre-aligned, RASL [45] deals with unaligned data by estimating a hidden transformation for each input. Here, we take advantage of the transparency of PCA and the modeling power of VAE by developing a sub-decoder (see Fig. 1.a), deformable PCA, that guides the VAE training process in an integrated end-to-end manner. After training, the sub-decoder can be removed, keeping the main VAE backbone only.

To achieve disentanglement learning in supervised Guided-VAE, we encourage one latent variable to directly correspond to an attribute while making the rest of the variables uncorrelated. This is analogous to the excitation-inhibition mechanism [43, 53] or the explaining-away [52] phenomena. Existing approaches [38, 37] impose supervision as a conditional model for an image translation task, whereas our supervised Guided-VAE model targets the generic generative modeling task by using an adversarial excitation and inhibition formulation. This is achieved by minimizing the discriminative loss for the desired latent variable while maximizing the minimal classification error for the rest of the variables. Our formulation has a connection to the domain-adversarial neural networks (DANN) [15], but the two methods differ in purpose and classification formulation. Supervised Guided-VAE is also related to the adversarial autoencoder approach [40], but the two methods differ in the objective, formulation, network structure, and task domain. In [24], the domain invariant variational autoencoders method (DIVA) differs from ours by enforcing disjoint sectors to explain certain attributes.

Our model also has connections to the deeply-supervised nets (DSN) [36], where intermediate supervision is added to a standard CNN classifier. There are also approaches [14, 5] in which latent variable constraints are added, but they have different formulations and objectives than Guided-VAE. Recent efforts in fairness disentanglement learning [9, 47] also bear some similarity, but there is still a large difference in formulation.

3 Guided-VAE Model

In this section, we present the main formulations of our Guided-VAE models. The unsupervised Guided-VAE version is presented first, followed by introduction of the supervised version.

(a) Unsupervised Guided-VAE (b) Supervised Guided-VAE
Figure 1: Model architecture for the proposed Guided-VAE algorithms.

3.1 VAE

Following the standard definition of the variational autoencoder (VAE) [29], a set of input data is denoted as $X = (x_1, \ldots, x_n)$, where $n$ denotes the total number of input samples. The latent variables are denoted by the vector $z$. The encoder network includes network and variational parameters $\phi$ that produce the variational probability model $q_\phi(z|x)$. The decoder network is parameterized by $\theta$ to reconstruct the sample $\tilde{x} = f_\theta(z)$. The log-likelihood $\log p(x)$ estimation is achieved by maximizing the Evidence Lower BOund (ELBO) [29]:

$$ELBO(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z)). \quad (1)$$

The first term in Eq. (1) corresponds to a reconstruction loss (it is the negative of the reconstruction loss $\|x - \tilde{x}\|^2$ between the input $x$ and the reconstruction $\tilde{x}$) under Gaussian parameterization of the output. The second term in Eq. (1) refers to the KL divergence between the variational distribution $q_\phi(z|x)$ and the prior distribution $p(z)$. The training process thus tries to optimize:

$$\max_{\theta, \phi} \sum_{i=1}^{n} ELBO(\theta, \phi; x_i). \quad (2)$$
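As a rough illustration of the two ELBO terms, the sketch below evaluates a single-sample estimate for a diagonal Gaussian posterior and a standard normal prior. The function names, the unit-variance Gaussian decoder, and the toy numbers are assumptions made for this example and are not taken from the paper.

```python
import math


def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def elbo_one_sample(x, x_recon, mu, log_var):
    """Single-sample ELBO estimate, assuming a unit-variance Gaussian decoder.

    The reconstruction term is log N(x | x_recon, I) with its additive
    constant dropped, i.e. -0.5 * ||x - x_recon||^2.
    """
    recon_log_lik = -0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon_log_lik - gaussian_kl(mu, log_var)


x = [0.0, 1.0, 0.5]
# Perfect reconstruction with a posterior equal to the prior: both terms vanish.
perfect = elbo_one_sample(x, x, mu=[0.0, 0.0], log_var=[0.0, 0.0])
# Imperfect reconstruction and a shifted posterior: both terms lower the ELBO.
worse = elbo_one_sample(x, [0.1, 0.8, 0.5], mu=[0.5, -0.5], log_var=[0.0, 0.0])
print(perfect, worse)
```

Maximizing this quantity trades reconstruction quality against keeping the posterior close to the prior, which is the balance that the guidance terms in the following sections are added on top of.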


3.2 Unsupervised Guided-VAE

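In our unsupervised Guided-VAE, we introduce a deformable PCA as a secondary decoder to guide the VAE training. An illustration can be seen in Fig. 1.a. This secondary decoder is called $Dec_{sub}$. Without loss of generality, we let $z = (z_{def}, z_{cont})$. $z_{def}$ decides a deformation/transformation field, e.g. an affine transformation denoted as $\tau(z_{def})$. $z_{cont}$ determines the content of a sample image for transformation. The PCA model consists of $K$ basis vectors $B = (b_1, \ldots, b_K)$. We define a deformable PCA loss as:

$$L_{DPCA}(\phi, B) = \sum_{i=1}^{n} \mathbb{E}_{q_\phi(z_{def}, z_{cont}|x_i)} \left[ \|x_i - \tau(z_{def}) \circ (z_{cont} B^T)\|^2 \right] + \sum_{k, j \neq k} (b_k^T b_j)^2, \quad (3)$$

where $\tau(\cdot)$ defines a transformation (affine in our experiments) operator decided by $z_{def}$, and $\sum_{k, j \neq k} (b_k^T b_j)^2$ is regarded as the orthogonal loss. A normalization term can optionally be added to force the basis vectors to be unit vectors. We follow the spirit of the PCA optimization; a general formulation for learning PCA can be found in [7].

To keep the method simple we learn a fixed basis $B$, although one can also adopt a probabilistic PCA model [49]. Thus, learning unsupervised Guided-VAE becomes:

$$\max_{\theta, \phi, B} \left\{ \sum_{i=1}^{n} ELBO(\theta, \phi; x_i) - L_{DPCA}(\phi, B) \right\}. \quad (4)$$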
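The two ingredients of the deformable PCA loss, a linear reconstruction from a basis and a penalty on non-orthogonal basis vectors, can be sketched as follows. This is a minimal illustration for one sample with the expectation over the encoder omitted; the helper names are hypothetical, and the deformation is passed in as a plain function (identity by default) rather than the affine warp used in the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonal_loss(basis):
    """Sum over ordered pairs k != j of (b_k^T b_j)^2."""
    total = 0.0
    for k, b_k in enumerate(basis):
        for j, b_j in enumerate(basis):
            if j != k:
                total += dot(b_k, b_j) ** 2
    return total


def pca_decode(z_cont, basis):
    """Linear reconstruction: sum_k z_cont[k] * b_k."""
    dim = len(basis[0])
    return [sum(z * b[d] for z, b in zip(z_cont, basis)) for d in range(dim)]


def dpca_loss(x, z_cont, basis, deform=lambda img: img):
    """Squared error after deformation plus the orthogonal penalty.

    `deform` is a stand-in for the transformation decided by z_def.
    """
    recon = deform(pca_decode(z_cont, basis))
    sq_err = sum((a - b) ** 2 for a, b in zip(x, recon))
    return sq_err + orthogonal_loss(basis)


orthonormal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
skewed = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
x = [2.0, -1.0, 0.0]
print(dpca_loss(x, [2.0, -1.0], orthonormal))
print(orthogonal_loss(skewed))
```

With an orthonormal basis that spans the sample, both the reconstruction error and the penalty are zero; the skewed basis is penalized because its two vectors overlap.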


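The affine matrix described in our transformation follows a standard implementation and takes the $2 \times 3$ form:

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}. \quad (5)$$

The affine transformation includes translation, scale, rotation and shear operations. We use different latent variables to compute different parameters of the affine matrix, according to the operations we need.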
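One way to picture how separate scalars can fill a 2x3 affine matrix is sketched below. The parameter names, the composition order (rotation, then shear along x, then uniform scale), and the idea of feeding one latent variable per parameter are assumptions for illustration; the paper does not spell out its exact parameterization here.

```python
import math


def affine_matrix(angle=0.0, scale=1.0, shear=0.0, tx=0.0, ty=0.0):
    """Build a 2x3 matrix [scale * R(angle) @ Shear | t].

    R is a rotation and Shear = [[1, shear], [0, 1]]; (tx, ty) is the
    translation column.
    """
    c, s = math.cos(angle), math.sin(angle)
    a11 = scale * c
    a12 = scale * (c * shear - s)
    a21 = scale * s
    a22 = scale * (s * shear + c)
    return [[a11, a12, tx], [a21, a22, ty]]


def apply_affine(A, point):
    """Map a 2D point (x, y) through the 2x3 affine matrix A."""
    x, y = point
    return (A[0][0] * x + A[0][1] * y + A[0][2],
            A[1][0] * x + A[1][1] * y + A[1][2])


# Hypothetical usage: one scalar drives rotation, another drives scale.
rotated = apply_affine(affine_matrix(angle=math.pi / 2), (1.0, 0.0))
scaled_shifted = apply_affine(affine_matrix(scale=2.0, tx=1.0, ty=1.0), (1.0, 0.0))
sheared = apply_affine(affine_matrix(shear=1.0), (0.0, 1.0))
print(rotated, scaled_shifted, sheared)
```

Keeping each geometric operation tied to its own scalar is what allows a single latent variable to be read as "rotation" or "scale" during traversal.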

Figure 2: Latent Variables Traversal on MNIST: Comparison of traversal results from (a) vanilla VAE [29], (b) β-VAE [20], (c) β-VAE with controlled capacity increase (CC-VAE), (d) JointVAE [12] and (e) our Guided-VAE on the MNIST dataset. $z_1$ and $z_2$ in Guided-VAE are controlled.

3.3 Supervised Guided-VAE

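For training data $X = (x_1, \ldots, x_n)$, suppose there exists a total of $T$ attributes with ground-truth labels. Let $z = (z_t, z_t^{rst})$, where $z_t$ defines a scalar variable deciding the $t$-th attribute and $z_t^{rst}$ represents the remaining latent variables. Let $y_t(x_i)$ be the ground-truth label for the $t$-th attribute of sample $x_i$; $y_t(x_i) \in \{-1, +1\}$. For each attribute, we use an adversarial excitation and inhibition method with the term:

$$L_{Excitation}(\phi, t) = \max_{w_t} \left\{ \sum_{i=1}^{n} \mathbb{E}_{q_\phi(z_t|x_i)} \left[ \log p_{w_t}(y = y_t(x_i) \mid z_t) \right] \right\}, \quad (6)$$

where $w_t$ refers to the classifier making a prediction for the $t$-th attribute using the latent variable $z_t$.

This is an excitation process since we want the latent variable $z_t$ to directly correspond to the attribute label.

Next is an inhibition term:

$$L_{Inhibition}(\phi, t) = \max_{C_t} \left\{ \sum_{i=1}^{n} \mathbb{E}_{q_\phi(z_t^{rst}|x_i)} \left[ \log p_{C_t}(y = y_t(x_i) \mid z_t^{rst}) \right] \right\}, \quad (7)$$

where $C_t(z_t^{rst})$ refers to the classifier making a prediction for the $t$-th attribute using the remaining latent variables $z_t^{rst}$. $\log p_{C_t}(y = y_t(x_i) \mid z_t^{rst})$ is a cross-entropy term for minimizing the classification error in Eq. (7). This is an inhibition process since we want the remaining variables $z_t^{rst}$ to be as independent as possible of the attribute label in Eq. (8) below:

$$\max_{\theta, \phi} \left\{ \sum_{i=1}^{n} ELBO(\theta, \phi; x_i) + \sum_{t=1}^{T} \left[ L_{Excitation}(\phi, t) - L_{Inhibition}(\phi, t) \right] \right\}. \quad (8)$$

Notice in Eq. (8) the minus sign in front of the term $L_{Inhibition}(\phi, t)$ for maximization, which is an adversarial term that makes $z_t^{rst}$ as uninformative about attribute $t$ as possible by pushing the best possible classifier $C_t$ to be the least discriminative. The formulation of Eq. (8) bears certain similarity to that in domain-adversarial neural networks [15], in which the label classification is minimized with the domain classifier being adversarially maximized. Here, however, we respectively encourage and discourage different parts of the features to make the same type of classification.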
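The sign structure of the excitation and inhibition terms can be illustrated with two small logistic classifiers evaluated at fixed parameters. This sketch assumes labels in {-1, +1} and omits the inner maximization over the classifiers and the expectation over the encoder; all names and numbers are hypothetical.

```python
import math


def log_sigmoid(a):
    """Numerically stable log(1 / (1 + exp(-a)))."""
    return -math.log1p(math.exp(-a)) if a >= 0 else a - math.log1p(math.exp(a))


def log_lik(features, labels, weights, bias):
    """Sum_i log p(y_i | f_i) for a logistic classifier with labels in {-1, +1}."""
    total = 0.0
    for f, y in zip(features, labels):
        logit = sum(w * v for w, v in zip(weights, f)) + bias
        total += log_sigmoid(y * logit)
    return total


def guidance_term(z_t, z_rst, labels, w_t, c_t):
    """Excitation minus inhibition for one attribute, at fixed classifiers.

    w_t and c_t are (weights, bias) pairs for the classifier on the guided
    scalar z_t and on the remaining latent variables z_rst.
    """
    excitation = log_lik([[z] for z in z_t], labels, *w_t)
    inhibition = log_lik(z_rst, labels, *c_t)
    return excitation - inhibition


labels = [+1, -1, +1, -1]
z_t = [2.0, -2.0, 1.5, -1.5]                      # guided variable tracks the label
clean_rst = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1], [0.2, 0.0]]
leaky_rst = [[2.0, 0.0], [-2.0, 0.0], [1.5, 0.0], [-1.5, 0.0]]

# With uninformative remaining variables the best classifier is at chance.
term_clean = guidance_term(z_t, clean_rst, labels, ([3.0], 0.0), ([0.0, 0.0], 0.0))
# If the remaining variables leak the label, a classifier can exploit them.
term_leaky = guidance_term(z_t, leaky_rst, labels, ([3.0], 0.0), ([3.0, 0.0], 0.0))
print(term_clean, term_leaky)
```

The term is larger when the remaining variables carry no label information, which is the behavior the minus sign on the inhibition term rewards.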

4 Experiments

In this section, we first present qualitative and quantitative results demonstrating that our proposed unsupervised Guided-VAE (Figure 1a) disentangles the latent embedding more favorably than previous disentanglement methods [20, 12, 28] on the MNIST dataset [35] and a 2D shape dataset [42]. We also show that our learned latent representation improves classification performance in a representation learning setting. Next, we extend this idea to a supervised guidance approach in an adversarial excitation and inhibition fashion, where a discriminative objective for certain image properties is given (Figure 1b), on the CelebA dataset [39]. Further, we show that our method is architecture agnostic and applicable in a variety of scenarios, such as an image interpolation task on the CIFAR10 dataset [31] and a few-shot classification task on the Omniglot dataset [33].

4.1 Unsupervised Guided-VAE

4.1.1 Qualitative Evaluation

We first present qualitative results on the MNIST dataset by traversing the latent variables that received the affine-transformation guiding signal (Figure 2). Here, we apply Guided-VAE with a bottleneck size of 10 (i.e. the latent variables $z \in \mathbb{R}^{10}$). The first latent variable $z_1$ represents the rotation information, and the second latent variable $z_2$ represents the scaling information. The rest of the latent variables $z_{3:10}$ represent the content information. Thus, we write the latent variables as $z = (z_{def}, z_{cont}) = (z_{1:2}, z_{3:10})$.

We compare traversal results of all latent variables on the MNIST dataset for vanilla VAE [29], β-VAE [20], JointVAE [12] and our Guided-VAE (the β-VAE and JointVAE results are adopted from [12]). While β-VAE cannot generate meaningful disentangled representations for this dataset, even with controlled capacity increase, JointVAE can disentangle class type from continuous factors. Our Guided-VAE disentangles the geometry properties, rotation angle at $z_1$ and stroke thickness at $z_2$, from the remaining content information $z_{3:10}$.

To assess the disentangling ability of Guided-VAE against various baselines, we create a synthetic 2D shape dataset following [42, 20], a common way to measure the disentanglement properties of unsupervised disentangling methods. The dataset consists of 737,280 images of 2D shapes (heart, oval and square) generated from four ground-truth independent latent factors: $x$-position information (32 values), $y$-position information (32 values), scale (6 values) and rotation (40 values). This lets us compare the disentangling performance of different methods against known ground-truth factors. We present the latent space traversal results in Figure 3, where the results of β-VAE and FactorVAE are taken from [28]. Our Guided-VAE learns the four geometry factors with the first four latent variables, where the latent variables are $z = (z_{def}, z_{cont})$ with $z_{def} = z_{1:4}$. We observe that although all models are able to capture basic geometry factors, the traversal results from Guided-VAE are cleaner, with fewer non-target factors changing alongside the target one.
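The dataset size follows directly from the factor counts: three shapes times every combination of the four factors. The short check below reproduces the figure and shows one possible way to map a flat image index back to its factor values; the factor ordering used for indexing is an assumption for illustration.

```python
# Number of values per factor: shape, x-position, y-position, scale, rotation.
FACTOR_NAMES = ["shape", "x_position", "y_position", "scale", "rotation"]
FACTOR_SIZES = [3, 32, 32, 6, 40]

num_images = 1
for size in FACTOR_SIZES:
    num_images *= size


def index_to_factors(index):
    """Decode a flat index into one value per factor (last factor varies fastest)."""
    values = []
    for size in reversed(FACTOR_SIZES):
        index, value = divmod(index, size)
        values.append(value)
    return tuple(reversed(values))


print(num_images)
print(dict(zip(FACTOR_NAMES, index_to_factors(num_images - 1))))
```

Because every image has known factor values, a metric can check whether changing one latent variable moves exactly one of these factors.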

Figure 3: Comparison of qualitative results on 2D shape for β-VAE, FactorVAE, VAE and Guided-VAE (Ours). First row: originals. Second row: reconstructions. Remaining rows: reconstructions of latent traversals across each latent dimension. In our results, $z_1$ represents the $x$-position information, $z_2$ represents the $y$-position information, $z_3$ represents the scale information and $z_4$ represents the rotation information.

Figure 4: Comparison of Traversal Result learned on CelebA: Column 1 (Gender) shows traversed images from male to female. Column 2 (Smile) shows traversed images from smiling to no-smiling. The first row is from [20] and we follow its figure generation procedure.

4.1.2 Quantitative Evaluation

We perform two quantitative experiments with strong baselines for disentanglement and representation learning in Tables 1 and 2. We observe significant improvement over existing methods in terms of disentanglement, measured by the Z-Diff score [20], SAP score [32] and Factor score [28] in Table 1, and in representation transferability, based on classification error in Table 2.

Model                          Z-Diff ↑   SAP ↑    Factor ↑
VAE [29]                       78.2       0.1696   0.4074
β-VAE (β=2) [20]               98.1       0.1772   0.5786
FactorVAE (γ=5) [28]           92.4       0.1770   0.6134
FactorVAE (γ=35) [28]          98.4       0.2717   0.7100
β-TCVAE (α=1, β=5, γ=1) [8]    96.8       0.4287   0.6968
Guided-VAE (Ours)              99.2       0.4320   0.6660
Guided-β-TCVAE (Ours)          96.3       0.4477   0.7294
Table 1: Disentanglement: Z-Diff score, SAP score, and Factor score over unsupervised disentanglement methods on 2D Shapes dataset. [↑ means higher is better]

All models are trained in the same setting as the experiment shown in Figure 3, and are assessed by the three disentanglement metrics shown in Table 1. An improvement in the Z-Diff score and Factor score represents a lower variance of the inferred latent variable for fixed generative factors, whereas our increased SAP score corresponds to a tighter coupling between a single latent dimension and a generative factor. Compared with previous methods, our method is orthogonal (due to using a side objective) to most existing approaches.
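The "tighter coupling" reading of the SAP score can be made concrete: for each generative factor, take the gap between the best and second-best single-latent predictability scores, then average over factors. The sketch below assumes the per-latent scores have already been computed (for example as regression or classification scores) and only performs the final aggregation; the score matrices are made up for illustration.

```python
def sap_score(score_matrix):
    """Mean gap between the top two latent scores for each factor.

    score_matrix[i][j] is how well latent i alone predicts factor j.
    """
    num_factors = len(score_matrix[0])
    gaps = []
    for j in range(num_factors):
        column = sorted((row[j] for row in score_matrix), reverse=True)
        gaps.append(column[0] - column[1])
    return sum(gaps) / num_factors


# Each factor is captured mostly by one latent: large gaps.
disentangled = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]
# Two latents predict every factor equally well: no gap.
entangled = [[0.5, 0.5], [0.5, 0.5]]
print(sap_score(disentangled), sap_score(entangled))
```

A high value therefore requires that no second latent variable predicts the factor nearly as well as the best one.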

β-TCVAE [8] improves on β-VAE [20] by using a weighted mini-batch scheme for stochastic training. Our Guided-β-TCVAE further improves on β-TCVAE in the SAP and Factor scores (Table 1).

Model                          d_z = 16 ↓      d_z = 32 ↓      d_z = 64 ↓
VAE [29]                       2.92% ± 0.12    3.05% ± 0.42    2.98% ± 0.14
β-VAE (β=2) [20]               4.69% ± 0.18    5.26% ± 0.22    5.40% ± 0.33
FactorVAE (γ=5) [28]           6.07% ± 0.05    6.18% ± 0.20    6.35% ± 0.48
β-TCVAE (α=1, β=5, γ=1) [8]    1.62% ± 0.07    1.24% ± 0.05    1.32% ± 0.09
Guided-VAE (Ours)              1.85% ± 0.08    1.60% ± 0.08    1.49% ± 0.06
Guided-β-TCVAE (Ours)          1.47% ± 0.12    1.10% ± 0.03    1.31% ± 0.06
Table 2: Representation Learning: Classification error over unsupervised disentanglement methods on MNIST. [↓ means lower is better]

The 95% confidence intervals from 5 trials are reported.
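A mean and 95% confidence half-width over five trials can be computed as sketched below. The use of a Student-t interval is an assumption (the paper does not state how its intervals were obtained), and the error rates are made-up placeholders rather than results from the paper.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (5 trials).
T_CRIT_95_DF4 = 2.776


def mean_and_ci95(errors):
    """Return (mean, half-width of the 95% confidence interval) for five trials."""
    if len(errors) != 5:
        raise ValueError("the critical value above is only valid for 5 trials")
    mean = statistics.mean(errors)
    standard_error = statistics.stdev(errors) / math.sqrt(len(errors))
    return mean, T_CRIT_95_DF4 * standard_error


trial_errors = [1.40, 1.52, 1.45, 1.50, 1.48]  # hypothetical error rates in percent
mean, half_width = mean_and_ci95(trial_errors)
print(f"{mean:.2f}% ± {half_width:.2f}")
```

Reading the table entries this way, two methods whose intervals overlap heavily should not be treated as clearly separated.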

(a) Bald (b) Bangs (c) Black Hair
(d) Mouth Slightly Open (e) Receding Hairlines (f) Young

Figure 5: Latent factors learned by Guided-VAE on CelebA: Each image shows the traversal results of Guided-VAE on a single latent variable which is controlled by the lightweight decoder using the corresponding labels as signal.

We further study representation transferability by performing classification tasks on the latent embedding of different generative models. Specifically, for each data point ($x$, $y$), we use the pre-trained generative models to obtain the value of the latent variable $z$ given the input image $x$. Here $z$ is a $d_z$-dim vector. We then train a linear classifier $f(\cdot)$ on the embedding-label pairs $\{(z, y)\}$ to predict the class of digits. For the Guided-VAE, we disentangle the latent variables $z$ into deformation variables $z_{def}$ and content variables $z_{cont}$ with the same dimensions (i.e. $d_{z_{def}} = d_{z_{cont}}$). We compare the classification errors of different models with multiple choices of dimensions of the latent variables in Table 2. In general, VAE [29], β-VAE [20], and FactorVAE [28] do not benefit from the increase of the latent dimensions, and β-TCVAE [8] shows evidence that its discovered representation is more useful for the classification task than existing methods. Our Guided-VAE achieves competitive results compared with β-TCVAE, and our Guided-β-TCVAE can further reduce the classification error to as low as 1.10% (Table 2), which is lower than the baseline VAE.

Moreover, we study the effectiveness of $z_{def}$ and $z_{cont}$ in Guided-VAE separately to reveal the different properties of the latent subspace. We follow the same classification task procedures described above but use different subsets of latent variables as input features for the classifier $f(\cdot)$. Specifically, we compare results based on the deformation variables $z_{def}$, the content variables $z_{cont}$, and the whole latent variables $z$ as the input feature vector. To conduct a fair comparison, we still keep the same dimensions for the deformation variables $z_{def}$ and the content variables $z_{cont}$. Table 3 shows that the classification errors on $z_{cont}$ are significantly lower than the ones on $z_{def}$, which indicates the success of disentanglement, as the content variables should determine the class of digits. In contrast, the deformation variables should be invariant to the class. Besides, when the dimensions of the latent variables are higher, the classification errors on $z_{def}$ increase while the ones on $z_{cont}$ decrease, indicating a better disentanglement between deformation and content with increased latent dimensions.

Model        dim(z_def)   dim(z_cont)   dim(z)   z_def Error ↑   z_cont Error ↓   z Error ↓
Guided-VAE   8            8             16       27.1%           3.69%            2.17%
             16           16            32       42.07%          1.79%            1.51%
             32           32            64       62.94%          1.55%            1.42%

Table 3: Classification on MNIST using different latent variables as features: Classification error over Guided-VAE with different dimensions of latent variables [↑ means higher is better, ↓ means lower is better]

4.2 Supervised Guided-VAE

4.2.1 Qualitative Evaluation

We first present qualitative results on the CelebA dataset [39] by traversing latent variables of attributes shown in Figure 4 and Figure 5. In Figure 4, we compare the traversal results of Guided-VAE with β-VAE for two labeled attributes (gender, smile) in the CelebA dataset. The bottleneck size is set to 16 ($d_z = 16$). We use the first two latent variables $z_1$ and $z_2$ to represent the attribute information, and the rest $z_{3:16}$ to represent the content information. During evaluation, we vary $z_t$ while keeping the remaining latent variables $z_t^{rst}$ fixed. We then obtain a set of images by traversing the $t$-th attribute (e.g., smiling to non-smiling) and compare them with those of β-VAE. In Figure 5, we present traversal results on another six attributes.
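The traversal procedure amounts to sweeping one coordinate of a latent code over a grid while every other coordinate stays put, then decoding each code. A minimal sketch of the code-generation step follows; the range, the number of steps, and the function name are hypothetical, and the decoder call is left out.

```python
def traverse(z, index, low=-3.0, high=3.0, steps=7):
    """Return copies of z with z[index] swept over an evenly spaced grid.

    All other coordinates are kept at their original values.
    """
    if steps < 2:
        raise ValueError("need at least two steps to define a grid")
    stride = (high - low) / (steps - 1)
    codes = []
    for s in range(steps):
        code = list(z)
        code[index] = low + s * stride
        codes.append(code)
    return codes


base = [0.5] * 16            # a 16-dimensional latent code, matching the bottleneck size
codes = traverse(base, index=0)
print([code[0] for code in codes])
```

Each returned code would be passed through the decoder to produce one image of the traversal row.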

β-VAE performs decently for the controlled attribute change, but the individual $z$ in β-VAE is not fully entangled or disentangled with the attribute. We observe that the traversed images contain several attribute changes at the same time. Unlike our Guided-VAE, β-VAE cannot specify which latent variables encode specific attribute information. Guided-VAE, however, is designed to allow defined latent variables to encode any specific attributes. Guided-VAE outperforms β-VAE by traversing only the intended factors (smile, gender) without changing other factors (hair color, baldness).

4.2.2 Quantitative Evaluation

We attempt to interpret whether the disentangled attribute variables can control the generated images from the supervised Guided-VAE. We pre-train an external binary classifier for the $t$-th attribute on the CelebA training set and then use this classifier to test the generated images from Guided-VAE. Each test uses generated images randomly sampled on all latent variables except for the particular latent variable $z_t$ we decide to control. As Figure 6 shows, we can draw the confidence-$z_t$ curves of the $t$-th attribute, where $z_t$ is traversed over a range at a fixed stride length. For the gender and the smile attributes, it can be seen that the corresponding $z_t$ is able to enable and disable the attribute of the generated image, which shows the controlling ability of the $t$-th attribute by tuning the corresponding latent variable $z_t$.

Figure 6: Experts (high-performance external classifiers for attribute classification) prediction for being negatives on the generated images. We traverse $z_1$ (gender) and $z_2$ (smile) separately to generate images for the classification test.

4.2.3 Image Interpolation

We further show the disentanglement properties of supervised Guided-VAE on the CIFAR10 dataset. ALI-VAE borrows the architecture defined in [11], where we treat $G_z$ as the encoder and $G_x$ as the decoder. This enables us to optimize an additional reconstruction loss. Based on ALI-VAE, we implement Guided-ALI-VAE (Ours), which adds supervised guidance through the excitation and inhibition shown in Figure 1. ALI-VAE and AC-GAN [3] serve as baselines for this experiment.

To analyze the disentanglement of the latent space, we train each of these models on a subset of the CIFAR10 dataset [31] (Automobile, Truck, Horse) where the class label corresponds to the attribute to be controlled. We use a bottleneck size of 10 for each of these models. We follow the training procedure mentioned in [3] for training the AC-GAN model and the optimization parameters reported in [11] for ALI-VAE and our model. For our Guided-ALI-VAE model, we add supervision through inhibition and excitation on the guided latent variables.

Model                    Automobile-Horse ↓   Truck-Automobile ↓
AC-GAN [3]               88.27                81.13
ALI-VAE                  91.96                78.92
Guided-ALI-VAE (Ours)    85.43                72.31
Table 4: Image Interpolation: FID score measured for a subset of CIFAR10 [31] with two classes each. [↓ means lower is better] ALI-VAE is a modification of the architecture defined in [11].

To visualize the disentanglement in our model, we interpolate the corresponding $z$, $z_t$ and $z_t^{rst}$ of two images sampled from different classes. The interpolation here is computed as a uniformly spaced linear combination of the corresponding vectors. The results in Figure 7 qualitatively show that our model successfully captures complementary features in $z_t$ and $z_t^{rst}$. Interpolation in $z_t$ corresponds to changing the object type, whereas interpolation in $z_t^{rst}$ corresponds to complementary features such as the color and pose of the object.
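A uniformly spaced linear combination of two latent vectors, applied either to the whole code or to a chosen subset of coordinates, can be sketched as follows. The function names and the index-based way of selecting the guided part are assumptions for illustration.

```python
def interpolate(z_a, z_b, steps=5):
    """Uniformly spaced linear combinations (1 - w) * z_a + w * z_b, w from 0 to 1."""
    path = []
    for s in range(steps):
        w = s / (steps - 1)
        path.append([(1 - w) * a + w * b for a, b in zip(z_a, z_b)])
    return path


def interpolate_part(z_a, z_b, indices, steps=5):
    """Interpolate only the coordinates in `indices`; the rest stay at z_a."""
    path = []
    for s in range(steps):
        w = s / (steps - 1)
        z = list(z_a)
        for i in indices:
            z[i] = (1 - w) * z_a[i] + w * z_b[i]
        path.append(z)
    return path


full = interpolate([0.0, 0.0], [1.0, 2.0], steps=3)
guided_only = interpolate_part([0.0, 0.0, 0.0], [4.0, 4.0, 4.0], indices=[0], steps=3)
print(full)
print(guided_only)
```

Interpolating only the guided coordinates while holding the rest fixed is what isolates the change in object type from changes in color or pose.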

The right column in Figure 7 shows that our model can traverse in $z_t$ to change the object in the image from an automobile to a truck, whereas a traversal in $z_t^{rst}$ changes other features such as the background and the orientation of the automobile. We replicate the procedure on ALI-VAE and AC-GAN and show that these models are not able to consistently traverse in $z_t$ and $z_t^{rst}$ in a similar manner. Our model also produces interpolated images of higher quality, as shown by the FID scores [19] in Table 4.

Figure 7: Interpolation of images in $z$, $z_t$ and $z_t^{rst}$ for AC-GAN, ALI-VAE and Guided-ALI-VAE (Ours).

4.3 Few-Shot Learning

Previously, we have shown that Guided-VAE can perform image synthesis and interpolation and form a better representation for the classification task. Similarly, we can apply our supervised method to VAE-like models in few-shot classification. Specifically, we apply our adversarial excitation and inhibition formulation to the Neural Statistician [13] by adding a supervised guidance network after the statistic network. The supervised guidance signal is the label of each input. We also apply the Mixup method [54] in the supervised guidance network. However, we could not reproduce the exact results reported for the Neural Statistician, an issue also noted in [30]. For comparison, we mainly consider results from Matching Nets [51] and Bruno [30], shown in Table 5. Although it cannot outperform Matching Nets, our proposed Guided Neural Statistician reaches performance comparable to Bruno (discriminative), where a discriminative objective is fine-tuned to maximize the likelihood of correct labels.

Model                          5-way               20-way
Omniglot                       1-shot    5-shot    1-shot    5-shot
Pixels [51]                    41.7%     63.2%     26.7%     42.6%
Baseline Classifier [51]       80.0%     95.0%     69.5%     89.1%
Matching Nets [51]             98.1%     98.9%     93.8%     98.5%
Bruno [30]                     86.3%     95.6%     69.2%     87.7%
Bruno (discriminative) [30]    97.1%     99.4%     91.3%     97.8%
                               97.7%     99.4%     91.4%     96.4%
Ours (discriminative)          97.8%     99.4%     92.1%     96.6%
Table 5: Few-shot classification: Classification accuracy for a few-shot learning task on the Omniglot dataset.

5 Ablation Study

5.1 Deformable PCA

In Figure 8, we visualize the sampling results from PCA and $Dec_{sub}$. By adding a deformation layer to the PCA-like layer, deformable PCA produces crisper samples than vanilla PCA.

Figure 8: (Top) Sampling Result Obtained from PCA (Bottom) Sampling Result obtained from learned deformable PCA (Ours)

5.2 Guided Autoencoder

To further validate our concept of “guidance”, we introduce our lightweight decoder to the standard autoencoder (AE) framework. We conduct MNIST classification tasks using the same setting as in Figure 2. As Table 6 shows, our lightweight decoder improves the representation learned in the autoencoder framework. A VAE-like structure is indeed not needed if the purpose is just reconstruction and representation learning. However, VAE is of great importance in building generative models. The modeling of the latent space of $z$ with, e.g., Gaussian distributions is again important if a probabilistic model is needed to perform novel data synthesis (e.g., the images shown in Figure 4 and Figure 5).

Auto-Encoder (AE)    1.37% ± 0.05    1.06% ± 0.04    1.34% ± 0.04
Guided-AE (Ours)     1.46% ± 0.06    1.00% ± 0.06    1.10% ± 0.08
Table 6: Classification error over AE and Guided-AE on MNIST.

5.3 Geometric Transformations

We conduct an experiment by excluding the geometry-guided part from the unsupervised Guided-VAE. In this way, the lightweight decoder is just a PCA-like decoder rather than a deformable PCA. The setting of this experiment is exactly the same as described for Figure 2. The bottleneck size of our model is set to 10, of which the first two latent variables represent the rotation and scaling information respectively. As a comparison, we drop the geometric guidance so that all 10 latent variables are controlled by the PCA-like light decoder. As shown in Figure 9 (a) and (b), geometry information is hardly encoded into the first two latent variables without the geometry-guided part.

Figure 9: Ablation study on Unsupervised Guided-VAE and Supervised Guided-VAE. (a) Unsupervised Guided-VAE without Geometric Guidance. (b) Unsupervised Guided-VAE with Geometric Guidance. (c) Supervised Guided-VAE without Inhibition. (d) Supervised Guided-VAE with Inhibition.

5.4 Adversarial Excitation and Inhibition

We study the effectiveness of adversarial inhibition using the exact same setting described in the supervised Guided-VAE part. As shown in Figure 9 (c) and (d), Guided-VAE without inhibition changes the smiling and sunglasses while traversing the latent variable controlling the gender information. This problem is alleviated by introducing the excitation-inhibition mechanism into Guided-VAE.

6 Conclusion

In this paper, we have presented a new representation learning method, guided variational autoencoder (Guided-VAE), for disentanglement learning. Both unsupervised and supervised versions of Guided-VAE utilize lightweight guidance to the latent variables to achieve better controllability and transparency. Improvements in disentanglement, image traversal, and meta-learning over the competing methods are observed. Guided-VAE maintains the backbone of VAE and it can be applied to other generative modeling applications.

Acknowledgment. This work is funded by NSF IIS-1618477, NSF IIS-1717431, and Qualcomm Inc. ZD is supported by the Tsinghua Academic Fund for Undergraduate Overseas Studies. We thank Kwonjoon Lee, Justin Lazarow, and Jilei Hou for valuable feedback.


  • [1] A. Achille and S. Soatto (2018) Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research 19 (1), pp. 1947–1980. Cited by: §2.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In ICML, Cited by: §2.
  • [3] A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier GANs. In ICML, Cited by: §4.2.3, §4.2.3, Table 4.
  • [4] M. Belkin and P. Niyogi (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation 15 (6), pp. 1373–1396. Cited by: §1.
  • [5] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam (2018) Optimizing the latent space of generative networks. In ICML, Cited by: §2.
  • [6] H. Bourlard and Y. Kamp (1988) Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics 59 (4-5), pp. 291–294. Cited by: §1.
  • [7] E. J. Candès, X. Li, Y. Ma, and J. Wright (2011) Robust principal component analysis? Journal of the ACM (JACM) 58 (3), pp. 11. Cited by: §2, §3.2.
  • [8] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, Cited by: §2, §4.1.2, §4.1.2, Table 1, Table 2.
  • [9] E. Creager, D. Madras, J. Jacobsen, M. Weis, K. Swersky, T. Pitassi, and R. Zemel (2019) Flexibly fair representation learning by disentanglement. In ICML, Cited by: §2.
  • [10] B. Dai, Y. Wang, J. Aston, G. Hua, and D. Wipf (2018) Connections with robust pca and the role of emergent sparsity in variational autoencoder models. The Journal of Machine Learning Research 19 (1), pp. 1573–1614. Cited by: §2.
  • [11] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2017) Adversarially learned inference. In ICLR, Cited by: §4.2.3, §4.2.3, Table 4.
  • [12] E. Dupont (2018) Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, Cited by: Figure 2, §4.1.1, §4.
  • [13] H. Edwards and A. Storkey (2017) Towards a neural statistician. In ICLR, Cited by: §4.3.
  • [14] J. Engel, M. Hoffman, and A. Roberts (2018) Latent constraints: learning to generate conditionally from unconditional generative models. In ICLR, Cited by: §2.
  • [15] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §2, §3.3.
  • [16] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio (2018) Image-to-image translation for cross-domain disentanglement. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [17] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. Vol. 1, MIT Press. Cited by: §1.
  • [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, Cited by: §1, §2.
  • [19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, Cited by: §4.2.3.
  • [20] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §2, Figure 2, Figure 4, §4.1.1, §4.1.1, §4.1.2, §4.1.2, §4.1.2, Table 1, Table 2, §4.
  • [21] G. E. Hinton and R. S. Zemel (1994) Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, Cited by: §1.
  • [22] H. Hotelling (1933) Analysis of a complex of statistical variables into principal components. Journal of educational psychology 24. Cited by: §1, §2.
  • [23] Q. Hu, A. Szabó, T. Portenier, P. Favaro, and M. Zwicker (2018) Disentangling factors of variation by mixing them. In CVPR, Cited by: §2.
  • [24] M. Ilse, J. M. Tomczak, C. Louizos, and M. Welling (2019) DIVA: domain invariant variational autoencoders. In ICLR Workshop Track, Cited by: §2.
  • [25] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, Cited by: §3.2.
  • [26] A. H. Jha, S. Anand, M. Singh, and V. Veeravasarapu (2018) Disentangling factors of variation with cycle-consistent variational auto-encoders. In ECCV, Cited by: §2.
  • [27] I. Jolliffe (2011) Principal component analysis. Springer Berlin Heidelberg. Cited by: §1, §2.
  • [28] H. Kim and A. Mnih (2018) Disentangling by factorising. In ICML, Cited by: §2, §4.1.1, §4.1.2, §4.1.2, Table 1, Table 2, §4.
  • [29] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §1, §1, §2, §2, Figure 2, §3.1, §4.1.1, §4.1.2, Table 1, Table 2.
  • [30] I. Korshunova, J. Degrave, F. Huszár, Y. Gal, A. Gretton, and J. Dambre (2018) BRUNO: a deep recurrent model for exchangeable data. In Advances in Neural Information Processing Systems, Cited by: §4.3, Table 5.
  • [31] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.2.3, Table 4, §4.
  • [32] A. Kumar, P. Sattigeri, and A. Balakrishnan (2018) Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, Cited by: §4.1.2.
  • [33] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §4.
  • [34] Y. LeCun (1987) Modeles connexionnistes de l'apprentissage. Ph.D. Thesis, Universite Paris 6. Cited by: §1.
  • [35] Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.
  • [36] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu (2015) Deeply-supervised nets. In Artificial intelligence and statistics, pp. 562–570. Cited by: §2.
  • [37] J. Lin, Z. Chen, Y. Xia, S. Liu, T. Qin, and J. Luo (2019) Exploring explicit domain supervision for latent space disentanglement in unpaired image-to-image translation. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, §2.
  • [38] Y. Liu, Y. Yeh, T. Fu, S. Wang, W. Chiu, and Y. Frank Wang (2018) Detach and adapt: learning cross-domain disentangled deep representation. In CVPR, Cited by: §2.
  • [39] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In ICCV, Cited by: §4.2.1, §4.
  • [40] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2016) Adversarial autoencoders. In ICLR Workshop Track, Cited by: §2.
  • [41] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [42] L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner (2017) DSprites: disentanglement testing sprites dataset. Cited by: §4.1.1, §4.
  • [43] B. K. Murphy and K. D. Miller (2003) Multiplicative gain changes are induced by excitation or inhibition alone. Journal of Neuroscience 23 (31), pp. 10040–10051. Cited by: §2.
  • [44] X. Peng, X. Yu, K. Sohn, D. N. Metaxas, and M. Chandraker (2017) Reconstruction-based disentanglement for pose-invariant face recognition. In ICCV, Cited by: §2.
  • [45] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma (2012) RASL: robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE transactions on pattern analysis and machine intelligence 34 (11), pp. 2233–2246. Cited by: §2.
  • [46] C. Poultney, S. Chopra, Y. LeCun, et al. (2007) Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems, Cited by: §1.
  • [47] J. Song, P. Kalluri, A. Grover, S. Zhao, and S. Ermon (2019) Learning controllable fair representations. In The 22nd International Conference on Artificial Intelligence and Statistics, Cited by: §2.
  • [48] A. Szabó, Q. Hu, T. Portenier, M. Zwicker, and P. Favaro (2018) Challenges in disentangling independent factors of variation. In ICLR Workshop Track, Cited by: §2.
  • [49] M. E. Tipping and C. M. Bishop (1999) Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3), pp. 611–622. Cited by: §3.2.
  • [50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11 (Dec), pp. 3371–3408. Cited by: §1.
  • [51] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, Cited by: §4.3, Table 5.
  • [52] M. P. Wellman and M. Henrion (1993) Explaining 'explaining away'. IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (3), pp. 287–292. Cited by: §2.
  • [53] O. Yizhar, L. E. Fenno, M. Prigge, F. Schneider, T. J. Davidson, D. J. O'Shea, V. S. Sohal, I. Goshen, J. Finkelstein, J. T. Paz, et al. (2011) Neocortical excitation/inhibition balance in information processing and social dysfunction. Nature 477 (7363), pp. 171. Cited by: §2.
  • [54] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In ICLR, Cited by: §4.3.