Inference-InfoGAN: Inference Independence via Embedding Orthogonal Basis Expansion

by   Hongxiang Jiang, et al.
Beihang University

Disentanglement learning aims to construct independent and interpretable latent variables, for which generative models are a popular strategy. InfoGAN is a classic method that maximizes Mutual Information (MI) to obtain interpretable latent variables mapped to the target space, but it does not emphasize independence. To explicitly infer latent variables with inter-independence, we propose a novel GAN-based disentanglement framework that embeds Orthogonal Basis Expansion (OBE) into the InfoGAN network (Inference-InfoGAN) in an unsupervised way. Under the OBE module, a set of orthogonal basis vectors can be adaptively found to expand arbitrary data with the independence property. To ensure target-wise interpretable representation, we add a consistency constraint between the expansion coefficients and the latent variables on top of MI maximization. Additionally, we design an alternating optimization step over the consistency constraint and the orthogonality requirement, which makes training Inference-InfoGAN more convenient. Finally, experiments validate that the proposed OBE module obtains an adaptive orthogonal basis, which expresses independent characteristics better than the fixed basis of the Discrete Cosine Transform (DCT). To assess performance on downstream tasks, we compare with state-of-the-art GAN-based and even VAE-based approaches on different datasets. Our Inference-InfoGAN achieves higher disentanglement scores in terms of the FactorVAE, Separated Attribute Predictability (SAP), Mutual Information Gap (MIG) and Variation Predictability (VP) metrics without model fine-tuning. All experimental results illustrate that our method has inter-independence inference ability owing to the OBE module, and provides a good trade-off between it and the target-wise interpretability of latent variables via the alternating optimization.






Unsupervised representation learning is helpful for downstream tasks such as classification, reinforcement learning, image translation, spatiotemporal disentanglement and zero-shot learning with insufficient data Bengio et al. (2013); Chen et al. (2016); Donà et al. (2021); Lake et al. (2017); Wu et al. (2019). It aims to extract features from unlabeled data that are convenient for subsequent processing. Disentangled features are particularly important because of their better independence and interpretability. Therefore, unsupervised disentanglement learning has attracted increasing attention in recent years.

It is natural to use generative models for unsupervised disentanglement learning, because disentanglement can be observed through noticeable and distinct changes in the generated data as a single latent variable varies. Such a model implicitly captures the relationship between data and latent variables. The most mainstream generative models in unsupervised disentanglement learning are based on the Variational AutoEncoder (VAE) Kingma and Welling (2014) or the Generative Adversarial Network (GAN) Goodfellow et al. (2014).

Disentanglement learning emphasizes the inter-independence and target-wise interpretability of latent variables Eastwood and Williams (2018). Since the VAE objective function naturally involves these characteristics, VAE-based disentanglement approaches mostly modify the form of the objective function. β-VAE Higgins et al. (2017) is a classic work that adjusts the weight of the independence term with an extra hyperparameter β in the original VAE objective. Building on it, subsequent improvements separate out a Total Correlation (TC) term Watanabe (1960) that constrains independence more directly Chen et al. (2018); Jeong and Song (2019); Kim and Mnih (2018).

To address the uncertainty of model identifiability caused by the lack of inductive bias in the above methods, some researchers combine causal inference or nonlinear Independent Component Analysis (ICA) Khemakhem et al. (2020); Yang et al. (2021). In general, these methods have strong disentanglement ability but tend to generate low-quality images. To improve the quality of the generated images, GAN-based approaches have been proposed. For example, InfoGAN Chen et al. (2016) generates images correlated with latent variables by maximizing the Mutual Information (MI) between them. Although it can generate clear images, its disentanglement performance is limited because the independence of latent variables is ignored. Many subsequent approaches therefore embed contrastive learning Deng et al. (2020); Lin et al. (2020); Zhu et al. (2020), VAEs Mathieu et al. (2016), and Conditional Batch Normalization (CBN) De Vries et al. (2017); Miyato and Koyama (2018); Nguyen-Phuoc et al. (2019) to constrain latent variables toward independence, among which InfoGAN-CR performs well Lin et al. (2020), even far exceeding VAE-based counterparts. To control the independence of latent variables, it uses contrastive latent sampling to introduce prior knowledge, indirectly exploiting self-supervised information.

In this work, we focus on building a GAN-based model with inter-independent inference ability between latent variables. To infer inter-independence while keeping target-wise interpretability, a novel Orthogonal Basis Expansion (OBE) module is proposed and embedded into the GAN-based model, named Inference-InfoGAN, which directly exploits self-supervised information in the data without introducing additional priors. Our main contributions are summarized in the following three aspects.

  • We introduce an Inference-InfoGAN framework, embedding the designed OBE module into InfoGAN structure. Owing to the OBE, an adaptive orthogonal basis can be obtained to represent the data and infer the inter-independence between latent variables. Following the InfoGAN structure, the target-wise interpretability remains similar to the traditional GAN-based methods.

  • We design an alternating optimization over the consistency constraint and the orthogonality requirement. This step balances inter-independent inference and target-wise interpretable consistency during the training of Inference-InfoGAN.

  • We compare our proposed method with state-of-the-art GAN-based and VAE-based models, aided by downstream tasks. The disentanglement performance is evaluated on unlabeled datasets with unsupervised disentanglement metrics, in addition to evaluation on labeled datasets. The experimental results demonstrate higher disentanglement, showing that our method mines richer self-supervised information without requiring specific supervision.

Related Work

InfoGAN-based models

InfoGAN is a model that combines GAN with information theory. In a GAN, a generator G and a discriminator D are trained following the idea of game theory. At the optimal solution, G maps the noise z to the real data distribution P_data, so that D cannot distinguish synthetic data from real data. The objective function of GAN is described as the following minimax game:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))],  (1)

where P_z represents a standard Gaussian distribution. Eq. 1 can be transformed into minimizing the Jensen-Shannon (JS) divergence between the real data distribution P_data and the generated data distribution P_G. Subsequent research on stabilizing its training and improving its performance introduces the Wasserstein Distance from optimal transport theory Villani (2009) to replace the JS divergence Arjovsky et al. (2017); Gulrajani et al. (2017); Wu et al. (2018). Furthermore, SNGAN Miyato et al. (2018) uses spectral normalization to ensure the Lipschitz constraint that is necessary for optimizing the Wasserstein Distance.
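The JS-divergence connection stated above can be verified numerically: for the optimal discriminator D*(x) = p_r(x) / (p_r(x) + p_g(x)), the value function equals 2·JS(p_r ‖ p_g) − 2 log 2. The sketch below checks this identity on toy discrete distributions (the distributions are illustrative, not from the paper):

```python
import numpy as np

# Toy "real" and "generated" distributions over three support points.
p_r = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.3, 0.5])

d_star = p_r / (p_r + p_g)        # optimal discriminator per support point

# GAN value function V(D*, G) = E_{p_r}[log D*] + E_{p_g}[log(1 - D*)]
value = np.sum(p_r * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# Jensen-Shannon divergence (natural log), via the mixture m.
m = 0.5 * (p_r + p_g)
kl = lambda p, q: np.sum(p * np.log(p / q))
js = 0.5 * kl(p_r, m) + 0.5 * kl(p_g, m)

print(np.allclose(value, 2.0 * js - 2.0 * np.log(2.0)))  # True
```

At the global optimum p_g = p_r, the JS divergence vanishes and the value function reaches its minimum of −2 log 2.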

Generally, vanilla GAN is a breakthrough work in generating high-quality images, but the generation is random and uncontrolled. To guide the generation of images with specific interpretable characteristics, InfoGAN was developed. It applies GAN to the field of unsupervised disentanglement learning for the first time, and is competitive with supervised models. InfoGAN achieves interpretable feature representation by maximizing the MI between the partial latent variables c and the images G(z, c). It proposes the following information-regularized minimax game:

\min_G \max_D V_I(D, G) = V(D, G) - \lambda I(c; G(z, c)).  (2)

This method defines a variational lower bound of the MI:

L_I(G, Q) = \mathbb{E}_{c \sim P(c),\, x \sim G(z, c)}[\log Q(c \mid x)] + H(c) \le I(c; G(z, c)),  (3)
and uses a neural network Q to fit the variational distribution Q(c|x) so that the MI can be estimated and maximized. However, maximizing the MI alone is insufficient to achieve inter-independence. Recently, several advanced methods have tried to improve InfoGAN. For example, IB-GAN Jeon et al. (2018) is inspired by Information Bottleneck (IB) theory Shwartz-Ziv and Tishby (2017) and adds a capacity regularization for the MI directly. OOGAN Liu et al. (2020) adds orthogonal regularization by constraining network weights of the encoder. InfoGAN-CR Lin et al. (2020) introduces contrastive learning to provide additional prior knowledge and exploit self-supervised information indirectly.
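The variational lower bound E[log Q(c|x)] + H(c) ≤ I(c; x) can be checked numerically on a small discrete joint distribution. The sketch below (illustrative, not the paper's code; the joint is random) shows the bound is tight when Q equals the true posterior and looser for a mismatched Q:

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 5))
joint /= joint.sum()              # toy joint p(c, x): c has 4 values, x has 5

p_c = joint.sum(axis=1)           # marginal p(c)
p_x = joint.sum(axis=0)           # marginal p(x)
post = joint / p_x                # true posterior p(c|x), column-normalized

h_c = -np.sum(p_c * np.log(p_c))                         # H(c)
mi = np.sum(joint * np.log(joint / np.outer(p_c, p_x)))  # I(c; x)

# Bound with the exact posterior as Q: tight (equals the MI).
bound_exact = np.sum(joint * np.log(post)) + h_c
# Bound with a mismatched Q(c|x) that ignores x: looser (here, exactly 0).
q_bad = np.tile(p_c[:, None], (1, 5))
bound_bad = np.sum(joint * np.log(q_bad)) + h_c
```

With Q(c|x) = p(c), the expectation term reduces to −H(c), so the bound collapses to 0; any dependence between c and x then shows up as slack that training Q can recover.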

β-VAE-based models

β-VAE simply changes the objective function of VAE to:

L = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \beta\, D_{KL}(q_\phi(z \mid x) \,\|\, p(z)),  (4)

where z denotes the latent variables that control the generated images x, and q_\phi(z|x), p_\theta(x|z) are fitted by two neural networks. β-VAE maximizes the Evidence Lower-Bound (ELBO) while setting β > 1 to increase the penalty on the KL term (β = 1 in the vanilla VAE). When p(z) is defined as an independent Gaussian prior, the larger β is, the more q_\phi(z|x) tends toward an independent distribution, which enhances independent disentangled representation but reduces the quality of image reconstruction.

Subsequent research introduces a variational distribution q(z) and separates a new TC regularization term from the KL term:

TC(z) = D_{KL}\big(q(z) \,\|\, \textstyle\prod_j q(z_j)\big),  (5)

which directly constrains the independence of latent variables. However, because the TC term is intractable, β-TCVAE Chen et al. (2018) uses an approximate method to estimate the TC, while FactorVAE Kim and Mnih (2018) estimates the distribution q(z) by adding a discriminator.
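For intuition, the TC term has a closed form when q(z) is Gaussian, which makes it easy to verify that TC vanishes exactly for independent dimensions. A minimal numpy sketch (illustrative; the covariances are arbitrary toy values):

```python
import numpy as np

# For a zero-mean Gaussian with covariance S,
# TC = KL( q(z) || prod_j q(z_j) ) = 0.5 * (sum_j log S_jj - log det S),
# which is 0 iff S is diagonal (i.e. the dimensions are independent).
def gaussian_tc(cov):
    cov = np.asarray(cov, dtype=float)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - np.linalg.slogdet(cov)[1])

tc_indep = gaussian_tc(np.diag([1.0, 2.0, 0.5]))   # independent dims -> 0
tc_corr = gaussian_tc([[1.0, 0.8], [0.8, 1.0]])    # correlated dims -> > 0
```

Any off-diagonal correlation shrinks the determinant relative to the product of the variances, so the TC penalty grows exactly with the statistical dependence between latent dimensions.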


Method

InfoGAN can generate high-quality images from controlled latent variables but with less independence, while β-VAE can construct highly independent latent variables but generates lower-quality images. To integrate the high-quality images of InfoGAN and the highly independent latent variables of β-VAE, we propose a novel independent inference framework based on InfoGAN in this paper.

Following the generative model, real image data x comes from a continuous control vector c and uncontrollable noise z; namely, there is an optimal mapping G that satisfies x = G(c, z). A discriminator D distinguishes the distributions of real and generated data. To facilitate a stable search for the optimal mapping, the adversarial loss is applied in the same form as in WGAN-div:

L_{adv} = \mathbb{E}_{x \sim P_{data}}[D(x)] - \mathbb{E}_{\hat{x} \sim P_G}[D(\hat{x})] + k\, \mathbb{E}_{\tilde{x}}\big[\| \nabla_{\tilde{x}} D(\tilde{x}) \|^p\big],  (6)

where \tilde{x} comes from a linear combination of real and generated data. Based on these notations, an unsupervised disentanglement problem can be modelled as: each dimension in c independently controls an interpretable feature of x. To promote the expression of independent characteristics, we perform latent-variable inference on the generated image based on its adaptive orthogonal basis, which is obtained from our specially designed OBE module. Additionally, we propose an integrated Inference-MI loss L_I(G, Q, P), where Q is an encoder based on InfoGAN, and P is an adaptively obtained orthogonal matrix that further restricts the inter-independence of c. To ensure the orthogonality of P, we introduce orthogonal regularization. Orthogonal regularization often appears as a constraint on a network weight matrix W, which satisfies W^T W = I.

Most approaches construct a loss function based on this property. For example, Brock et al. (2017) achieve this goal by minimizing:

L = \big\| W^T W - I \big\|_F^2,  (7)

BigGAN Brock et al. (2019) optimizes by minimizing the Frobenius norm of the matrix, and SNGAN iteratively calculates the 2-norm of the matrix Miyato et al. (2018). In addition, there are methods based on Singular Value Decomposition (SVD) to satisfy orthogonal constraints Li et al. (2021). To quickly find an orthogonal matrix P, similar to Eq. 7, we minimize the following new objective function:

L_{ortho}(P) = \big\| P^T P - I \big\|_F^2.  (8)

Based on the orthogonal matrix P, we can perform disentanglement-related inference on the generated image. Generally, data x in a finite-dimensional space can be expanded as follows:

x = \sum_{i=1}^{n} a_i b_i, \quad a_i = \langle x, b_i \rangle,  (9)

where \{b_i\}_{i=1}^{n} denotes the standard orthogonal basis and n denotes the number of basis vectors. Using the elements of P, the orthonormal basis and the expansion of x can be explicitly expressed: each basis image is built from rows of P, and the expansion coefficients are the projections of x onto these basis images. Thus, when two basis images share the same indices, the sum of their Hadamard product is 1, and otherwise it is 0, i.e. a standard orthogonal basis has been found. The coefficients corresponding to different basis vectors can exert independent and strongly correlated effects on x. Therefore, an excellent orthogonal matrix P, generator G, and encoder Q jointly achieve the following effect: if the controllable latent variables c are changed, the specific features of the image mapped by G change accordingly. This is manifested in the increase of the MI, estimated by Q, between the images and the latent variables, and in the change of the coefficients given by the OBE inference on the images.
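Assuming the two-dimensional expansion takes the separable form A = P X Pᵀ, as it does for the DCT basis in Appendix C, the expansion and the orthogonality loss can be sketched in a few lines of numpy (a hypothetical illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Any square matrix yields an orthogonal P via QR decomposition.
P, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Orthogonality loss || P^T P - I ||_F^2 (cf. the regularizer in Eq. 8);
# it is already ~0 for this P.
ortho_loss = np.linalg.norm(P.T @ P - np.eye(n), ord="fro") ** 2

X = rng.standard_normal((n, n))    # stand-in for a generated image
A = P @ X @ P.T                    # expansion coefficients on the basis
X_rec = P.T @ A @ P                # exact reconstruction from coefficients

# Basis images b_uv = outer(p_u, p_v) are orthonormal under the elementwise
# (Hadamard) product: sum(b_uv * b_ij) is 1 iff (u, v) == (i, j).
b_01 = np.outer(P[0], P[1])
b_23 = np.outer(P[2], P[3])
```

In training, P would instead be driven toward orthogonality by minimizing the loss above rather than being produced by QR; the expansion and reconstruction steps are identical in either case.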

Accordingly, we design the following maximization goal:

\max_{G, P} \; I(c; G(z, c)) - D_{KL}\Big( P(c \mid G(z, c)) \,\Big\|\, \textstyle\prod_i P(c_i \mid G(z, c)) \Big).  (10)

Maximizing Eq. 10 therefore maximizes the first mutual information term and minimizes a Kullback-Leibler (KL) divergence, which pushes the unknown real conditional distribution P(c | x) toward an independent distribution \prod_i P(c_i | x); this distribution can be expressed with Eq. 9 as follows:

Figure 1: Inference-InfoGAN module. InfoGAN comprises only the blue part of the model; Inference-InfoGAN adds the branch below, where P represents the parameters of our proposed OBE module, adaptively obtaining an orthogonal basis and inferring latent variables with inter-independence.

where c_i is the i-th element of c. To optimize Eq. 10, we further use the technique of Variational Information Maximization to derive its lower bound L_I(G, Q, P), similar to InfoGAN Barber and Agakov (2003). The detailed proof is given in Appendix A:

L_I(G, Q, P) = \alpha\, \mathbb{E}_{c \sim P(c),\, x \sim G(z, c)}\big[\log \tilde{P}(c \mid x)\big] + \mathbb{E}_{c \sim P(c),\, x \sim G(z, c)}\big[\log Q(c \mid x)\big],  (12)

where \tilde{P}(c \mid x) denotes the OBE-based inference distribution and the hyperparameter \alpha weights the two terms in L_I: the first constrains the inter-independence and the consistency between the OBE-based inference results and the latent variables, and the second emphasizes target-wise representation. The larger \alpha is, the greater the penalty on the independence of latent variables; a smaller \alpha makes the model more similar to vanilla InfoGAN, resulting in poorer disentanglement performance. L_I can be simply optimized with the Monte Carlo method.

Overall, our total objective function is composed of the adversarial loss L_{adv}, the Inference-MI loss L_I and the orthogonality loss L_{ortho}, with non-negative scalars \lambda_1 and \lambda_2:

L = L_{adv} - \lambda_1 L_I(G, Q, P) + \lambda_2 L_{ortho}(P).
Based on the above method, we build the model in Figure 1.

In fact, L_{ortho} only contributes gradients to the parameters of the OBE module. The other parts optimize all parameters, constraining the interpretability of the generated images and their consistency with the latent variables. To balance the training of the orthogonal basis and the consistency constraint, we design an alternating optimization step: after updating G, D, and Q according to L_{adv} and L_I in each iteration, we fix the parameters of G, D, and Q, and repeatedly train P alone according to L_{ortho}; when L_{ortho} falls below an upper bound ε, we proceed to the next iteration.

This method balances the orthogonality of the basis against the other constraints. Although the training speed of each epoch is slightly reduced, it is more convenient to find a model that performs well. The pseudocode of the entire model training is given in Appendix B.
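The alternating step can be sketched as follows; the loop structure mirrors the description above, while the mocked-out G/D/Q updates, the gradient-descent update for P, the learning rate, and the bound eps are all illustrative assumptions rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, lr = 6, 1e-6, 0.01

# Start from a perturbed identity, i.e. a not-quite-orthogonal P.
P = np.eye(n) + 0.1 * rng.standard_normal((n, n))

def ortho_loss(P):
    """L_ortho = || P^T P - I ||_F^2."""
    R = P.T @ P - np.eye(len(P))
    return float(np.sum(R * R))

for t in range(5):                                # outer training iterations
    # ... here G, D and Q would be updated from L_adv and L_I ...
    steps = 0
    while ortho_loss(P) > eps and steps < 5000:   # inner loop: train P alone
        # gradient of || P^T P - I ||_F^2 with respect to P
        P -= lr * 4.0 * P @ (P.T @ P - np.eye(n))
        steps += 1
    # once L_ortho is below eps, proceed to the next outer iteration
```

After the first outer iteration drives P to orthogonality, the inner loop terminates immediately in later iterations unless the main updates perturb P again.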


Datasets and evaluation metrics

To demonstrate the performance of our method, we conduct qualitative and quantitative experiments on two traditional datasets: the dSprites Matthey et al. (2017) dataset, which has available disentangling annotations, and the CelebA Liu et al. (2015) dataset, which lacks corresponding disentanglement labels. For quantitative evaluation, we adopt three evaluation metrics on the dSprites dataset, namely FactorVAE, Separated Attribute Predictability (SAP), and Mutual Information Gap (MIG) Chen et al. (2018); Kim and Mnih (2018); Kumar et al. (2018). On the other hand, inspired by Locatello et al. (2019), for a fair comparison showing that our work has stable and superior performance on unlabeled datasets without specific supervision information for model fine-tuning, two metrics are used for the CelebA dataset: the Variation Predictability (VP) disentanglement metric Zhu et al. (2020) and the image-quality metric Fréchet Inception Distance (FID) Heusel et al. (2017).
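For concreteness, MIG can be sketched on discrete toy data as the gap between the two largest mutual informations between a ground-truth factor and the latent codes, normalized by the factor's entropy. The data and helper below are illustrative assumptions, not the evaluation code used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def discrete_mi(a, b):
    """Mutual information (nats) between two discrete integer arrays."""
    joint = np.histogram2d(a, b, bins=(a.max() + 1, b.max() + 1))[0]
    joint /= joint.sum()
    pa, pb = joint.sum(1), joint.sum(0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))

factor = rng.integers(0, 10, 5000)          # one ground-truth factor
codes = [factor.copy(),                     # code 0: perfectly aligned
         rng.integers(0, 10, 5000)]         # code 1: independent noise

mis = sorted(discrete_mi(factor, c) for c in codes)
h = discrete_mi(factor, factor)             # H(factor) = I(factor; factor)
mig = (mis[-1] - mis[-2]) / h               # close to 1 here
```

A well-disentangled representation concentrates each factor's information in a single code, driving the second-largest MI toward zero and the MIG toward 1.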

In fact, the VP score is calculated by predicting which dimension of the latent variable has changed from a pair of images, generated by the trained generator with inputs that differ in only one latent dimension. Therefore, the VP score reflects performance on downstream tasks of classifying or clustering different features Locatello et al. (2020). It should be noted that downstream tasks on the CelebA dataset should not use our model to generate false information to deceive humans, even though such a potential misuse exists.

We use all the datasets and reference codes in accordance with their licenses. The source code of calculating metrics by Locatello et al. Locatello et al. (2019) uses the Apache license, FactorVAE uses the MIT license, the dSprites dataset uses the Apache license, and the CelebA dataset can be used for non-commercial research or educational purposes.

Implementation details

All models in the experiments are optimized with batchsize set to , using Adam optimizer Kingma and Ba (2015) with initial learning rate set to , set to and set to . The baseline model used in our experiments introduces the Wasserstein Distance and zero-mean penalty to InfoGAN, named InfoWGAN-GP (modified). In other words, InfoWGAN-GP (modified) just removes the OBE module in our models. For the dSprites datasets, we set the size of the generated image to , and control the dimension of the latent variables to , the dimension of the noise to . The optimal hyperparameter settings are , . For the CelebA datasets, we set the size of the generated image to , and control the dimension of the latent variables to , the dimension of the noise to . The hyperparameter , others are the same as the optimal settings on the dSprites datasets. For a multi-channel image, after OBE for each channel, there are multiple expansion coefficients on the same basis, so we add a layer of convolution to combine the multiple coefficients into one. When calculating the VP score, all models set the training set ratio to , epoch to , and the highest test result is selected. When calculating the FID, the number of selected images is , which can get a reliable result. All the experiments are carried out on a server with NVIDIA Ti GPUs.

Experiments on dSprites datasets

We compare the proposed method with four traditional VAE-based methods (β-VAE, FactorVAE, CasVAE and CasVAE-VP) and four GAN-based methods (InfoGAN, InfoWGAN-GP (modified), InfoGAN-CR and InfoGAN-CR (model selection)). In addition, Inference-InfoGAN (DCT) is also evaluated in this experiment, which instead adopts the Discrete Cosine Transform (DCT) basis for inference in the OBE module (see Appendix C).

For qualitative analysis, the latent traversals obtained by our proposed Inference-InfoGAN are shown in Figure 4(a). It can be observed that our method sufficiently expresses interpretable information, i.e. horizontal and vertical directions, rotation, scale and shape.

(a) Latent traversals trained on dSprites
(b) FactorVAE disentanglement score
Figure 4: Experimental results on the dSprites dataset. (a) Only by changing the value of the latent variables in a single dimension, the corresponding generated images will only change in a certain feature, which is also consistent with the dataset and has strong interpretability. (b) The curve shows the FactorVAE score for different methods.
Model FactorVAE SAP MIG
InfoWGAN-GP (modified)
InfoGAN-CR (model selection)
Inference-InfoGAN (DCT)
Inference-InfoGAN (, )
Inference-InfoGAN (, )
Inference-InfoGAN (, )
Best model in InfoGAN-CR (model selection)
Best model (ours) 0.966 0.658
Table 1: Disentanglement scores on the dSprites dataset. Disentanglement scores range over [0, 1], and 1 is the best. To enhance the robustness of the results, we only change the random seed and train our Inference-InfoGAN models multiple times. On the FactorVAE metric, no matter how the orthogonal basis is chosen in our method, the result is higher than other advanced GAN- or VAE-based work. On the SAP and MIG metrics, our method is slightly better than InfoGAN-CR.

The quantitative disentanglement results are shown in Figure 4(b) (FactorVAE disentanglement score curves) and Table 1, where the marked results are directly obtained from the papers Lin et al. (2020) and Zhu et al. (2020) respectively. Our model is steadily optimized as training proceeds, and eventually achieves the best FactorVAE performance among all the models in Table 1. Specifically, our proposed method improves FactorVAE, SAP and MIG over the baseline InfoWGAN-GP. Furthermore, it is higher than InfoGAN-CR, the state-of-the-art model of the past year, on all metrics. Even when InfoGAN-CR employs model selection by ModelCentrality, our FactorVAE score is still higher. In addition, our best model is better than the best model of InfoGAN-CR (model selection) on FactorVAE and SAP, and only slightly worse on MIG. Therefore, our work is competitive with the GAN-based disentanglement learning of InfoGAN-CR, while focusing on improving the independence of latent variables by seeking an adaptive orthogonal basis without any model-selection technique. In particular, Inference-InfoGAN obtains better results than its DCT version. It can thus be observed that the good performance greatly benefits from the proposed OBE module, which considers both interpretability and independence for the disentanglement task, and from the adaptive orthogonal basis, which expresses independent characteristics better than the fixed DCT strategy.

It should be noted that our OBE module, the alternating optimization, and the InfoGAN loss are all effective and critical components of the network. We find that removing the vanilla InfoGAN loss or the OBE module greatly reduces the disentanglement ability of the model. After replacing alternating optimization with one-step training, albeit the training process is more stable and the disentanglement ability evaluated by FactorVAE remains high, the model cannot reach the same level of performance on the other metrics. The detailed experimental evaluation is given in the ablation studies section.

Experiments on CelebA datasets

We compare the proposed method with one traditional VAE-based method (FactorVAE) and four GAN-based methods (InfoGAN, InfoWGAN-GP (modified), VPGAN and Inference-InfoGAN (DCT)).

In terms of qualitative evaluation, it can be seen from Figure 5 that our method obtains latent variables corresponding to the skin color, gender and other characteristics of the generated face images. Figure 4(a) and Figure 5 jointly illustrate that our proposed OBE module can infer target-wise interpretable latent variables via an adaptive orthogonal basis and further represent unlabeled data, mining richer information.

In terms of quantitative evaluation, as shown in Table 2, our model achieves the best results in both VP and FID compared with the other methods. Specifically, albeit a slight improvement over FactorVAE in VP, the proposed Inference-InfoGAN significantly surpasses FactorVAE in FID, indicating that the images generated by our model have higher quality. Comparisons with InfoGAN and InfoWGAN-GP also validate the effectiveness of the proposed model, where the OBE module contributes increments in VP and decrements in FID respectively. In comparison with VPGAN, the state-of-the-art method considering both VP and FID, our model achieves improvements in both metrics. To summarize, our method generates higher-quality images than VAE-based models and yields more independent latent variables than GAN-based models, i.e. it guarantees both image quality and disentanglement ability.

Figure 5: Latent traversals trained on CelebA. For the input -dimensional we select the dimensions with the most obvious disentanglement effects. They can obviously control the skin color, gender and other characteristics of the generated face images.
Figure 6: Correlation curve between the latent variables and the OBE inference results (on the trained basis)
Model VP FID
InfoWGAN-GP (modified)
Inference-InfoGAN (DCT) 0.757
Inference-InfoGAN 0.761 37.9
Table 2: Disentanglement and quality scores on CelebA dataset. The smaller the quality score FID, the clearer the images generated by the model, the lowest is .

Ablation studies

In this section, we perform experiments on dSprites to study the effects of our proposed OBE module, related hyperparameters, and alternating optimization technique on disentanglement.

OBE module

To analyze the effect of the OBE component, we observe the relationship between the latent variables and the inference results on the dSprites dataset. As shown in Figure 6, after adding the OBE module, a one-dimensional latent variable controls the change of the inference result on a specific basis vector, but has little effect on the others. In fact, when the DCT basis is selected, this independent influence is not obvious enough (see Appendix D). Therefore, adaptively training an orthogonal basis achieves better performance. Furthermore, whether the latent variables of InfoGAN are inferred by its encoder or by the adaptive orthogonal matrix of our Inference-InfoGAN, no independence is exhibited. This shows that InfoGAN does not emphasize the inter-independence of latent variables, not that our orthogonal matrix cannot express independent characteristics. According to the change of the inference results based on the proposed OBE module in our model, it can be seen that the latent variables are inter-independent.

In addition, to further reflect the importance of OBE, we conduct experiments under different values of the hyperparameter that controls the weight of the introduced loss; the experimental results are given in Table 1. It can be seen that a greater weight leads to a higher and more stable average disentanglement score. Consistent with the theoretical analysis in the Method section, the results also illustrate that the OBE component has a clear tendency to improve disentanglement. In general, emphasizing the constraints based on the OBE module helps enhance the inter-independence of the latent variables and their interpretability in the images, making up for the lack of independent representation in vanilla InfoGAN.

Table 3: Alternating optimization. Alternating optimization brings a slight improvement on all the metrics.

Alternating optimization

Alternating optimization is a type of training technique. In our design, it is first adopted to alternately train P alone for orthogonal regularization and then optimize the entire network parameters. To explore its effect, we compare the models obtained by alternating optimization and by one-step optimization, as shown in Table 3. The results show that alternating optimization is beneficial for the training of our model, ensuring the orthogonality of P and enhancing the disentanglement performance.


Conclusion

We presented an approach for unsupervised disentanglement learning via embedding Orthogonal Basis Expansion on images. We show that our method achieves a good trade-off between inter-independence inference ability and target-wise interpretability, both of which are central to the disentanglement task. Experimentally, our work outperforms state-of-the-art methods on different datasets in terms of multiple disentanglement metrics. Our method does not modify the hyperparameters when migrating from a labeled dataset to an unlabeled one, but hyperparameter selection is still a key step to further improve the disentanglement performance of the model. Therefore, in future work, we will continue to study unsupervised hyperparameter adjustment and other limitations of our work, such as experiments on more difficult datasets.


A. Derivation of variational lower bound


B. Algorithm

0:  initial parameters for G, D, Q, P, training iterations, and other hyperparameters
1:  for each training iteration do
2:     Sample a minibatch of real images x, noise z, and latent variables c.
5:     Update D based on Equation (6).
6:     Compute the total loss based on Equation (6) and Equation (12).
7:     Update G, Q, P based on it.
8:     while L_ortho > ε do
9:        Update P based on Equation (8).
10:     end while
11:  end for
Algorithm 1 Inference-InfoGAN

C. Choice of orthogonal basis

In our proposed OBE module, different from training an orthogonal matrix P to obtain a set of orthogonal basis vectors, another simple approach is to directly select a meaningful and interpretable orthogonal basis. Therefore, we also experiment with the basis of the Discrete Cosine Transform (DCT). According to the Inverse Discrete Fourier Transform, an image x(m, n) of size M × N can be expanded as follows:

x(m, n) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} X(u, v)\, e^{j 2\pi (um/M + vn/N)}.  (14)

The Fourier basis is naturally orthogonal, so the above formula can be understood as expanding the image on an orthogonal basis, and X(u, v) represents the coefficients corresponding to the different basis vectors:

X(u, v) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m, n)\, e^{-j 2\pi (um/M + vn/N)}.  (15)
To simplify the calculation, we choose the DCT, which ignores the imaginary part. The DCT coefficients are modified to ensure the orthogonality of the basis, specifically as follows:

X(u, v) = \alpha(u)\, \alpha(v) \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m, n) \cos\Big[\frac{(2m+1) u \pi}{2M}\Big] \cos\Big[\frac{(2n+1) v \pi}{2N}\Big],  (16)

\alpha(0) = \sqrt{1/M}, \qquad \alpha(u) = \sqrt{2/M} \;\; (u > 0).  (17)

According to Eq. 17, the optimization of the objective function can be completed. Furthermore, we consider the physical meaning of the DCT basis. It has a specific meaning for the image: one part represents low-frequency signals, and the other part represents high-frequency signals. We assume that the main information of the image comes from z, which corresponds to the low-frequency signals, while the latent variables c supplement the details of the image corresponding to the high-frequency signals. Therefore, we choose the DCT coefficients corresponding to high frequencies and the latent vector c to calculate L_I when implementing the experiment.
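The orthonormal DCT matrix described above is standard DCT-II and is easy to construct and check; in the sketch below (the helper name and toy image are ours), its rows are mutually orthonormal, so C Cᵀ = I and an image expands as A = C X Cᵀ, exactly as with the learned basis P:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (rows indexed by frequency)."""
    k = np.arange(n)[:, None]     # frequency index u
    m = np.arange(n)[None, :]     # spatial index
    c = np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0] *= np.sqrt(1.0 / n)      # alpha(0) normalization
    c[1:] *= np.sqrt(2.0 / n)     # alpha(u > 0) normalization
    return c

C = dct_matrix(8)
X = np.arange(64.0).reshape(8, 8)  # toy "image"
A = C @ X @ C.T                    # DCT coefficients of the image
X_rec = C.T @ A @ C                # exact inverse transform
```

Because C is fixed rather than learned, no orthogonality loss is needed in this variant; the trade-off, examined in Appendix D, is that the basis cannot adapt to the data.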

D. Inference independence based on DCT

We follow the method in Appendix C to obtain the DCT basis and perform an ablation analysis on the OBE module. The results are shown in Figure 7.

Figure 7: Correlation curve between the latent variables and the OBE inference results (on the DCT basis)

Compared with adaptively training a set of orthogonal basis vectors, the independent influence of the latent variables on the OBE inference results obtained by directly selecting the DCT basis is significantly weaker. This shows that using the DCT basis to constrain the independence of latent variables while making them control the interpretable representation is a contradictory process. Even if the model can eventually be disentangled, changing a single latent dimension will still affect the other dimensions. Therefore, our proposed adaptive OBE expresses independent characteristics better than the fixed DCT basis.

