CausalVAE: Structured Causal Disentanglement in Variational Autoencoder

04/18/2020 ∙ by Mengyue Yang, et al. ∙ 12

Learning disentanglement aims at finding a low dimensional representation which consists of multiple explanatory and generative factors of the observational data. The framework of the variational autoencoder is commonly used to disentangle independent factors from observations. However, in real scenarios, the factors with semantic meanings are not necessarily independent. Instead, there might be an underlying causal structure due to physical laws. We thus propose a new VAE based framework named CausalVAE, which includes causal layers to transform independent factors into causal factors that correspond to causally related concepts in data. We analyze the model identifiability of CausalVAE, showing that the generative model learned from the observational data recovers the true one up to a certain degree. Experiments are conducted on various datasets, including synthetic datasets consisting of pictures with multiple causally related objects abstracted from the physical world, and the benchmark face dataset CelebA. The results show that the causal representations learned by CausalVAE are semantically interpretable and lead to better results on downstream tasks. The new framework allows causal intervention, by which we can intervene on any causal concept to generate artificial data.




1 Introduction

Unsupervised disentangled representation learning is important in various applications such as speech, object recognition, natural language processing and recommender systems (Hsu et al., 2017; Ma et al., 2019; Hsieh et al., 2018). The reason is that learning the latent representation of data helps enhance the performance of models, e.g. by improving generalizability, robustness against adversarial attacks and explainability. One of the most common frameworks for disentangled representation learning is the Variational Autoencoder (VAE), a deep generative model trained using backpropagation to disentangle the underlying explanatory factors. To achieve disentanglement via a VAE, one uses a penalty function to regularize the training of the model by reducing the gap between the distribution of the latent factors and a standard multivariate Gaussian. The latent variables are expected to be recovered if the real-world observations are generated by countable independent factors. To further enhance disentanglement, a line of methods considers minimizing the mutual information between different latent factors. For example, Higgins et al. (2017) and Burgess et al. (2018) adjust the hyperparameter to force latent codes to be independent of each other. Kim and Mnih (2018) and Chen et al. (2018) further improve the independence by reducing total correlation.

The theory of disentangled representation learning is still at an early stage. We face problems such as the lack of a formal definition of disentangled representations and the identifiability of disentanglement for generic models in unsupervised learning. To fill the gap, Higgins et al. (2018) proposed a new formalization of the alignment between the real world and the latent space, which is the first work to give a formal definition of disentanglement. Locatello et al. (2018) challenged the common settings of state-of-the-art methods, arguing that they cannot find an identifiable model without inductive bias. Although these works do consider the unreasonable aspects of disentanglement tasks, there are still unsolved problems such as the identifiability and explainability of the independent factors, or the learnability of parameters from observations.

Common disentangling methods make a general assumption that the observations of real world are generated by countable independent factors. The recovered independent factors are considered good representations of data. We challenge this assumption, as in many real world situations, meaningful factors are connected with causality.

Figure 1: A swinging pendulum.

Let us consider the example of a swinging pendulum in Fig. 1: the direction of the light and the angle of the pendulum are causes of the location and length of the shadow. We aim at learning deep representations that correspond to these four concepts. Obviously, these concepts are not independent, i.e. the direction of the light and the pendulum determine the location and the length of the shadow. There exist various kinds of causal models which could describe this causal relationship, e.g. Linear Structural Equation Models (SEM) (Shimizu et al., 2006). Existing methods for disentangled representation learning like β-VAE (Higgins et al., 2017) might not work, as they force the learned latent code to be as independent as possible. We argue that it is necessary to learn the causal representation because it allows us to perform interventions. For example, if we manage to learn latent codes corresponding to those four concepts, we can control the shape of the shadow without interrupting the generation of the light and the pendulum. This corresponds to the do-calculus (Pearl, 2009) in causality, where the system operates under the condition that certain variables are controlled by external forces.

In this paper, we develop a causal disentangled representation learning framework that recovers dependent factors by introducing a Linear SEM into the variational autoencoder framework. We enforce the structure on the learned latent code by designing a loss function that penalizes the deviation of the learned graph from a Directed Acyclic Graph (DAG). In addition, we analyze the identifiability of the proposed generative model, to guarantee that the learned disentangled codes are similar to the true ones.

To verify the effectiveness of the proposed method, we conduct experiments on datasets which consist of multiple causally related objects. We demonstrate empirically that the learned factors have semantic meanings and can be intervened on to generate artificial images that do not appear in the training data.

We highlight our contributions of this paper as follows:

  • We propose a new framework of generative model to achieve causal disentanglement learning.

  • We develop a theory on identifiability of our generative models, which guarantees that the true generative model is recoverable up to certain degree.

  • Experiments with synthetic and real world images are conducted to show that the causal representations learned by the proposed method have rich semantics and are more effective for downstream tasks.

2 Related Works & Preliminary

In this section, we first provide background knowledge on disentangled representation learning, focusing on recent state-of-the-art methods using variational autoencoders. We then review some recent advances of causality in generative models.

In the rest of the paper, we denote the latent variables by $\mathbf{z}$ with factorized density $\prod_{i=1}^{n} p(z_i)$, where $n$ is the number of latent factors, and the posterior of the latent variables given the observation $\mathbf{x}$ by $q(\mathbf{z} \mid \mathbf{x})$.

2.1 Disentanglement & Identifiability Problems

Disentanglement is a typical concept aimed at an independent factorial representation of data. The classic method for identifying intrinsic independent factors is ICA (Comon, 1994; Jutten and Karhunen, 2003). Comon (1994) proves the model identifiability of ICA in the linear case. However, the identifiability of the linear ICA model cannot be extended to non-linear settings directly. Hyvarinen and Morioka (2016) and Brakel and Bengio (2017) proposed a general identifiability result for nonlinear ICA, which links to the ideas of disentanglement under the variational autoencoder.

Figure 2: The information flow of CausalVAE. The observation $\mathbf{x}$ is the input of an encoder, and the encoder generates the latent variable $\boldsymbol{\epsilon}$, whose prior distribution is assumed to be a standard multivariate Gaussian. It is then transformed by the causal layer into the causal representation $\mathbf{z}$. The $\mathbf{z}$ are assumed to have a conditional prior distribution $p(\mathbf{z} \mid \mathbf{u})$. $\mathbf{z}$ is taken as the input of the decoder to reconstruct the observation $\mathbf{x}$.

Disentangled representation learning learns mutually independent latent factors with an encoder-decoder framework. In the process, a standard normal distribution is used as the prior of the latent code. These methods use complex neural functions to approximate the parameterized conditional probability $p_\theta(\mathbf{x} \mid \mathbf{z})$. This framework has been extended by various existing works. Those works often introduce new independence constraints on the original loss function, leading to various disentangling metrics. β-VAE (Higgins et al., 2017) proposes an adaptation framework which adjusts the weight of the KL term to balance between the independence of disentangled factors and reconstruction performance, while factor VAE (Chen et al., 2018) proposes a new framework which focuses solely on the independence of factors.

The aforementioned unsupervised algorithms do not perform well in situations which contain complex dependencies among factors, possibly because they lack inductive bias and the generative model is not identifiable (Locatello et al., 2018).

The identifiability problem in the variational autoencoder is defined as follows: if the parameters $\tilde{\theta}$ learned from data lead to a marginal distribution that equals the true one produced by $\theta$, i.e., $p_{\tilde{\theta}}(\mathbf{x}) = p_{\theta}(\mathbf{x})$, then the joint distribution also matches, $p_{\tilde{\theta}}(\mathbf{x}, \mathbf{z}) = p_{\theta}(\mathbf{x}, \mathbf{z})$. In that case the learned parameters are said to be identifiable. Khemakhem et al. (2019) prove that unsupervised variational autoencoder training results in an infinite number of distinct models inducing the same data distribution, which means that the underlying ground truth is non-identifiable via unsupervised learning. In contrast, by leveraging a few labels for supervision, one is able to recover the true model (Mathieu et al., 2018; Locatello et al., 2018). Kulkarni et al. (2015) and Locatello et al. (2019) use a few labels to guide model training and reduce parameter uncertainty. Khemakhem et al. (2019) give an identifiability result for the variational autoencoder by utilizing the theory of nonlinear ICA.

2.2 Causal Discovery from Pure Observational Data

We refer to causal representations as representations that are structured by a causal graph. Discovering the causal graph from pure observational data has attracted a large amount of attention in the past decades (Hoyer et al., 2009; Zhang and Hyvarinen, 2012; Shimizu et al., 2006). Pearl (2009) introduces a framework based on probabilistic graphical models to learn causality from data. Shimizu et al. (2006) proposed an effective method called LiNGAM to learn the causal graph, and they proved that the model is fully identifiable under the assumption that the causal relationship is linear and the noise is non-Gaussian.

Zheng et al. (2018) introduce DAG constraints for graph learning under continuous optimization (NOTEARS). Zhu and Chen (2019) and Ng et al. (2019) use an autoencoder framework to learn the causal graph from data. Suter et al. (2018) use causality theory to explain disentangled latent representations. Furthermore, Zhang and Hyvarinen (2012) use a more complex hypothesis function to represent more sophisticated cause-effect relationships between two entities.

3 Method

In this section, we present our method, starting with a new definition of latent representation, and then give a framework for disentanglement using supervision. Finally, we give a theoretical analysis of the model identifiability.

3.1 Causal Model

To formalize the causal representation framework, we consider concepts in the real world which have specific physical meanings. The concepts in observations are causally mixed according to the relationships in a causal graph.

As we mentioned, meaningful concepts are mostly not independent factors. We thus introduce the causal representation in this paper. The causal representation is a latent data representation with a joint distribution that can be described by a probabilistic graphical model, specifically a Directed Acyclic Graph (DAG). We consider linear models in the paper, i.e. Linear Structural Equation Models (SEM) on latent factors:

$$\mathbf{z} = \mathbf{A}^{\top}\mathbf{z} + \boldsymbol{\epsilon} \qquad (1)$$

where $\mathbf{z}$ is the structural representation of the concepts and $\mathbf{A}$ is the adjacency matrix of the causal graph. The independent noise $\boldsymbol{\epsilon}$ is assumed to be multivariate Gaussian. Once we are able to learn the causal representations from data, we are able to intervene on the latent codes to generate artificial data which does not appear in the training data.
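As a minimal sketch of the linear SEM above, the following Python snippet computes causal factors from independent noise by forward substitution. It assumes the convention that `A[i][j] != 0` means concept `i` is a parent of concept `j` (so that z = Aᵀz + ε) and that the nodes are indexed in topological order. The function name `solve_linear_sem`, the four-node graph and the edge weights are illustrative assumptions, not values from the paper.

```python
import random


def solve_linear_sem(A, eps):
    """Solve z = A^T z + eps for a DAG whose nodes are in topological order.

    A[i][j] != 0 means node i is a parent of node j (assumed convention).
    """
    n = len(eps)
    z = [0.0] * n
    for j in range(n):
        # z_j = sum_i A[i][j] * z_i + eps_j; all parents i < j are already computed.
        z[j] = sum(A[i][j] * z[i] for i in range(j)) + eps[j]
    return z


# Hypothetical pendulum-style graph:
# 0 = pendulum angle, 1 = light angle, 2 = shadow location, 3 = shadow length.
A = [
    [0.0, 0.0, 0.8, 0.5],
    [0.0, 0.0, 0.6, -0.7],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]  # independent Gaussian noise
z = solve_linear_sem(A, eps)
```

In this sketch the two cause nodes equal their own noise terms, while each effect node is a weighted sum of its parents plus its own noise.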

3.2 Generative Model

Our model follows the framework of VAE-based disentanglement. In addition to the encoder and the decoder structures, we introduce a causal layer to learn causal representations. The causal layer implements exactly the Linear SEM described in Eq. 1, where $\mathbf{A}$ is the parameter to learn in this layer.

Unsupervised learning of the model might be infeasible due to the identifiability issue discussed in (Locatello et al., 2018). As a result, the learnability of the causal layer is in question, and a predefined causal representation is not identifiable. To address this issue, similar to iVAE (Khemakhem et al., 2019), we use additional information associated with the true causal concepts as supervision signals. The additional observations must include information about the real concepts, such as labels or pixel-level observations. We build a causal conditional generative framework which uses the additional observations from causal concepts. We discuss the identifiability of models given additional observations later.

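We follow definitions and notation similar to iVAE (Khemakhem et al., 2019). Denote by $\mathbf{x}$ the observed variables and by $\mathbf{u}$ the additional information; $u_i$ corresponds to the $i$-th concept in the real causal system. Let $\mathbf{z}$ be the latent substantive variables with semantics and $\boldsymbol{\epsilon}$ be the latent independent variables, where $\mathbf{z} = \mathbf{A}^{\top}\mathbf{z} + \boldsymbol{\epsilon} = (\mathbf{I} - \mathbf{A}^{\top})^{-1}\boldsymbol{\epsilon}$. For simplicity, we denote $\mathbf{C} = (\mathbf{I} - \mathbf{A}^{\top})^{-1}$.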

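We now clarify the model assumptions for the generation and inference processes. Note that we regard both $\mathbf{z}$ and $\boldsymbol{\epsilon}$ as latent variables. Consider the following conditional generative model parameterized by $\theta$:

$$p_{\theta}(\mathbf{x}, \mathbf{z}, \boldsymbol{\epsilon} \mid \mathbf{u}) = p_{\theta}(\mathbf{x} \mid \mathbf{z}, \boldsymbol{\epsilon}, \mathbf{u})\, p_{\theta}(\boldsymbol{\epsilon}, \mathbf{z} \mid \mathbf{u}) \qquad (3)$$

Let $f$ denote the decoder, which is assumed to be an invertible function, and let $h$ denote the encoder. Let $\boldsymbol{\epsilon}$ be the independent noise variables and $\mathbf{z}$ the latent codes of the concepts.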

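We define the generation and inference processes as follows:

$$p_{\theta}(\mathbf{x} \mid \mathbf{z}, \boldsymbol{\epsilon}, \mathbf{u}) = p_{\theta}(\mathbf{x} \mid \mathbf{z}) \equiv p_{\boldsymbol{\xi}}(\mathbf{x} - f(\mathbf{z})), \qquad q_{\phi}(\mathbf{z}, \boldsymbol{\epsilon} \mid \mathbf{x}, \mathbf{u}) \equiv q(\mathbf{z} \mid \boldsymbol{\epsilon})\, q_{\boldsymbol{\zeta}}(\boldsymbol{\epsilon} - h(\mathbf{x}, \mathbf{u})) \qquad (4)$$

which is obtained by assuming the following decoding and encoding equations

$$\mathbf{x} = f(\mathbf{z}) + \boldsymbol{\xi}, \qquad \boldsymbol{\epsilon} = h(\mathbf{x}, \mathbf{u}) + \boldsymbol{\zeta} \qquad (5)$$

where $\boldsymbol{\xi}$ and $\boldsymbol{\zeta}$ are vectors of independent noise with probability densities $p_{\boldsymbol{\xi}}$ and $q_{\boldsymbol{\zeta}}$. When $\boldsymbol{\xi}$ and $\boldsymbol{\zeta}$ are infinitesimal, the encoder and decoder distributions can be regarded as deterministic ones.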

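We define the joint prior $p_{\theta}(\boldsymbol{\epsilon}, \mathbf{z} \mid \mathbf{u})$ for the latent variables $\mathbf{z}$ and $\boldsymbol{\epsilon}$ as

$$p_{\theta}(\boldsymbol{\epsilon}, \mathbf{z} \mid \mathbf{u}) = p_{\boldsymbol{\epsilon}}(\boldsymbol{\epsilon})\, p_{\theta}(\mathbf{z} \mid \mathbf{u}) \qquad (6)$$

where $p_{\boldsymbol{\epsilon}}(\boldsymbol{\epsilon}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ and the prior of the latent substantive variables $p_{\theta}(\mathbf{z} \mid \mathbf{u})$ is a factorized Gaussian distribution conditioned on the additional observation $\mathbf{u}$, i.e.

$$p_{\theta}(\mathbf{z} \mid \mathbf{u}) = \prod_{i} p_{\theta}(z_i \mid u_i), \qquad p_{\theta}(z_i \mid u_i) = \mathcal{N}\big(\lambda_1(u_i), \lambda_2^2(u_i)\big) \qquad (7)$$

where $\lambda$ is an arbitrary function (approximated by a neural network). In this paper, since each causal representation depends on the values of its parent nodes, we consider the case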

where denotes the parents node of

. The distribution has two sufficient statistics, the mean and variance of

, which are denoted by .

3.3 Training Method

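We apply variational Bayes to learn a tractable distribution $q_{\phi}(\boldsymbol{\epsilon}, \mathbf{z} \mid \mathbf{x}, \mathbf{u})$ to approximate the true posterior $p_{\theta}(\boldsymbol{\epsilon}, \mathbf{z} \mid \mathbf{x}, \mathbf{u})$. Given a data set $\mathcal{X}$, we obtain the empirical data distribution $q_{\mathcal{X}}(\mathbf{x}, \mathbf{u})$. The parameters $\theta$ and $\phi$ are learned by optimizing the following evidence lower bound (ELBO) on the expected data log-likelihood $\mathbb{E}_{q_{\mathcal{X}}}[\log p_{\theta}(\mathbf{x} \mid \mathbf{u})]$:

$$\mathrm{ELBO} = \mathbb{E}_{q_{\mathcal{X}}}\Big[\mathbb{E}_{q_{\phi}(\boldsymbol{\epsilon}, \mathbf{z} \mid \mathbf{x}, \mathbf{u})}\big[\log p_{\theta}(\mathbf{x} \mid \mathbf{z}, \boldsymbol{\epsilon}, \mathbf{u})\big] - \mathcal{D}\big(q_{\phi}(\boldsymbol{\epsilon}, \mathbf{z} \mid \mathbf{x}, \mathbf{u}) \,\|\, p_{\theta}(\boldsymbol{\epsilon}, \mathbf{z} \mid \mathbf{u})\big)\Big] \qquad (8)$$

where $\mathcal{D}(\cdot \,\|\, \cdot)$ denotes KL divergence.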

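Noticing the one-to-one correspondence between $\boldsymbol{\epsilon}$ and $\mathbf{z}$, we simplify the variational posterior as follows:

$$q_{\phi}(\boldsymbol{\epsilon}, \mathbf{z} \mid \mathbf{x}, \mathbf{u}) = q_{\phi}(\boldsymbol{\epsilon} \mid \mathbf{x}, \mathbf{u})\, \delta(\mathbf{z} = \mathbf{C}\boldsymbol{\epsilon}) = q_{\phi}(\mathbf{z} \mid \mathbf{x}, \mathbf{u})\, \delta(\boldsymbol{\epsilon} = \mathbf{C}^{-1}\mathbf{z}) \qquad (9)$$

Further, according to the model assumptions introduced in Section 3.2, i.e., generation process (4) and prior (7), the ELBO can be rewritten as:

$$\mathrm{ELBO} = \mathbb{E}_{q_{\mathcal{X}}}\Big[\mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}, \mathbf{u})}\big[\log p_{\theta}(\mathbf{x} \mid \mathbf{z})\big] - \mathcal{D}\big(q_{\phi}(\boldsymbol{\epsilon} \mid \mathbf{x}, \mathbf{u}) \,\|\, p_{\boldsymbol{\epsilon}}(\boldsymbol{\epsilon})\big) - \mathcal{D}\big(q_{\phi}(\mathbf{z} \mid \mathbf{x}, \mathbf{u}) \,\|\, p_{\theta}(\mathbf{z} \mid \mathbf{u})\big)\Big] \qquad (10)$$

where the third term is the key to disentangling the latent codes.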
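The text describes the conditional prior of the causal factors as a factorized Gaussian. If the approximate posterior is also taken to be a diagonal Gaussian (an assumption made here for illustration), a KL term between the two has a closed form. The sketch below computes it; the function name `kl_diag_gaussians` and the argument layout are hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        # Per-dimension closed form for two univariate Gaussians.
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


# Identical distributions give zero divergence.
same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
# A unit-variance posterior whose mean is shifted by 1 from the prior mean.
shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

The divergence grows as the posterior mean moves away from the label-conditioned prior mean, which is the sense in which such a term ties each latent dimension to its concept.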

The causal adjacency matrix $\mathbf{A}$ is constrained to represent a DAG, so we introduce an acyclicity constraint. Instead of using the traditional DAG constraint, which is combinatorial, we adopt a continuous constraint function (Zheng et al., 2018; Zhu and Chen, 2019; Ng et al., 2019; Yu et al., 2019). The function equals 0 if and only if the adjacency matrix corresponds to a directed acyclic graph (Yu et al., 2019).
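As an illustration of a continuous acyclicity function, the sketch below implements the polynomial form of Yu et al. (2019), tr[(I + αA∘A)^n] − n with α > 0, which is zero exactly when the weighted graph has no cycle. The choice α = 1/n and the helper names `matmul` and `acyclicity` are assumptions; the exact function used in Eq. 11 of the paper may differ.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(A, alpha=None):
    """Return tr[(I + alpha * A∘A)^n] - n, assumed form after Yu et al. (2019)."""
    n = len(A)
    if alpha is None:
        alpha = 1.0 / n  # assumed scaling; any alpha > 0 keeps the zero set
    # M = I + alpha * (A squared element-wise), so every entry is non-negative.
    M = [[(1.0 if i == j else 0.0) + alpha * A[i][j] ** 2 for j in range(n)]
         for i in range(n)]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(n):
        P = matmul(P, M)  # after the loop, P = M^n
    return sum(P[i][i] for i in range(n)) - n


dag = [[0.0, 1.5], [0.0, 0.0]]     # single edge 0 -> 1, no cycle
cyclic = [[0.0, 1.5], [0.7, 0.0]]  # edges 0 -> 1 and 1 -> 0 form a cycle
h_dag = acyclicity(dag)
h_cyclic = acyclicity(cyclic)
```

Because the function is a polynomial in the entries of the matrix, it can be used as a differentiable penalty, unlike a combinatorial check for cycles.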


The decoder (generator) uses the latent concept representation for reconstruction. To make the learning process smoother, we add a squared term to the constraint. Thus the optimization of the ELBO should be constrained by Eq. 11:

subject to

By the Lagrangian multiplier method, we have the new loss function


where denotes regularization hyperparameters.

4 Identifiability Analysis

In this section, we present the identifiability of our proposed model. We adopt the $\sim$-identifiability of Khemakhem et al. (2019) as follows:

Definition 1. Let $\sim$ be the binary relation on $\Theta$ defined as follows:



is an invertible matrix and

is an invertible diagonal matrix in which each elements on diagonal correspond to . we say that the model parameter is -identifiable.

By extending Theorem 1 in iVAE (Khemakhem et al., 2019), we obtain the identifiability theory of our causal generative model.

Theorem 1. Assume that the data we observe are generated according to Eqs. 3-4 and that the following assumptions hold:

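  1. The set $\{\mathbf{x} \in \mathcal{X} \mid \varphi_{\boldsymbol{\xi}}(\mathbf{x}) = 0\}$ has measure zero, where $\varphi_{\boldsymbol{\xi}}$ is the characteristic function of the density $p_{\boldsymbol{\xi}}$ defined in Eq. 5.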

  2. The Jacobian matrices of the decoder function and the encoder function are of full rank.

  3. The sufficient statistics $T_{i,j}(z_i) \neq 0$ almost everywhere for all $i$ and $j$, where $T_{i,j}(z_i)$ is the $j$-th statistic of variable $z_i$.

  4. The additional observations

. Then the parameters are -identifiable.

Sketch of proof:

Step 1: We analyze the identifiability of started by . Then we define a new invertible matrix which contains additional observation in causal system, and use it to prove that the learned is the transformation of .

Step 2: We analyze the identifiability of by replacing in step 1 with . Then we use the invertible matrix , a diagonal matrix containing to finish the proof.

More details are in Appendix.

The parameters of the true generative model are unknown during the learning process. The identifiability of the generative model is given by Theorem 1, which guarantees that the parameters learned by the hypothesis functions lie in an identifiable family.

In addition, all $z_i$ in $\mathbf{z}$ align with the additional observation $u_i$ of concept $i$, and they are expected to inherit the causal relationships of the causal system. This is why the $\mathbf{z}$ are guaranteed to be causal representations.

Next, for the causal representation learned by the causal layer parameterized by $\mathbf{A}$, we analyze the identifiability of $\mathbf{A}$.

The following corollary relates the true causal structure to the matrix learned by our model and shows that $\mathbf{A}$ is not identifiable.

Corollary 1. Suppose and are the true adjacency matrix and the adjacency matrix learned by our model, respectively. Then the following statement holds:


Or equivalently, there exists an invertible matrix such that


Intuitively, the learned in causal layer produces the

, which recovers the true one up to linear transformation.

Figure 3: Architectures of the two decoders used in experiments. (a) presents a structure in which each concept is decoded separately by one network and the results are assembled into the final output. (b) presents a structure in which all concepts are decoded by a single neural network.

We further discuss some intuitions behind identifiability. Existing works often learn latent representations in an unsupervised way. In contrast, our method is supervised and uses additional observations. This supervision brings the benefit that we can obtain an identifiability result for the model.

The identifiability of the model under supervision by additional observations is obtained through the conditional prior $p_{\theta}(\mathbf{z} \mid \mathbf{u})$ generated from $\mathbf{u}$. The conditional prior guarantees that the sufficient statistics of $\mathbf{z}$ are related to the value of $\mathbf{u}$. In other words, the values of $\mathbf{z}$ are determined by the supervision signal.

5 Experiments

In this section, we present the experimental results of our proposed method CausalVAE on several datasets. Compared with representations learned by state-of-the-art methods, the representation learned by our method performs well on both the synthetic causal image datasets and the real world face dataset CelebA.

We test our CausalVAE on two tasks. The first task is factor interventions, and the second is downstream tasks, namely image classification.

In our experiments, the structure of the decoder largely influences the results. Thus, we use two decoder designs. The first one decodes the concepts separately and sums them up as the final output, and the second one decodes all concepts using a single neural network. The structures are shown in Fig. 3.

Figure 4: The results of DO-experiments on the pendulum dataset. Panels: (a) CausalVAE-a, (b) CausalVAE-b, (c) DC-IGN-a, (d) DC-IGN-b. The first row presents the result of controlling the pendulum angle, and the following rows are the results obtained by controlling light angle, shadow length and shadow location, respectively. The bottom row is the true input image. The number of training epochs is set to 100 for all models.

Figure 5: Causal graphs of the three datasets. (a) shows the causal graph of the pendulum dataset. The concepts are pendulum angle, light angle, shadow location and shadow length. (b) shows the causal graph of the water dataset, on the concepts water height and ball size. (c) shows the causal graph of CelebA, on the concepts age, gender and beard.

Figure 6: The results of DO-experiments on the water dataset. Panels: (a) CausalVAE-a, (b) CausalVAE-b, (c) DC-IGN-a, (d) DC-IGN-b. For each experiment we randomly choose 4 results. The first row presents results of controlling ball size (cause) and the second row controls water height (effect). The bottom row is the ground truth. The number of training epochs is set to 100 for all models.

5.1 Dataset

5.1.1 Synthetic Data

We run experiments on scenarios containing causally structured entities or concepts. We run models on synthetic datasets, which include images consisting of causally related objects. A data generator is used to produce the images as model inputs. We will release our data generator soon.

Pendulum: We generate images with 3 entities (pendulum, light, shadow) which include 4 concepts (pendulum angle, light angle, shadow location, shadow length). Each picture includes a pendulum. The angles of the pendulum and the light change over time. We use the laws of projection to generate the shadows, so the shadow is influenced by the light and by the angle of the pendulum. The causal graph of the concepts is shown in Fig. 5 (a). In our experiments, we generate about 7k images (6k for training and 1k for inference), and the angles of the light and the pendulum vary within a bounded range.

Figure 7: Results of CausalVAE on CelebA. Panels: (a) Age, (b) Gender, (c) Beard. Each panel shows the result of controlling the corresponding factor.

Water: We produce artificial images consisting of a ball in a cup filled with water. There are 2 concepts (ball size, height of water bar). The height is an effect of the ball size. The causal graph is plotted in Fig. 5 (b). The dataset includes 7k images: 6k images are used for training the disentanglement model and the classifier model, and the rest of the dataset is used as the test data of the classifier.

5.1.2 Benchmark Dataset

In real world systems, cause and effect relationships commonly exist. To test our proposed method in these kinds of scenarios, we choose the benchmark dataset CelebA, which is widely used in computer vision tasks. In this dataset, there are in total 200k human images with labels on different concepts. We focus on 3 concepts (age, gender and beard) on human faces in this dataset.

5.2 Baselines

CausalVAE-unsup: CausalVAE-unsup is our method under the unsupervised setting. The architecture of the model is the same as CausalVAE, but the additional observations are not used. We adjust the loss function by removing the terms that depend on the additional observations.

β-VAE: β-VAE is a common baseline for unsupervised disentanglement works. The dimension of the latent representation is the same as that used in CausalVAE. The standard multivariate Gaussian distribution is adopted as the prior of the latent variables.

DC-IGN: This baseline is a model under the supervised setting. It generates priors of the latent variables conditional on the labels. As in the case of β-VAE, the dimension of the latent variables is set in line with our method.

5.3 Intervention experiments

Intervention experiments aim at testing whether a certain dimension of the latent code has an understandable semantic meaning. We control the value of the latent vector by the do-operation introduced before, and check the reconstructed images.

For the experiments, all images of the dataset are used to train our proposed model CausalVAE and other baselines.

5.3.1 Synthetic

For the experiments on synthetic dataset, we use different latent variable dimensions. We use 4 and 2 concepts on pendulum and water dataset, respectively. Then in all the experiments, we set the hyperparameter .

We use CausalVAE-a to represent the CausalVAE model with decoder (a), and CausalVAE-b to represent the CausalVAE model with decoder (b). The same rules apply to the DC-IGN model.

We intervene on the 4 concepts of the pendulum dataset, and the results are shown in Fig. 4. The intervention strategy consists of the following steps: 1) we learn a CausalVAE model; 2) we feed a pendulum image into the encoder and obtain the latent code; 3) we set the value of the latent dimension of the intervened concept to 0. For example, to intervene on one concept, we set the value of its latent dimension directly to 0 and keep the others unchanged; 4) we feed the modified latent code into the decoder and obtain the reconstructed image.
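The intervention step on the latent code can be sketched as follows. This is a minimal illustration that assumes one latent dimension per concept; the function name `intervene` and the example codes are hypothetical, and the encoder and decoder are left out.

```python
def intervene(latent_codes, index, value=0.0):
    """Return copies of the latent codes with one dimension fixed to `value`."""
    edited = []
    for z in latent_codes:
        z_new = list(z)       # copy so the encoder output is not modified
        z_new[index] = value  # step 3: overwrite the controlled concept
        edited.append(z_new)
    return edited


# Hypothetical latent codes for three images, one dimension per concept.
codes = [
    [0.3, -1.2, 0.7, 0.1],
    [1.1, 0.4, -0.5, 0.9],
    [-0.8, 0.2, 0.6, -0.3],
]
controlled = intervene(codes, index=0)  # fix concept 0 to the constant 0
```

After this step every image shares the same value for the controlled concept, so a successful intervention should show the same pattern for that concept across the decoded images.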

In the implementation of CausalVAE, similar to β-VAE, we adjust the KL term in the ELBO by multiplying it by a coefficient β:

The hyperparameters of CausalVAE , .
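A β-weighted objective of this kind can be sketched as below for the simplest case of a diagonal Gaussian posterior and a standard normal prior. The function names, the use of log-variances and the default `beta=4.0` are illustrative assumptions and are not the settings used in the paper.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def beta_weighted_loss(recon_loss, mu, logvar, beta=4.0):
    """Negative ELBO with the KL term scaled by beta (beta = 1 is the plain VAE)."""
    return recon_loss + beta * kl_to_standard_normal(mu, logvar)


# Hypothetical values: reconstruction loss 2.0 and a one-dimensional posterior N(1, 1).
loss = beta_weighted_loss(2.0, mu=[1.0], logvar=[0.0], beta=4.0)
```

A larger β puts more weight on matching the prior at the cost of reconstruction quality.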

Since we set the latent value to the constant 0, if we control a concept successfully, the pattern of the controlled concept in one image will be the same as in the other images in its row. For example, when we control the pendulum angle in Fig. 4(a), the first row shows that the pendulum angle is almost the same in every image. The same holds for the light angle: the lights in the different images of the second row are all in the middle of the top of the image. The other concepts in rows 3 and 4 show a similar effect.

From the results of CausalVAE with decoder (a) shown in Fig. 4(a), we find that when we control the angles of the light and the pendulum, the location and length of the shadows change correspondingly. When we control the shadow factors, however, the light and the pendulum are not affected. This behavior does not appear when we use decoder (b).

| Model | Cause labels, pendulum, decoder (a) | Cause labels, pendulum, decoder (b) | Cause labels, water, decoder (a) | Cause labels, water, decoder (b) | Effect labels, pendulum, decoder (a) | Effect labels, pendulum, decoder (b) | Effect labels, water, decoder (a) | Effect labels, water, decoder (b) |
|---|---|---|---|---|---|---|---|---|
| β-VAE | 0.6801 | 0.1905 | 0.7867 | 0.7707 | 0.6685 | 0.6679 | 0.7629 | 0.7629 |
| DC-IGN | 0.8313 | 0.7634 | 0.8570 | 0.8662 | 0.7649 | 0.8626 | 0.7710 | 0.7972 |
| CausalVAE-unsup | 0.8039 | 0.8028 | 0.8667 | 0.8496 | 0.9362 | 0.6663 | 0.7924 | 0.7990 |
| CausalVAE | 0.8658 | 0.8587 | 0.8564 | 0.8656 | 0.8952 | 0.8874 | 0.8032 | 0.8038 |

Table 1: The accuracy of classifiers on the test dataset when identifying cause labels and effect labels. The number of training epochs is 300 for pendulum and 50 for water. Experiments are repeated 5 times, and the median is reported.

For the experiments using decoder (b), controlling the two causes (pendulum angle and light angle) or the two effects (shadow length and shadow location) does not change the reconstructed images in the expected way. In addition, controlling the effect factors in the latent representation does not influence the reconstructed images. A possible reason is that decoder (b) itself may act as a physical model which infers the effect factors from the cause factors. The information contained in the effect factors is hence not useful.

We then analyze the results of DC-IGN. The intervention results are shown in Fig. 4(c) and (d). The results show that controlling the causes sometimes does not influence the effects. This is because DC-IGN does not have a causal layer to model the factors, so the learned factors are not the concepts we expect.

We also test CausalVAE on the water dataset. This scenario has two concepts. The intervention on the ball size (cause) influences the water height (effect), but the intervention on the effect does not influence the cause. We also find that the results show some fluctuations, and the control of the concepts is not as good as in the pendulum experiments. This is possibly because the two concepts are related by a bijective function (one-to-one mapping), which makes it difficult for the model to understand the causal relation between the concepts.

In the water experiments, we also find that decoder (a) performs better than decoder (b). We do not use the unsupervised method in these experiments because it does not guarantee that all the representations are well aligned with the concepts.

5.3.2 Human Face

We also run the experiments on the real world benchmark dataset CelebA. In this kind of scenario, the causal system is often complex, with heterogeneous causes and effects, and it is hard to observe all the concepts in the causal system. In these experiments, we focus on only 3 concepts (age, gender and beard). Other concepts may act as confounders in the system. Decoder (a) is used in these experiments.

We conduct our intervention experiments with the following steps: 1) we learn a CausalVAE model; 2) we feed a human picture into the encoder and obtain the latent code; 3) we vary the value of the latent dimension that corresponds to the intervened concept from -0.5 to 0.5. For example, to intervene on gender, we change the value of the gender dimension directly from -0.5 to 0.5 and keep the others unchanged; 4) we feed the modified latent code into the decoder and obtain the reconstructed picture.

Unlike in the synthetic data experiments, we do not set the value of the latent code to the constant 0 but vary it over a range of values, so that the figures show the change of the concept clearly.

Fig. 7 shows the results of CausalVAE. Panels (a), (b) and (c) show the intervention experiments on the concepts age, gender and beard, respectively. The interventions behave as expected: when we intervene on the cause concept gender, not only the appearance of gender but also the beard changes. In contrast, when we intervene on the effect concept beard, the gender in Fig. 7(c) does not change.

5.4 Downstream Task

We also use the learned representation for a downstream task on the synthetic data. In this paper, we consider image classification. We use the latent causal representation as the input of the classifier, and run experiments on the prediction of causes and effects.

80% of the dataset is used as training data and the remainder is used for testing. The cause conceptual vectors learned by our model are the inputs of a classifier which predicts either the cause labels or the effect labels. The cause labels on the pendulum dataset are produced by equally partitioning the angles from 0 to 90 degrees into 6 classes. The effect labels are constructed by dividing the original additional observations associated with the concepts into 3 classes. In the water dataset, the classifications of cause labels and effect labels are both binary. The results are shown in Table 1. They show that using the latent codes learned by CausalVAE and DC-IGN leads, in general, to better classification performance than using those learned by the other baselines. Our proposed method achieves the best accuracy in five of the eight settings in Table 1 and is within about 0.04 of the best accuracy in the remaining three. The choice of decoder does not have a significant influence on the results when our model is used. However, it has a clear influence on the results of unsupervised baseline models like CausalVAE-unsup.
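The label construction and data split described above can be sketched as follows. The helper names `angle_to_class` and `train_test_split`, the shuffling and the seed are assumptions for illustration; the paper does not state how the split was drawn.

```python
import random


def angle_to_class(angle_deg, low=0.0, high=90.0, n_classes=6):
    """Map an angle in [low, high] to one of n_classes equal-width bins."""
    width = (high - low) / n_classes
    idx = int((angle_deg - low) // width)
    # Clamp so that the upper boundary falls into the last class.
    return max(0, min(idx, n_classes - 1))


def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of the items and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


labels = [angle_to_class(a) for a in (0.0, 14.9, 15.0, 89.9, 90.0)]
train, test = train_test_split(range(10))
```

With six classes over 0 to 90 degrees, each class covers a 15 degree interval.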

6 Conclusion

In this paper, we propose a framework for latent representation learning. We argue that a causal representation is a good representation for machine learning tasks, and incorporate a causal layer to learn this representation under the framework of the variational autoencoder. We give an identifiability result for the model when additional observations are available for supervised learning. The method is tested on synthetic and real datasets, on both intervention experiments and downstream tasks. Our viewpoint is expected to bring new insights into the domain of representation learning.


References

  • P. Brakel and Y. Bengio (2017) Learning independent features with adversarial nets for non-linear ICA. arXiv preprint arXiv:1710.05050. Cited by: §2.1.
  • C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner (2018) Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599. Cited by: §1.
  • T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §1, §2.1.
  • P. Comon (1994) Independent component analysis, a new concept?. Signal Processing 36 (3), pp. 287–314. Cited by: §2.1.
  • I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner (2018) Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230. Cited by: §1.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6. Cited by: §1, §1, §2.1.
  • P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf (2009) Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pp. 689–696. Cited by: §2.2.
  • J. Hsieh, B. Liu, D. Huang, L. F. Fei-Fei, and J. C. Niebles (2018) Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems, pp. 517–526. Cited by: §1.
  • W. Hsu, Y. Zhang, and J. Glass (2017) Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, pp. 1878–1889. Cited by: §1.
  • A. Hyvarinen and H. Morioka (2016) Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, pp. 3765–3773. Cited by: §2.1.
  • C. Jutten and J. Karhunen (2003) Advances in nonlinear blind source separation. In Proc. of the 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA2003), pp. 245–256. Cited by: §2.1.
  • I. Khemakhem, D. P. Kingma, and A. Hyvärinen (2019) Variational autoencoders and nonlinear ICA: a unifying framework. CoRR abs/1907.04809. Cited by: Appendix A, Appendix A, §2.1, §3.2, §3.2, §4, §4.
  • H. Kim and A. Mnih (2018) Disentangling by factorising. arXiv preprint arXiv:1802.05983. Cited by: §1.
  • T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum (2015) Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pp. 2539–2547. Cited by: §2.1.
  • F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem (2018) Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359. Cited by: §1, §2.1, §2.1, §3.2.
  • F. Locatello, M. Tschannen, S. Bauer, G. Rätsch, B. Schölkopf, and O. Bachem (2019) Disentangling factors of variation using few labels. arXiv preprint arXiv:1905.01258. Cited by: §2.1.
  • J. Ma, C. Zhou, P. Cui, H. Yang, and W. Zhu (2019) Learning disentangled representations for recommendation. In Advances in Neural Information Processing Systems, pp. 5712–5723. Cited by: §1.
  • E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh (2018) Disentangling disentanglement in variational autoencoders. arXiv preprint arXiv:1812.02833. Cited by: §2.1.
  • I. Ng, S. Zhu, Z. Chen, and Z. Fang (2019) A graph autoencoder approach to causal structure learning. CoRR abs/1911.07420. Cited by: §2.2, §3.3.
  • J. Pearl (2009) Causality. Cambridge University Press. Cited by: §1, §2.2.
  • S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen (2006) A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7 (Oct), pp. 2003–2030. Cited by: §1, §2.2.
  • P. Sorrenson, C. Rother, and U. Köthe (2020) Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). arXiv preprint arXiv:2001.04872. Cited by: Appendix A.
  • R. Suter, D. Miladinović, B. Schölkopf, and S. Bauer (2018) Robustly disentangled causal mechanisms: validating deep representations for interventional robustness. arXiv preprint arXiv:1811.00007. Cited by: §2.2.
  • Y. Yu, J. Chen, T. Gao, and M. Yu (2019) DAG-GNN: DAG structure learning with graph neural networks. arXiv preprint arXiv:1904.10098. Cited by: Appendix B, §3.3.
  • K. Zhang and A. Hyvarinen (2012) On the identifiability of the post-nonlinear causal model. arXiv preprint arXiv:1205.2599. Cited by: §2.2.
  • X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing (2018) DAGs with no tears: continuous optimization for structure learning. In Advances in Neural Information Processing Systems, pp. 9472–9483. Cited by: §2.2, §3.3.
  • S. Zhu and Z. Chen (2019) Causal discovery with reinforcement learning. CoRR abs/1906.04477. Cited by: §2.2, §3.3.

Appendix A Proof of Theorem 1

Based on the information flow of the model, we analyze the identifiability of and . The general logic of the proof follows (Khemakhem et al., 2019).

Step 1: Identifiability of .

Assume that is equal to . For all observational pairs , let denote the Jacobian matrix of the encoder function. The following equations hold:


where . In determining the functions and , there exists a Gaussian distribution with infinitesimal variance. Then can be written as . Since assumption (1) holds, this term vanishes. In our method, the following equation then holds:


For a Gaussian distribution, can be written as follows:


where is the concept index.

Adopting the definition of the multivariate Gaussian distribution, we define


The following equations hold:


where denotes the base measure. For a Gaussian distribution, it is .

In the learning process, is restricted to be a DAG. Thus, exists and is a full-rank matrix. The terms in Eq. 21 that are not related to cancel out (Sorrenson et al., 2020).


where denotes the index of the sufficient statistics of the Gaussian distributions, indexing the mean (1) and the variance (2).

By assuming that the additional observation is different, it is guaranteed that the coefficients of the observations for different concepts are distinct. Thus, there exists an invertible matrix corresponding to the additional information :


Since the assumption that holds, is an invertible, full-rank diagonal matrix. We have:


where is an invertible matrix that corresponds to and . The definition of on the learned model follows the definition of on the ground truth.

We then adopt the definitions of (Khemakhem et al., 2019). According to Lemma 3 in (Khemakhem et al., 2019), we are able to pick a pair such that are linearly independent. We then concatenate the two points into a vector, denote the Jacobian matrix , and define on in the same manner. By differentiating Eq. 24, we get


Since assumption (2), that the Jacobian of is full rank, holds, it can be proved that both and are invertible matrices. Thus, from Eq. 25, is an invertible matrix. The details are shown in (Khemakhem et al., 2019).

Step 2: Under the assumption in Theorem 1, replace with in Eq. 17; then


Using Eq. 23 to replace the matrix in Eq. 26, we get:




In the same way as shown for Eq. 25, it can be proved that is an invertible matrix.

Eq. 24 and Eq. 28 both hold. Combining the two results supports the identifiability result in CausalVAE.

Appendix B Implementation Details

We use one NVIDIA Tesla P40 GPU for training and inference.

For the implementation of CausalVAE and the other baselines, we extend to a matrix , where is the number of concepts and is the latent dimension of each . The corresponding prior or conditional prior distributions of CausalVAE and the other baselines are adjusted accordingly (that is, we extend the multivariate Gaussian to the matrix Gaussian).

The subdimensions are set to 4 for the synthetic (pendulum, water) experiments and to 16 for the CelebA experiments. The implementation of the continuous DAG constraint follows the code of (Yu et al., 2019).
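For orientation, the continuous acyclicity constraint of Yu et al. (2019) is commonly written as h(A) = tr[(I + αA∘A)^n] − n = 0, where A∘A is the element-wise square of the weighted adjacency matrix and n is the number of nodes. The sketch below evaluates this quantity with NumPy. The function name `dag_constraint` and the default α = 1/n are assumptions for illustration, and the exact form and coefficient used in the CausalVAE code are not stated in this appendix.

```python
import numpy as np

def dag_constraint(A, alpha=None):
    """Evaluate h(A) = tr[(I + alpha * A∘A)^n] - n.

    For alpha > 0, h(A) is zero exactly when the weighted graph A has no
    directed cycles, and positive otherwise.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if alpha is None:
        alpha = 1.0 / n                      # assumed default, not from the paper
    M = np.eye(n) + alpha * (A * A)          # A * A is the element-wise square
    return np.trace(np.linalg.matrix_power(M, n)) - n

# A strictly upper-triangular matrix encodes a DAG, so the penalty is zero.
A_dag = np.array([[0.0, 0.7, 0.3],
                  [0.0, 0.0, 0.5],
                  [0.0, 0.0, 0.0]])
# A two-node cycle gives a strictly positive penalty.
A_cyc = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
h_dag = dag_constraint(A_dag)
h_cyc = dag_constraint(A_cyc)
```

In training, a quantity of this kind is typically added to the loss as a penalty or handled with an augmented Lagrangian, so that the learned adjacency matrix is pushed toward a DAG.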

B.1 DO-Experiments

In the DO-experiments, we train the model for 100 epochs on synthetic data and for 500 epochs on CelebA, and use the trained model to generate the latent codes of the representations.

B.1.1 Synthetic

We present the experiments of our proposed CausalVAE with two kinds of decoders, and the experiments of the other baselines with decoder (a). The hyperparameters are defined as:

  1. CausalVAE :

  2. CausalVAE-unsup :

  3. DC-IGN :

  4. β-VAE :

From the figure, we find that the reconstruction errors of models with decoder (a) are higher than those of models with decoder (b).

The details of the neural networks are shown in Table 2.

B.1.2 CelebA

The reconstruction errors during training are shown in Fig.LABEL:dores (c). We only present the experiments with decoder (a). The hyperparameters are:

  1. CausalVAE :

  2. CausalVAE-unsup :

  3. DC-IGN :

  4. β-VAE :

The details of the neural networks are shown in Table 3.

We also present the DO-experiments of CausalVAE and DC-IGN. Both models are trained with the face labels (age, gender and beard).

From the figures, we find that interventions on the latent variables constructed by CausalVAE generally perform better than interventions on those constructed by DC-IGN, especially on cause concepts. The intervention on age in Fig. LABEL:CausalVAEdoage is a good example.

When CausalVAE intervenes on the effect latent variable beard, the other concepts in the reconstructed images do not change. However, in the DO-experiments with DC-IGN, other concepts such as gender change even though we only intervene on the beard dimension. This shows a certain entanglement of the concepts learned by DC-IGN, and that these concepts do not follow a cause-effect relationship.
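The asymmetry between intervening on a cause and intervening on an effect can be illustrated with a toy example. The sketch below assumes a linear structural causal model of the form z = Aᵀz + ε over three concepts (age, gender, beard), with hypothetical edges age → beard and gender → beard. The edge weights, the noise values and the function names `causal_layer` and `do_intervention` are invented for illustration and are not the graph or code learned in the experiments.

```python
import numpy as np

def causal_layer(A, eps):
    """Solve z = A^T z + eps, i.e. z = (I - A^T)^{-1} eps, for a DAG A."""
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - A.T, eps)

def do_intervention(A, eps, index, value):
    """Apply do(z[index] = value): cut incoming edges, fix the value, propagate."""
    A_do = A.copy()
    A_do[:, index] = 0.0          # column `index` holds the edges into that concept
    eps_do = eps.copy()
    eps_do[index] = value         # with no parents left, z[index] equals this value
    return causal_layer(A_do, eps_do)

# Concept order: age, gender, beard. A[i, j] != 0 means concept i causes concept j.
A = np.array([[0.0, 0.0, 0.5],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
eps = np.array([1.0, -1.0, 0.2])

z = causal_layer(A, eps)                       # observational latent codes
z_do_gender = do_intervention(A, eps, 1, 1.0)  # intervene on a cause
z_do_beard = do_intervention(A, eps, 2, 2.0)   # intervene on an effect
```

In this toy model, setting gender changes the beard entry because beard is a descendant of gender, while setting beard leaves age and gender untouched, which mirrors the qualitative behaviour reported for CausalVAE above.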

B.2 Downstream Task

Here we show the loss curves during training. We use 85% of the synthetic data for training. We present experiments on the two synthetic datasets, each including the identification of cause labels and effect labels. CausalVAE achieves better accuracy than most of the baselines, which provides evidence that our proposed method learns conceptual representations.

The network designs of the classifiers are shown in Table 4.

encoder | decoder (a) | decoder (b)
4*96*96→900 fc. 1ELU | concepts × (4→300 fc. 1ELU) | concepts × (4→300 fc. 1ELU)
900→300 fc. 1ELU | concepts × (300→300 fc. 1ELU) | concepts × (300→300 fc. 1ELU)
300→2*concepts*k fc. | concepts × (300→1024 fc. 1ELU) | concepts × (300→1024 fc.)
- | concepts × (1024→4*96*96 fc.) | concepts × (1024→4*96*96 fc.)
Table 2: Network design of models trained on synthetic data.
encoder | decoder
- | concepts × (1×1 conv. 128 1LReLU (0.2), stride 1)
4×4 conv. 32 1LReLU (0.2), stride 2 | concepts × (4×4 convtranspose. 64 1LReLU (0.2), stride 1)
4×4 conv. 64 1LReLU (0.2), stride 2 | concepts × (4×4 convtranspose. 64 1LReLU (0.2), stride 2)
4×4 conv. 64 1LReLU (0.2), stride 2 | concepts × (4×4 convtranspose. 32 1LReLU (0.2), stride 2)
4×4 conv. 64 1LReLU (0.2), stride 2 | concepts × (4×4 convtranspose. 32 1LReLU (0.2), stride 2)
4×4 conv. 256 1LReLU (0.2), stride 2 | concepts × (4×4 convtranspose. 32 1LReLU (0.2), stride 2)
1×1 conv. 3, stride 1 | concepts × (4×4 convtranspose. 3, stride 2)
Table 3: Network design of models trained on CelebA.
pendulum-cause | pendulum-effect | water-cause | water-effect
4→50 fc. 1ELU | 2 × (4→32 fc. 1ELU) | 4→32 fc. 1ELU | 4→32 fc. 1ELU
32→6 fc. | 2 × (32→32 fc. 1ELU) | 32→2 fc. | 32→2 fc.
- | 2 × (32→3 fc.) | - | -
Table 4: Network designs of models for downstream tasks.