Interventional Robustness of Deep Latent Variable Models

by   Raphael Suter, et al.
Max Planck Society
ETH Zurich

The ability to learn disentangled representations that split underlying sources of variation in high dimensional, unstructured data is of central importance for data efficient and robust use of neural networks. Various approaches aiming towards this goal have been proposed recently, so validating existing work is a crucial task to guide further development. Previous validation methods focused on shared information between generative factors and learned features. The effects of rare events or cumulative influences from multiple factors on encodings, however, remain uncaptured. Our experiments show that this already becomes noticeable in a simple, noise free dataset. This is why we introduce the interventional robustness score, which provides a quantitative evaluation of robustness in learned representations with respect to interventions on generative factors and changing nuisance factors. We show how this score can be estimated from labeled observational data that may be confounded, and further provide an efficient algorithm that scales linearly in the dataset size. The benefits of our causally motivated framework are illustrated in extensive experiments.





1 Introduction

Learning deep representations in which different semantic aspects of data are structurally disentangled is of central importance for training robust machine learning models. Separating independent factors of variation could pave the way for successful transfer learning and domain adaptation (Bengio et al., 2013).

Imagine a robot learning multiple tasks by interacting with his environment. For data efficiency, the robot can learn a generic representation architecture that maps his high dimensional sensory data to a collection of general, compact features describing his surroundings. For each task, only a subset of features will be required. If the robot is instructed to grasp an object, he must know the shape and the position of the object; its color, however, is irrelevant. On the other hand, when the task is to point to all red objects, only position and color are required.

Having a disentangled representation, where each feature captures only one factor of variation, allows the robot to build separate (simple) models for each task based on only a relevant and stable subselection of these generically learned features. We argue that robustness of the learned representation is a crucial property when this is attempted in practice. Since features are selected based on their expressiveness for causal factors that are relevant for a specific task (e.g. Rojas-Carulla et al., 2018), we specifically want them not to be affected by changes in any other factor. In our example, the robot assigned with the grasping task should be able to build a model using features well describing shape and position of the object. For this model to be robust, however, these features must not be affected by changing color (or any other nuisance factor).

Towards this goal, the validation of existing disentangling representation learning algorithms is of high importance. As we will discuss in section 2, previous approaches focused on the information content of latent encodings about generative factors (e.g. based on feature importance (Eastwood and Williams, 2018) or mutual information (Ridgeway and Mozer, 2018)), where one feature is expected to capture information about at most one generative factor. In the limiting case of no shared information this will also provide robustness. However, even a small dependency on extraneous factors can lead to significant effects, for example when extreme events with small occurrence probabilities happen. In our experiments we show that this discrepancy between robustness and information based disentanglement scores can already be observed in a simple, noise free dataset. This is why we propose a causally motivated metric to quantitatively evaluate to what extent robust disentanglement is provided in a deep latent variable model, in terms of external interventions on the system and changes in nuisance factors.

In order to do so, we first introduce disentanglement as a property of a causal process (Pearl, 2009; Peters et al., 2017) responsible for the data generation, as opposed to only a characteristic of the encoding. Concretely, we call a causal process disentangled when the parents of the generated observations do not affect each other (i.e. there is no total causal effect between them). We call these parents elementary ingredients. In the example above, we view color and shape as elementary ingredients, as both can be changed without affecting each other. Still, there can be dependencies between them when for example our experimental setup is confounded by the capabilities of the 3D printers that are used to create the objects (e.g. certain shapes can only be printed in some colors).

Combining these disentangled causal processes with the subsequent encoding in a unifying probabilistic framework allows us to study interventional effects on feature representations and estimate them from observational data. This is for example of interest when benchmarking disentanglement approaches based on ground truth data or trying to evaluate robustness of a deep representation w.r.t. known nuisance factors (e.g. domain changes). In the example of robotics, knowledge about the generative factors (e.g. the color, shape, weight, etc. of an object to grasp) is often available and can even be controlled in experiments.

We start by giving an overview of previous work on finding disentangled representations and how they have been validated. In section 3 we introduce our unifying framework for the joint treatment of the disentangled causal process and its learned representation. Subsequently, we introduce our notion of interventional effects on encodings and the resulting interventional robustness score in section 4. In section 5 we show how this score can be estimated from observational data with an efficient algorithm. Section 6 provides experimental evidence on a standard disentanglement benchmark dataset for the need for a robustness based disentanglement criterion.

Our contributions:

  • We introduce a unifying probabilistic framework of disentangled causal processes and consequent encodings which allows us to study the robustness of deep latent variable models to changes in nuisance factors or interventions on the system. This leads to our validation metric, the interventional robustness score.

  • We show how this metric can be estimated from observational data and provide an efficient algorithm that scales linearly in the dataset size.

  • Motivated by this metric, we additionally present a new visualisation technique of encodings which provides an intuitive understanding of dependency structures and robustness of learned features.


Henceforth, we denote the true generative factors of high dimensional observations $X$ as $G$. For clear distinction, the latent variables learned by a deep representation model, e.g. a variational auto-encoder (VAE) (Kingma and Welling, 2014), are denoted as $Z$. We will use the notation $E(\cdot)$ to describe the encoder, which in the case of VAEs corresponds to the posterior mean of $q_\phi(z \mid x)$. Capital letters denote random variables and lower case letters stand for observations thereof. $E(X)$, being a deterministic function of a random variable, is here also considered as a random quantity. Subindices $Z_J$ for a set $J$ or $Z_j$ for a single index $j$ denote the selected components of a multivariate variable. A backslash $Z_{\setminus J}$ denotes all components except those in $J$.

2 Related Work

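In the framework of variational auto-encoders (VAEs) (Kingma and Welling, 2014) the (high dimensional) observations $x$ are modelled as being generated from some latent features $z$ with chosen prior $p(z)$ according to the probabilistic model $p_\theta(x \mid z)\,p(z)$. The generative model $p_\theta(x \mid z)$ as well as the proxy posterior $q_\phi(z \mid x)$ can be estimated using neural networks by maximizing the variational lower bound (ELBO) of $\log p(x^{(1)}, \dots, x^{(N)})$, i.e.

$$\mathcal{L}_{\mathrm{VAE}} = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(z \mid x^{(n)})}\big[\log p_\theta(x^{(n)} \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x^{(n)}) \,\|\, p(z)\big). \tag{1}$$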

This objective function a priori does not encourage much structure on the latent space (except some similarity to the chosen prior, which is usually an isotropic Gaussian). In particular, the most important ingredient of this optimization problem is the model's reconstruction capability, which is unchanged under bijective transformations of the latent space. More precisely, for any given encoder $q_\phi(z \mid x)$ and decoder $p_\theta(x \mid z)$ we can use a bijective function $\psi: z \mapsto z'$ to adapt $q_\phi$ and $p_\theta$, yielding the same reconstruction of $x$. This is why it should be possible to enforce special structure on $Z$ (by implicitly designing $\psi$) without losing information (to the extent that $q_\phi$ and $p_\theta$ still need to be parameterizable by the same neural network architecture).

Various proposals for such structure imposing regularization have been made, either with some sort of supervision (e.g. Siddharth et al., 2017; Bouchacourt et al., 2017; Liu et al., 2017; Mathieu et al., 2016; Cheung et al., 2014) or completely unsupervised (e.g. Higgins et al., 2017; Kim and Mnih, 2018; Chen et al., 2018; Kumar et al., 2018; Esmaeili et al., 2018).

These approaches were mostly based on heuristic methods, e.g. Higgins et al. (2017) proposed with $\beta$-VAE to penalize the Kullback-Leibler divergence (KL) term in the VAE objective (1) more strongly, which encourages similarity to the factorized prior distribution. Others used techniques to encourage statistical independence between the different components in $Z$, e.g. FactorVAE (Kim and Mnih, 2018) or $\beta$-TCVAE (Chen et al., 2018), similar to independent component analysis (e.g. Comon, 1994). With disentangling the inferred prior (DIP-VAE), Kumar et al. (2018) proposed encouraging factorization of $q_\phi(z) = \int q_\phi(z \mid x)\, p(x)\, dx$.
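To make the effect of the stronger KL penalty concrete, the following is a minimal NumPy sketch of a per-sample $\beta$-weighted negative ELBO. It assumes a diagonal Gaussian posterior, a standard normal prior and a unit-variance Gaussian likelihood (so the reconstruction term is a squared error up to a constant); the function names are hypothetical and not taken from any of the cited implementations.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative beta-weighted ELBO per sample (assumed unit-variance Gaussian likelihood)."""
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=-1)   # -log p(x|z) up to a constant
    kl = gaussian_kl_to_standard_normal(mu, logvar)     # penalised more strongly if beta > 1
    return recon + beta * kl

# Toy batch of two samples with a two-dimensional latent space.
x = np.array([[0.0, 1.0], [1.0, 1.0]])
x_recon = np.array([[0.0, 1.0], [0.0, 1.0]])
mu = np.array([[0.0, 0.0], [1.0, 0.0]])
logvar = np.zeros((2, 2))

kl = gaussian_kl_to_standard_normal(mu, logvar)
loss_vae = beta_vae_loss(x, x_recon, mu, logvar, beta=1.0)
loss_beta = beta_vae_loss(x, x_recon, mu, logvar, beta=4.0)
```

With `beta=1.0` this reduces to the plain VAE objective; larger values only change the loss for samples whose posterior deviates from the prior.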

A special form of structure in the latent space which has gained a lot of attention recently is referred to as disentanglement (Bengio et al., 2013). This term encompasses the understanding that each learned feature in $Z$ should represent structurally different aspects of the observed phenomena (i.e. capture different sources of variation).

Various methods to validate a learned representation for disentanglement based on known ground truth generative factors have been proposed (e.g. Eastwood and Williams, 2018; Ridgeway and Mozer, 2018; Chen et al., 2018; Kim and Mnih, 2018). While a universal definition of disentanglement is missing, the most widely accepted notion is that one feature should capture information of only one generative factor (Eastwood and Williams, 2018; Ridgeway and Mozer, 2018). This has for example been expressed as the mutual information of a single latent dimension $Z_i$ with generative factors $G_1, \dots, G_K$ (Ridgeway and Mozer, 2018), where in the ideal case each $Z_i$ has some mutual information with one generative factor $G_k$ but none with all the others. Similarly, Eastwood and Williams (2018) trained predictors (e.g. lasso or random forests) for a generative factor $G_k$ based on the representation $Z$. In a disentangled model, each dimension $Z_i$ is only useful (i.e. has high feature importance) to predict one of those factors. We discuss these two methods in further detail below, as we believe them to be the most insightful measures so far and also use them in our experiments.

Chen et al. (2018) also made use of a mutual information estimate between latent dimensions and generative factors. However, their metric does not allow for multiple latent dimensions describing one causal factor. Rather, it measures a concept which Eastwood and Williams (2018) called completeness, which means that there exists only one latent variable that captures information about a specific generative factor. This is in contrast to disentanglement, which requires that a feature is only informative for one generative factor but allows that there are multiple features that are so.

Validation without known generative factors is still an open research question and so far it is not possible to quantitatively validate disentanglement in an unsupervised way. The community has been using latent traversals (i.e. changing one latent dimension and subsequently regenerating the image) for visual inspection when supervision is not available (see e.g. Chen et al., 2018). This can be used to find physically meaningful interpretations of each dimension.

Here, we also focus on supervised validation but approach the problem from a new perspective by looking at robustness properties instead of having an information based view.

The notion of equivariance, which stands for a predictable change in the representation due to some transformations of the input images (Lenc and Vedaldi, 2015), is closely related to disentanglement. Invariance is a special case of equivariance under which certain representations should not change at all under shifts in nuisance factors (Goodfellow et al., 2009; Cohen and Welling, 2014). As Kim and Mnih (2018) argued, a disentangled representation easily allows us to extract an invariant representation by neglecting those features that capture the information about irrelevant factors. We believe that this is a key motivation for obtaining disentangled representations in the first place, since a good structure in $Z$ should allow taking subselections of features (from this generically learned collection) for specific tasks that are robust with respect to nuisance factors.

This is why our proposal for measuring disentanglement, and feature robustness in general, involves quantifying the invariance of certain representations under shifts in some generative factor while others are being kept constant. While the existing invariance literature named above focused on changes in the image space (e.g. rotations and translations of images), we are interested in robustness in terms of the generative causal process responsible for the high dimensional observations (i.e. what is being captured by an image).

In order to do so, we first offer a causally motivated, general view on disentanglement. We view $p_\theta(x \mid z)$ as an approximate model for an independent causal mechanism responsible for generating $X$, and we would like to find $Z$ as a proxy for its elementary ingredients. Similar to Parascandolo et al. (2018) we are interested in capturing causal mechanisms. However, we do not assume to have both the input and output of this process, but only a rich collection of observations from its output. From that, the elementary ingredients of the process (causes) are also of interest.

Information Based Disentanglement Metrics

Since we will make comparisons to the information based evaluation methodologies of Eastwood and Williams (2018) and Ridgeway and Mozer (2018), we give a more in-depth overview of these methods here. The validation method of Eastwood and Williams (2018) is based on training a predictor model (e.g. a random forest) which tries to estimate the true generative factors based on the latent encoding. Disentanglement is then assessed by analyzing the feature importances implicit in this regressor. Intuitively, we expect that in a disentangled representation, each dimension contains information about one single generative factor. In particular, Eastwood and Williams (2018) proceed as follows: Given a labeled dataset $\mathcal{D} = \{(x^{(n)}, g^{(n)})\}_{n=1}^{N}$ with generative factors $g^{(n)}$ and observations $x^{(n)}$ and a given encoder $E$ (to be evaluated), they first create the set of features $\{z^{(n)} = E(x^{(n)})\}_{n=1}^{N}$. Using these features as predictors, they train an individual regressor $f_j$ for each generative factor $G_j$, i.e. $\hat{g}_j = f_j(z)$. As the basis for further computations, they set up a matrix $R$ of relative importances based on these feature importance values. In particular, $R_{ij}$ denotes the relative importance of the feature $z_i$ when predicting $g_j$.

Plotting the matrix $R$ gives a good first impression of the disentanglement capabilities of an encoder. Ideally, we would want to see only one large value per row while the remaining entries should be zero. In our experimental evaluations we plot this matrix (together with similarly interpretable matrices of the other metrics) as is shown for example in Figure 6 on page 6.

To explicitly quantify this visual perspective, Eastwood and Williams (2018) summarize disentanglement as one score value which measures to what extent each latent dimension can indeed only be used to predict one generative factor (i.e. sparse rows). It is obtained by first computing the 'probabilities' of $Z_i$ being important to predict $G_j$:

$$P_{ij} = \frac{R_{ij}}{\sum_{k=1}^{K} R_{ik}}$$

and subsequently the entropy of this distribution: $H_K(P_{i\cdot}) = -\sum_{k=1}^{K} P_{ik} \log_K P_{ik}$, where $K$ is the number of generative factors. The disentanglement score of variable $Z_i$ is then defined as $D_i = 1 - H_K(P_{i\cdot})$. For example, if only one generative factor $G_j$ can be predicted with $Z_i$, i.e. $P_{ij} = 1$, we obtain $D_i = 1$. If the explanatory power spreads over all factors equally, the score is zero. Using the relative variable importance $\rho_i = \sum_j R_{ij} / \sum_{i', j} R_{i'j}$, which accounts for dead or irrelevant components in $Z$, they find an overall disentanglement score as the weighted average $\bar{D} = \sum_i \rho_i D_i$. When later plotting the full importance matrices, we also provide information about the individual feature disentanglement scores $D_i$ in the corresponding row labels. These feature-wise scores are better comparable between metrics since all of them have different heuristics to obtain the (weighted) average $\bar{D}$.
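The entropy-based score described above can be sketched in a few lines of NumPy. The sketch assumes an importance matrix with one row per latent dimension and one column per generative factor; the function name and the toy matrix are hypothetical and only serve as illustration.

```python
import numpy as np

def disentanglement_scores(R, eps=1e-12):
    """Feature-wise scores D_i, weights rho_i and their weighted average.

    R[i, j] is the importance of latent dimension i for predicting factor j.
    """
    R = np.abs(np.asarray(R, dtype=float))
    K = R.shape[1]                                   # number of generative factors
    row_sums = R.sum(axis=1, keepdims=True)
    P = R / np.maximum(row_sums, eps)                # row-wise 'probabilities'
    # Entropy with logarithm to base K, using the convention 0 * log(0) = 0.
    log_P = np.log(np.where(P > 0, P, 1.0)) / np.log(K)
    H = -(P * log_P).sum(axis=1)
    D = 1.0 - H
    rho = row_sums[:, 0] / max(R.sum(), eps)         # relative importance of each dimension
    return D, rho, float((rho * D).sum())

# Toy importance matrix: rows 0 and 2 are each tied to one factor, row 1 is spread evenly.
R_toy = np.array([[1.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0],
                  [0.0, 2.0, 0.0]])
D, rho, D_bar = disentanglement_scores(R_toy)
```

A dimension with all-zero importances receives weight zero in this sketch, so it does not influence the weighted average.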

To obtain a more complete picture of the quality of the learned code, they additionally propose the informativeness score. It tells us how much information about the generative factors is captured in the latent space and is computed as the out-of-bag prediction accuracy of the regressors $f_1, \dots, f_K$. In our evaluations in section 6 we will also provide this score, as there is often a trade-off between a disentangled structure and information being preserved.

The mutual information based metric by Ridgeway and Mozer (2018) proceeds in a similar way to Eastwood and Williams (2018). However, instead of relying on a random forest to compute the feature importances, they use an estimate of the mutual information between encodings and generative factors. In particular, they also first compute an importance matrix $R$ where the element $R_{ij}$ corresponds to the mutual information between $Z_i$ and $G_j$. We also provide plots of this matrix whenever evaluations are made (e.g. Figure 6 on page 6). Another difference to Eastwood and Williams (2018) is that Ridgeway and Mozer (2018) do not compute entropies to measure the deviation from the ideal case of having only one large value per row. Instead, they compute a normalized squared difference between each row and its idealized case where all values except the largest are set to zero. To summarize the disentanglement scores of different dimensions in a feature space they use an unweighted average.
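The row-wise comparison against an idealized template can be sketched as follows. The mutual information matrix is taken as given; the exact normalization (largest entry squared times the number of remaining factors) is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def template_deviation_scores(M, eps=1e-12):
    """Score each row of a mutual information matrix against its one-hot template.

    M[i, j] is an estimate of the mutual information between latent i and factor j.
    Returns the per-row scores and their unweighted average.
    """
    M = np.asarray(M, dtype=float)
    n_factors = M.shape[1]
    theta = M.max(axis=1)                            # largest entry per row
    template = np.zeros_like(M)
    template[np.arange(M.shape[0]), M.argmax(axis=1)] = theta
    # Squared difference to the template, normalised to lie in [0, 1] (assumed normalisation).
    delta = ((M - template) ** 2).sum(axis=1) / np.maximum(theta ** 2 * (n_factors - 1), eps)
    scores = 1.0 - delta
    return scores, float(scores.mean())

# Toy matrix: row 0 is informative about one factor only, row 1 about all factors equally.
M_toy = np.array([[0.8, 0.0, 0.0],
                  [0.5, 0.5, 0.5]])
scores, mean_score = template_deviation_scores(M_toy)
```

Because the average is unweighted, an uninformative dimension counts as much as an informative one, which is one of the heuristic differences to the weighted average of Eastwood and Williams (2018).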

3 Causal Model

We will now first talk about properties of the true underlying causal process we assume for the data generating mechanism. In a second step we discuss what consequences this has when trying to match encodings with causal factors in a deep latent variable model.

3.1 Disentangled Causal Model

As opposed to previous approaches that defined disentanglement heuristically as properties of the learned latent space, we take a step back and first introduce a notion of disentanglement on the level of the true causal mechanism (or data generation process). Consequently, we can use this definition to better understand a learned probabilistic model for latent representations and evaluate its properties.

We assume to be given a set of observations from a (potentially very high dimensional) random variable $X$. In our model, the data generating process can be described by $K$ causes of variation (generative factors), $G = [G_1, \dots, G_K]$, i.e. $G \to X$. These factors are generally assumed to be unobserved and are the object of interest when doing deep representation learning. In particular, knowledge about $G$ could be used to build lower dimensional predictive models, not relying on the (unstructured) $X$ itself. This could be classic prediction of a label $Y$ in causal direction if $G \to Y$ or anti-causal direction if $Y \to G$, also in a domain change robust fashion when we know that the domain $S$ has an impact on $X$, i.e. $S \to G \to X$.

Having these potential use cases in mind, we assume the generative factors themselves to be confounded by (multi-dimensional) $C$, which can for example include a potential label $Y$ or source $S$. Hence, the resulting causal model $C \to G \to X$ allows for statistical dependencies between latent variables $G_i$ and $G_j$, $i \neq j$, when they are both affected by a certain label, i.e. $G_i \leftarrow Y \to G_j$.

However, a crucial assumption of our model is that these latent factors should represent elementary ingredients of the causal mechanism generating $X$ (to be defined below), which can be thought of as descriptive features of $X$ that can be changed without affecting each other (i.e. there is no causal effect between them). We formulate this assumption of a disentangled causal model more precisely:

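[Disentangled Causal Model] We assume a causal model for $X$ with $K$ generative factors $G = [G_1, \dots, G_K]$, described by the independent mechanisms $p(x \mid g)$, where $G$ itself could generally be influenced by $M$ confounders $C = (C_1, \dots, C_M)$. This causal model for $X$ is called disentangled if and only if it can be described by an SCM of the form:

$$C \leftarrow N_c, \qquad G_i \leftarrow f_i(PA_i^C, N_i), \;\; PA_i^C \subset \{C_1, \dots, C_M\}, \;\; i = 1, \dots, K, \qquad X \leftarrow g(G, N_x),$$

where the noise variables $N_c, N_1, \dots, N_K, N_x$ are mutually independent. In particular, this model implies that $G_i \not\to G_j$ for all $i \neq j$.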
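As an illustration of what such a constrained SCM implies, the following NumPy sketch simulates a toy process with one confounder and two generative factors. The structural equations, coefficients and sample sizes are hypothetical choices made only for this example. It shows that the factors are correlated in observational data, while intervening on one factor leaves the other unchanged.

```python
import numpy as np

def sample_scm(n, rng, do_g2=None):
    """Sample from a toy disentangled SCM: C -> (G1, G2) -> X, no arrow between G1 and G2."""
    c = rng.normal(size=n)                            # C  <- N_c
    g1 = c + 0.5 * rng.normal(size=n)                 # G1 <- f_1(C, N_1)
    if do_g2 is None:
        g2 = -c + 0.5 * rng.normal(size=n)            # G2 <- f_2(C, N_2)
    else:
        g2 = np.full(n, float(do_g2))                 # intervention do(G2 <- value)
    noise = 0.1 * rng.normal(size=(n, 2))
    x = np.stack([g1 + g2, g1 * g2], axis=1) + noise  # X  <- g(G, N_x)
    return c, g1, g2, x

rng = np.random.default_rng(0)

# Observational data: the confounder induces a strong (negative) correlation.
_, g1, g2, _ = sample_scm(200_000, rng)
corr_obs = np.corrcoef(g1, g2)[0, 1]

# Conditioning on G2 close to 2 shifts the mean of G1 ...
g1_cond_mean = g1[np.abs(g2 - 2.0) < 0.1].mean()

# ... whereas intervening on G2 does not, since there is no causal effect G2 -> G1.
_, g1_do, _, _ = sample_scm(200_000, rng, do_g2=2.0)
g1_do_mean = g1_do.mean()
```

In this toy model the conditional mean of the first factor is clearly negative, while its post-interventional mean stays close to zero.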

When we refer to conditional probability density functions (PDFs) as independent causal mechanisms, we assume them to be induced by an underlying structural equation (SE) in the SCM (Peters et al., 2017). The corresponding graphical model is shown in Figure 1.

Figure 1: Disentangled Causal Mechanism: This graphical model encompasses our assumptions on a disentangled causal model. $C$ stands for a confounder, $G = (G_1, \dots, G_K)$ are the generative factors (or elementary ingredients) and $X$ the observed quantity. In general, there can be multiple confounders affecting a range of elementary ingredients each.

This definition encompasses our understanding of elementary ingredients $G_i$, $i = 1, \dots, K$, of the causal process. Each ingredient should work on its own and be changeable without affecting others. This is similar to the understanding of Thomas et al. (2017), who used the notion of independently controllable factors in the reinforcement learning setting. While their notion of controllability refers to possible actions by an agent, our setting is more general as it describes any causal process.

Based on this rather general view on the data generation process, we can make the following observations which we believe can inspire notions of disentanglement and deep latent variable models in general. We will also make use of these properties in the consequent discussion.

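[Properties of a Disentangled Causal Process] A disentangled causal process as introduced in Definition 3.1 fulfills the following properties:

  1. $p(x \mid g)$ describes an independent causal mechanism which should be invariant to changes in the distributions $p(g_i)$.

  2. In general, the latent causes can be dependent: $G_i \not\perp\!\!\!\perp G_j$ for $i \neq j$. Only if we condition on the confounders in the data generation are they independent: $G_i \perp\!\!\!\perp G_j \mid C$ for all $i \neq j$.

  3. Knowing what observation of $X$ we obtained renders the different latent causes dependent, i.e. $G_i \not\perp\!\!\!\perp G_j \mid X$.

  4. The latent factors $G$ already contain all information about confounders $C$ that is relevant for $X$, i.e. $I(X; G) = I(X; (G, C)) \geq I(X; C)$, where $I$ denotes the mutual information.

  5. There is no total causal effect from $G_j$ to $G_i$ for $j \neq i$. This is equivalent to saying that performing interventions on $G_j$ does not change the distribution of $G_i$, i.e. $p(g_i \mid do(G_j \leftarrow g_j^\triangle)) = p(g_i)$ for all $g_j^\triangle$ (which in general differs from $p(g_i \mid g_j^\triangle)$).

  6. The remaining components of $G$, i.e. $G_{\setminus j}$, are a valid adjustment set to estimate interventional effects from $G_j$ to $X$ based on observational data, for all $j$. That means: $p(x \mid do(G_j \leftarrow g_j^\triangle)) = \int p(x \mid g_j^\triangle, g_{\setminus j})\, p(g_{\setminus j})\, dg_{\setminus j}$.

  7. If there is no confounding, conditioning is sufficient to obtain the post interventional distribution of $X$: $p(x \mid do(G_j \leftarrow g_j^\triangle)) = p(x \mid g_j^\triangle)$.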

Property 1 directly follows from Definition 3.1 and the definition of an independent causal mechanism. Properties 2 and 3 can be read off the graphical model (Koller et al., 2009) in Figure 1, which does not contain any arrow from $G_i$ to $G_j$ for $i \neq j$ by Definition 3.1 of the constrained SCM. This is due to the fact that any distribution implied by an SCM is Markovian with respect to the corresponding graph (Peters et al., 2017, Prop. 6.31). Property 4 follows from the data processing inequality since we have $C \to G \to X$. The non-existence of a directed path from $G_j$ to $G_i$ implies that there is no total causal effect (Peters et al., 2017, Prop. 6.14). This, in turn, is equivalent to property 5 (Peters et al., 2017, Prop. 6.13). Finally, since there are no arrows between the $G_i$'s, the backdoor criterion (Peters et al., 2017, Prop. 6.41) can be applied to estimate the interventional effects in property 6. In particular, $G_{\setminus j}$ blocks all paths from $G_j$ to $X$ entering $G_j$ through the backdoor (i.e. $G_j \leftarrow \dots \to X$) but at the same time does not contain any descendants of $G_j$, since by definition $G_j \not\to G_{\setminus j}$. Property 7 also follows by using parent adjustment (Peters et al., 2017, Prop. 6.41), where in the case of no confounding $PA_j^C = \emptyset$. These properties are why the constrained SCM in Definition 3.1 is important for further estimation.

We will later use these properties to understand existing disentangling representation learning algorithms. Property 6, which circumvents the need of confounder information, will be used to estimate interventional effects on latent variable models based on observational data. In the case of most disentanglement benchmarking datasets there is no confounding present. In that case, property 7 can be used to replace interventions by conditioning.
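The difference between adjusting over the remaining factors (property 6) and plain conditioning (property 7) can be checked exactly on a small discrete example. The model below is hypothetical and chosen only for illustration: a binary confounder, two binary generative factors, and a deterministic observation $X = G_1 + 2 G_2$.

```python
import itertools

# Hypothetical toy model, for illustration only.
p_c = {0: 0.5, 1: 0.5}              # P(C = c)
p_g1_given_c = {0: 0.2, 1: 0.8}     # P(G1 = 1 | C = c)
p_g2_given_c = {0: 0.1, 1: 0.9}     # P(G2 = 1 | C = c)

def bernoulli(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def x_of(g1, g2):
    return g1 + 2 * g2              # deterministic mechanism X <- g(G)

# Observational joint p(g1, g2) with the (unobserved) confounder marginalised out.
joint = {
    (g1, g2): sum(p_c[c] * bernoulli(p_g1_given_c[c], g1) * bernoulli(p_g2_given_c[c], g2)
                  for c in p_c)
    for g1, g2 in itertools.product((0, 1), repeat=2)
}

def conditional_mean_x(g1):
    """E[X | G1 = g1]: plain conditioning, biased under confounding."""
    norm = sum(joint[(g1, g2)] for g2 in (0, 1))
    return sum(joint[(g1, g2)] * x_of(g1, g2) for g2 in (0, 1)) / norm

def adjusted_mean_x(g1):
    """E[X | do(G1 <- g1)] via adjustment over the remaining factor G2."""
    p_g2 = {g2: joint[(0, g2)] + joint[(1, g2)] for g2 in (0, 1)}
    return sum(x_of(g1, g2) * p_g2[g2] for g2 in (0, 1))

cond = conditional_mean_x(1)        # mixes in the confounded dependence on G2
adj = adjusted_mean_x(1)            # equals 1 + 2 * P(G2 = 1)
```

Note that the adjustment only needs the observed factors and never touches the confounder itself, which is what makes property 6 usable on labeled observational data.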

3.2 Disentangled Latent Variable Model

We can now understand generative models with latent variables (e.g. the decoder $p_\theta(x \mid z)$ in VAEs) as models for the causal mechanism in 1 and the inferred latent space through $q_\phi(z \mid x)$ as proxy to the generative factors $G$. Property 4 gives hope that under an adequate information bottleneck we can indeed recover information about causal parents and not the confounders. Ideally, we would hope for a one-to-one correspondence of $Z_i$ to $G_i$ for all $i = 1, \dots, K$. In some situations it might be useful to learn multiple latent dimensions for one causal factor for a more natural description, e.g. describing an angle $\theta$ as $\sin(\theta)$ and $\cos(\theta)$ (Ridgeway and Mozer, 2018). Hence, we will generally allow the encodings $Z$ to be $K'$ dimensional, where usually $K' \geq K$. $\beta$-VAE (Higgins et al., 2017) heuristically encourages factorization of $q_\phi(z \mid x)$ through penalization of the KL to its prior $p(z)$. Due to property 3, other approaches were introduced that make use of statistical independence (Kim and Mnih, 2018; Chen et al., 2018; Kumar et al., 2018). Esmaeili et al. (2018) allow dependence within groups of variables in a hierarchical model (i.e. with some form of confounding where property 2 becomes an issue) by specifically modelling groups of dependent latent encodings. As opposed to the above mentioned approaches, this requires some prior knowledge on the generative structure. We will make use of property 6 to solve the task of using observational data to evaluate deep latent variable models for disentanglement and robustness.

Figure 2: Unifying Framework of Representation Learning

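Viewing the encoding $E(X)$, a deterministic function (neural network) of random $X$, as a random quantity itself, we understand deep representation learning as the process described by the graphical model shown in Figure 2. This illustrates our unified probabilistic perspective on representation learning, which encompasses the data generating process ($G \to X$) as well as the subsequent encoding through $E(\cdot)$ ($X \to Z$). Based on this viewpoint, we define the interventional effect of a group of generative factors $G_J$ on the resulting latent space encodings $Z_L$ with proxy posterior $q_\phi(z \mid x)$ from a VAE, where $J \subseteq \{1, \dots, K\}$ and $L \subseteq \{1, \dots, K'\}$, as:

$$p(z_L \mid do(G_J \leftarrow g_J^\triangle)) := \int q_\phi(z_L \mid x)\, p(x \mid do(G_J \leftarrow g_J^\triangle))\, dx. \tag{2}$$

This definition is consistent with the above graphical model as it implies that $p(z_L \mid x, do(G_J \leftarrow g_J^\triangle)) = q_\phi(z_L \mid x)$.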

4 Interventional Robustness

Building on the definition of interventional effects on deep feature representations in Eq. (2), we now derive a robustness measure of encodings with respect to changes in certain generative factors.

Let $L \subseteq \{1, \dots, K'\}$ and $I, J \subseteq \{1, \dots, K\}$ be groups of indices in the latent space and generative space. For generality, we will henceforth talk about robustness of groups of features $Z_L$ with respect to interventions on groups of generative factors $G_J$. We believe that this general formulation, which allows disagreements between groups of latent dimensions and generative factors, provides more flexibility, for example when multiple latent dimensions are used to describe one phenomenon (Esmaeili et al., 2018) or when some sort of supervision is available through groupings in the dataset according to generative factors (Bouchacourt et al., 2017). Below, we will also discuss special cases of how these sets can be chosen.

If we assume that the encoding $Z_L$ captures information about the causal factors $G_I$ and we would like to build a predictive model that only depends on those factors, we might be interested in knowing how robust our encoding is with respect to nuisance factors $G_J$, where $I \cap J = \emptyset$. To quantify this robustness for specific realizations of $g_I$ and $g_J^\triangle$ we make the following definition: [Post Interventional Disagreement] For any given set of feature indices $L \subseteq \{1, \dots, K'\}$, $I, J \subseteq \{1, \dots, K\}$ and $g_I$, $g_J^\triangle$ we call

$$\mathrm{PIDA}(L \mid g_I, g_J^\triangle) := d\Big(\mathbb{E}\big[Z_L \mid do(G_I \leftarrow g_I)\big],\; \mathbb{E}\big[Z_L \mid do(G_I \leftarrow g_I, G_J \leftarrow g_J^\triangle)\big]\Big)$$

the post interventional disagreement ($\mathrm{PIDA}$) in $Z_L$ due to $g_J^\triangle$ given $g_I$. $d$ is an adequate distance function (e.g. the $\ell_2$-norm). $\mathrm{PIDA}(L \mid g_I, g_J^\triangle)$ now quantifies the shifts in our inferred features $Z_L$ we experience when the generative factors $G_J$ are externally changed to $g_J^\triangle$ while the generative factors that we are actually interested in capturing with $Z_L$ (i.e. $G_I$) remain at the predefined setting of $g_I$. Using expected values after intervention on the generative factors (i.e. do-notation), as opposed to regular conditioning, allows for better interpretation of the score when the factors are dependent due to confounding. In particular, the do-notation can be interpreted as setting these generative values from the outside, for example by designing a new experiment. This neglects the history that might have led to the observations in the collection phase of the observational dataset. If there was no confounding in the data collection process, this definition is equivalent to regular conditioning (see Proposition 1, property 7).

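For robustness reasons, we are interested in the worst case effect any change in nuisance parameters might have. We call this the maximal post interventional disagreement ($\mathrm{MPIDA}$):

$$\mathrm{MPIDA}(L \mid g_I, J) := \sup_{g_J^\triangle} \mathrm{PIDA}(L \mid g_I, g_J^\triangle).$$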

This measure no longer depends on a specific intervention value $g_J^\triangle$ but only on the set $J$ of factors that can be intervened on. Still, this metric is computed for a specific realization of $G_I$. This is why we weight this score according to the occurrence probabilities of $g_I$, which leads us to the expected $\mathrm{MPIDA}$:

$$\mathrm{EMPIDA}(L \mid I, J) := \mathbb{E}_{g_I}\big[\mathrm{MPIDA}(L \mid g_I, J)\big].$$

$\mathrm{EMPIDA}(L \mid I, J)$ is now an (unnormalized) measure in $[0, \infty)$ quantifying the worst-case shifts in the inferred $Z_L$ we have to expect due to changes in $G_J$ even though our generative factors of interest $G_I$ remain the same. This is for example of interest when the robot in our introductory example learns a generic feature representation of his environment from which he wants to make a subselection of features in order to perform a grasping task. For this model to work well, the generative factors $G_I$ of the object are important, whereas the factors $G_J$ are not. Now, the robot can evaluate how robust his features $Z_L$ are at performing the task requiring $G_I$ while being unaffected by $G_J$.

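As it is often useful to have a normalized score for comparisons, we propose to normalize this quantity with $\mathrm{EMPIDA}(L \mid \emptyset, \{1, \dots, K\})$, which represents the expected maximal deviation from the mean encoding of $Z_L$ without fixed generative factors. Hence, we define:

Definition (Interventional Robustness Score).

$$\mathrm{IRS}(L \mid I, J) := 1 - \frac{\mathrm{EMPIDA}(L \mid I, J)}{\mathrm{EMPIDA}(L \mid \emptyset, \{1, \dots, K\})}$$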


This score yields $1$ for perfect robustness (i.e. no harm is done by changes in $G_J$) and $0$ for no robustness (i.e. keeping the generative factors $G_I$ fixed does not decrease the expected worst case deviation from the mean of $Z_L$). IRS has a similar interpretation to an $R^2$ value in regression. Instead of measuring the captured variance, it looks at worst case deviations of inferred values.

Special Case: Disentanglement

One important special case includes the setting where $L = \{l\}$, $I = \{i\}$ and $J = \{1, \dots, K\} \setminus \{i\}$. This corresponds to the degree to which $Z_l$ is robustly isolated from any extraneous causes (assuming $Z_l$ captures $G_i$), which can be interpreted as the concept of disentanglement in the framework of Eastwood and Williams (2018). We define

$$D_l := \max_{i \in \{1, \dots, K\}} \mathrm{IRS}\left(\{l\} \mid \{i\}, \{1, \dots, K\} \setminus \{i\}\right)$$

as the disentanglement score of $Z_l$. The maximizing $i$ is interpreted as the generative factor that $Z_l$ captures predominantly. Intuitively, we have robust disentanglement when a feature $Z_l$ reliably captures information about the generative factor $G_i$, where reliable means that the inferred value is always the same when $g_i$ stays the same, regardless of what the other generative factors are doing.

In our evaluations of disentanglement, we also plot the full dependency matrix $R$ with $R_{l,i} = \mathrm{IRS}(\{l\} \mid \{i\}, \{1, \dots, K\} \setminus \{i\})$ (see for example Figure 6), next to providing the values $D_l$ and their weighted average.
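The dependency matrix and the per-feature disentanglement scores can be sketched as follows for the simplest setting. This is a minimal illustration, not the authors' implementation: it assumes a noise-free, fully crossed dataset without confounding (so interventions reduce to conditioning), scalar features compared by absolute difference, and non-constant features so that the normalizer is nonzero. The function name `dependency_matrix` is hypothetical.

```python
import numpy as np

def dependency_matrix(z, g):
    """R[l, i] = IRS({l} | {i}, all other factors), estimated by conditioning.

    z: (N, n_latents) encodings, g: (N, n_factors) discrete factor labels.
    Assumes each factor configuration occurs exactly once and no confounding.
    """
    n_lat, n_fac = z.shape[1], g.shape[1]
    # Normalizer: worst-case deviation of a single encoding from the global mean.
    norm = np.abs(z - z.mean(axis=0)).max(axis=0)
    R = np.zeros((n_lat, n_fac))
    for i in range(n_fac):
        empida = np.zeros(n_lat)
        for value in np.unique(g[:, i]):
            mask = g[:, i] == value
            # Worst-case deviation from the mean encoding while g_i is fixed.
            dev = np.abs(z[mask] - z[mask].mean(axis=0)).max(axis=0)
            empida += mask.mean() * dev  # weight by the frequency of this value
        R[:, i] = 1.0 - empida / norm
    return R

# Toy data: two factors with three values each, fully crossed.
g = np.array([[a, b] for a in range(3) for b in range(3)])
z = np.column_stack([g[:, 0], g[:, 0] + g[:, 1]]).astype(float)
R = dependency_matrix(z, g)
D = R.max(axis=1)          # disentanglement score per feature
captured = R.argmax(axis=1)  # index of the predominantly captured factor
```

In this toy example the first feature copies the first factor, while the second feature mixes both factors and therefore scores lower.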

Special Case: Domain Shift Robustness

If we understand one (or multiple) generative factors as corresponding to source domains over which we would like to generalize, we can use IRS to evaluate the robustness of a selected feature set $Z_L$ against such domain shifts. In particular, the score evaluated with $J$ set to the indices of the domain factors quantifies how robust $Z_L$ is when changes in $G_J$ occur. If we are building a model that predicts a label $Y$ based on some (to be selected) feature set $Z_L$, we can use this score to make a trade-off between robustness and predictive power. For example, we could use the best performing set of features among all those that satisfy a given robustness threshold.
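The selection rule in the last sentence can be sketched in a few lines. The function name `select_features`, the tuple layout, and all numbers below are purely illustrative assumptions; in practice the robustness and accuracy values would come from an IRS estimate and a validation run.

```python
def select_features(candidates, min_irs):
    """Pick the best-performing feature set among those that are robust enough.

    candidates: iterable of (feature_set, irs, accuracy) tuples (hypothetical layout).
    Returns the chosen feature_set, or None if no candidate meets the threshold.
    """
    feasible = [c for c in candidates if c[1] >= min_irs]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c[2])[0]

# Illustrative values only, not results from the paper.
candidates = [
    ((0, 1), 0.95, 0.81),
    ((0, 1, 2), 0.60, 0.90),
    ((1, 3), 0.92, 0.85),
]
chosen = select_features(candidates, min_irs=0.9)
```

Here the most accurate candidate is rejected because it falls below the robustness threshold.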

5 Estimation

We now derive a sampling algorithm to estimate IRS from an observational dataset $\mathcal{D} = \{(x^{(i)}, g^{(i)})\}_{i=1}^{N}$ of observations $x^{(i)}$ with generative factor labels $g^{(i)}$, with each factor $G_k$ being discrete and finite. In case of continuous $G_k$ we first need to perform a discretization. The discretization steps trade off bias and variance of the estimate through the number of samples that are available per combination of generative factors.
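A minimal sketch of such a discretization step, assuming equal-width bins (the paper does not prescribe a binning scheme, and the function name `discretize` is hypothetical):

```python
import numpy as np

def discretize(values, n_bins):
    """Map continuous factor values to integer labels 0..n_bins-1 (equal-width bins)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Use only the interior edges so that the maximum falls into the last bin.
    return np.digitize(values, edges[1:-1])

labels = discretize([0.0, 0.1, 0.5, 0.9, 1.0], n_bins=2)
counts = np.bincount(labels, minlength=2)  # samples available per bin
```

Choosing fewer bins leaves more samples per combination of factors (lower variance) at the cost of a coarser, less sensitive score (higher bias).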

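We will provide an estimation procedure for EMPIDA, which we defined in Eq. 5 as:

$$\mathrm{EMPIDA}(L \mid I, J) = \mathbb{E}_{g_I}\left[\sup_{g_J^\triangle} d\Big(\mathbb{E}\left[Z_L \mid do(G_I \leftarrow g_I)\right],\; \mathbb{E}\left[Z_L \mid do(G_I \leftarrow g_I, G_J \leftarrow g_J^\triangle)\right]\Big)\right]$$

From that, the IRS can also be computed. In Section 5.2 we provide a simplified version that is sufficient for disentanglement benchmarking based on perfectly crossed, noise-free datasets. Readers most interested in this application may skip to that part.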

The main ingredient for this estimation to work is provided by our constrained causal model (i.e. a disentangled process), which implies that the backdoor criterion can be applied, as we showed in Proposition 1. Further, we use the relation from Eq. (2) to write the conditional expected value of the encoding as:


where the elements of encoding are defined as:

It is now apparent how this formula can be used to estimate the expected value using the sample mean (or a robust alternative in case outliers are to be expected) based on a set of samples drawn from the respective post-interventional distribution, using the law of large numbers (LLN), i.e.


However, all we are given are the samples $\mathcal{D}$ drawn from the observational distribution, where the generative factors could be confounded. This is why we now provide an importance sampling based adjusted estimation of the expected value of any function of the observations after an intervention on one subset of the generative factors has occurred and while conditioning on another subset. This procedure can then be used to estimate Eq. (9) as a special case, directly from $\mathcal{D}$.

By denoting the Kronecker-delta as $\delta$ we obtain:


We can rewrite the weighting term as:

which gives us the natural interpretation that samples that would occur more often together with a certain factor configuration need to be downweighted in order to correct for the confounding effects. We can also see that in case of statistical independence between the generative factors, this reweighting is not needed and we can simply use the sample mean over the corresponding subselection of the dataset.

Since we assume the generative factors to be discrete, we can estimate these reweighting factors from observed frequencies. Even though this sampling procedure looks non-trivial, we show in Section 5.1 how it can be used to obtain an estimation algorithm for EMPIDA.
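The idea of frequency-based reweighting can be illustrated with the standard backdoor adjustment. The sketch below is an assumption-laden illustration rather than the exact estimator of this section: it takes all remaining generative factors as the adjustment set, estimates their probabilities from counts, and self-normalizes the weights. The function name `do_mean` is hypothetical, and at least one sample with the requested value must exist.

```python
import numpy as np
from collections import Counter

def do_mean(h, g, I, value):
    """Estimate E[h | do(G_I = value)] with frequency-based importance weights.

    h: (N,) function values, g: (N, K) discrete labels, I: intervened columns.
    Weight of a sample = p(rest) / p(rest | G_I = value), with rest = other factors.
    """
    h, g = np.asarray(h, dtype=float), np.asarray(g)
    rest = [k for k in range(g.shape[1]) if k not in I]
    rest_keys = [tuple(row) for row in g[:, rest]]
    sel = [i for i in range(len(g)) if tuple(g[i, I]) == tuple(value)]
    n_rest = Counter(rest_keys)                         # counts in the full dataset
    n_rest_sel = Counter(rest_keys[i] for i in sel)     # counts among selected samples
    w = np.array([(n_rest[rest_keys[i]] / len(g)) / (n_rest_sel[rest_keys[i]] / len(sel))
                  for i in sel])
    return float(np.sum(w * h[sel]) / np.sum(w))

# Confounded toy data: G_0 and G_1 mostly agree, and h depends on G_1 only.
g = np.array([[0, 0]] * 3 + [[0, 1]] + [[1, 0]] + [[1, 1]] * 3)
h = g[:, 1].astype(float)
naive = h[g[:, 0] == 0].mean()           # plain conditioning on G_0 = 0
adjusted = do_mean(h, g, [0], (0,))      # reweighted estimate
```

Samples whose remaining factors co-occur frequently with the selected value receive small weights, so the adjusted estimate no longer inherits the spurious dependence between the two factors. With independent factors all weights equal one and the plain sample mean is recovered.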

5.1 General Observational Dataset

Algorithm 1 provides a specific approach for estimating EMPIDA from a generic observational dataset.

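1: Input:
2:     dataset $\mathcal{D} = \{(x^{(i)}, g^{(i)})\}_{i=1}^{N}$
3:     trained encoder $E$
4:     subsets of factors $L$ and $I, J$
5: Preprocessing:
6:     encode all samples to obtain $\{(z^{(i)}, g^{(i)})\}_{i=1}^{N}$
7:     estimate the probabilities entering the reweighting terms from relative frequencies in $\mathcal{D}$
8: Estimation:
9:     find all realizations of $G_I$ in $\mathcal{D}$: $g_I^{(1)}, \dots, g_I^{(N_I)}$
10:     partition the dataset according to those realizations: $\mathcal{D}^{(k)}$, $k = 1, \dots, N_I$
11:     for $k = 1, \dots, N_I$ do
12:         estimate mean $\leftarrow \mathbb{E}[Z_L \mid do(G_I \leftarrow g_I^{(k)})]$ using Eq. (5) and samples $\mathcal{D}^{(k)}$
13:         partition $\mathcal{D}^{(k)}$ according to realizations of $G_J$: $\mathcal{D}^{(k,l)}$, $l = 1, \dots, N_J^{(k)}$
14:         initialize mpida$^{(k)} \leftarrow 0$
15:         for $l = 1, \dots, N_J^{(k)}$ do
16:              estimate mean_int $\leftarrow \mathbb{E}[Z_L \mid do(G_I \leftarrow g_I^{(k)}, G_J \leftarrow g_J^{(l)})]$ with Eq. (5) and samples $\mathcal{D}^{(k,l)}$
17:              compute pida $\leftarrow d(\text{mean}, \text{mean\_int})$
18:              update mpida$^{(k)} \leftarrow \max(\text{mpida}^{(k)}, \text{pida})$
19:         end for
20:     end for
21:     return the average of the mpida$^{(k)}$, weighted by the occurrence probabilities of the $g_I^{(k)}$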
Algorithm 1 EMPIDA Estimation
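The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: encodings are taken as already computed, the expected values are plain sample means (i.e. the unconfounded case, where no reweighting is needed), the distance is the Euclidean norm, and the function names `empida_estimate` and `irs_estimate` are hypothetical. Comments refer to the line numbers of Algorithm 1.

```python
import numpy as np
from collections import defaultdict

def empida_estimate(z, g, I, J):
    """z: (N, |L|) encodings of the selected features, g: (N, K) discrete labels."""
    n = len(z)
    I_idx, J_idx = np.asarray(I, dtype=int), np.asarray(J, dtype=int)
    # Lines 9, 10, 13: bucket sample indices by g_I, then by g_J, via hash tables.
    buckets = defaultdict(lambda: defaultdict(list))
    for idx in range(n):
        buckets[tuple(g[idx, I_idx])][tuple(g[idx, J_idx])].append(idx)
    empida = 0.0
    for sub in buckets.values():                          # line 11
        all_idx = [i for ids in sub.values() for i in ids]
        mean = z[all_idx].mean(axis=0)                    # line 12
        mpida = 0.0                                       # line 14
        for ids in sub.values():                          # line 15
            mean_int = z[ids].mean(axis=0)                # line 16
            pida = float(np.linalg.norm(mean - mean_int)) # line 17
            mpida = max(mpida, pida)                      # line 18
        empida += len(all_idx) / n * mpida                # frequency-weighted average
    return empida

def irs_estimate(z, g, I, J):
    """Normalize with the EMPIDA obtained when no factor is held fixed."""
    all_factors = list(range(g.shape[1]))
    return 1.0 - empida_estimate(z, g, I, J) / empida_estimate(z, g, [], all_factors)

# Toy data: two fully crossed factors; the single feature copies factor 0.
g = np.array([[a, b] for a in range(3) for b in range(3)])
z = g[:, [0]].astype(float)
robust = irs_estimate(z, g, I=[0], J=[1])   # feature ignores factor 1
fragile = irs_estimate(z, g, I=[1], J=[0])  # feature follows factor 0
```

Every sample index is visited a constant number of times, which mirrors the linear-time argument given for the algorithm. For confounded data, the two sample means would have to be replaced by the reweighted estimates described in this section.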

We further gain insight into the computational complexity of this procedure:

Computational Complexity. The estimation algorithm described in Algorithm 1 scales as $\mathcal{O}(N)$ in the dataset size $N = |\mathcal{D}|$. The encoding in line 6 requires one pass through the dataset $\mathcal{D}$. So does the estimation of the occurrence frequencies in line 7, as one can use a hash table to keep track of the number of occurrences of each possible realization. Therefore, the preprocessing steps scale with $\mathcal{O}(N)$.

Further, the partitioning of the full dataset into the subsets $\mathcal{D}^{(g_I, g_J)}$ of samples that share the same realizations of $G_I$ and $G_J$, which is done in lines 9, 10 and 13, can be done with two passes through the dataset by using hash tables: In the first pass we create buckets with $g_I$ as keys. Subsequently, we pass through all of these buckets to create subbuckets where $g_J$ is used as key. This reasoning is further illustrated in Figure 3 and leads us to the $\mathcal{O}(N)$ complexity of the partitioning.

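The remaining computational bottlenecks are the computations of the mean in line 12 and of the post-interventional mean in line 16. Using Eq. (5), each of them is a (reweighted) sum over the samples of the corresponding partition. Since we already computed the encodings as well as the reweighting terms in the preprocessing step, these summations scale as $\mathcal{O}(|\mathcal{D}^{(g_I)}|)$ and $\mathcal{O}(|\mathcal{D}^{(g_I, g_J)}|)$. As can be seen in Figure 3, it holds that $\sum_{g_I} |\mathcal{D}^{(g_I)}| = N$ as well as $\sum_{g_I} \sum_{g_J} |\mathcal{D}^{(g_I, g_J)}| = N$, which implies the total computational complexity of $\mathcal{O}(N)$.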

[Figure 3 diagram; column headers: fixed, fixed, remaining]
Figure 3: Partitioning of Dataset: In order to estimate EMPIDA we first partition the dataset according to possible realizations of $G_I$ (first column), where we assume there are $N_I$ many. This partitioning can be done in linear time by using hash tables with $g_I$ as keys. For each such partition $\mathcal{D}^{(g_I)}$ we can further split these sub-datasets according to realizations of $G_J$ to obtain $\mathcal{D}^{(g_I, g_J)}$ (illustrated as boxes in third column). We denote with $N_J^{(g_I)}$ the number of realizations of $G_J$ that occur together with $g_I$ (i.e. can be found in $\mathcal{D}^{(g_I)}$). This takes $\mathcal{O}(|\mathcal{D}^{(g_I)}|)$ time per partition or $\mathcal{O}(N)$ in total by again making use of hash tables.
Real World Considerations:

Though this estimation procedure scales linearly in the dataset size, the required number of observations for a fixed estimation quality (i.e. if the partition sizes $|\mathcal{D}^{(g_I, g_J)}|$ should stay constant) might become very large, as the number of possible combinations to consider grows exponentially in $|I|$ and $|J|$. This is why some trade-offs need to be made when comparing large sets of factors. The estimation for small sets of factors, however, usually works well.

One trade-off parameter is the discretization step of the $G_k$'s. Partitioning a factor into fewer realizations yields fewer possible combinations and hence larger sets $\mathcal{D}^{(g_I, g_J)}$. In general, the more noise we expect in the observations, the larger we want these sets to be in order to obtain stable estimates of the expected values. Also, the fewer realizations we allow for the generative factors, the smaller our dataset can be while still covering all relevant combinations. However, larger discretization steps come at the cost of a less sensitive score.

Also note that taking the supremum is in general not vulnerable to outliers in individual samples, as we compute distances of expected values. When outliers are to be expected, a robust estimate for these expected values can be used. Special care needs to be taken only when little data is available.

5.2 Crossed Dataset without Noise: Benchmarking Disentanglement

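In many benchmark datasets for disentanglement (e.g. dSprites) the observations are obtained noise free and the dataset contains all possible crossings of generative factors exactly once. This makes the estimation of the disentanglement score very efficient, as each full configuration of generative factors corresponds to exactly one observation. Furthermore, since no confounding is present, we can use conditioning to estimate the interventional effect, i.e. the do-operator can be replaced by regular conditioning, as seen in Proposition 1. In order to obtain the disentanglement score of $Z_l$, as discussed in Eq. (7), we therefore just need to compute the PIDA value:

$$\mathrm{PIDA}(l \mid g_i, g_{\setminus i}) = d\Big(\mathbb{E}\left[Z_l \mid g_i\right],\; \mathbb{E}\left[Z_l \mid g_i, g_{\setminus i}\right]\Big)$$

for all generative factors $G_i$ and realizations thereof $g_i$. Here, $\mathcal{D}^{(g_i)}$ denotes the set of observations that was generated with a particular configuration $g_i$; the expected values are estimated by sample means over such sets. We choose the maximum value w.r.t. $g_{\setminus i}$ as MPIDA and average over realizations $g_i$ to obtain: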