Machine learning is always challenged when transporting training knowledge to testing deployment, which is generally known as Domain Adaptation (DA). As shown in Figure 1, when the target domain D_T is drastically different (e.g., in image style) from the source domain D_S, deploying a classifier trained on D_S results in poor performance due to the large domain gap. To narrow the gap, conventional supervised DA requires a small set of labeled data in D_T [12, 49], which is expensive and sometimes impractical. Therefore, we are interested in a more practical setting: Unsupervised DA (UDA), where we can leverage the abundant unlabelled data in D_T [38, 30]. For example, when adapting an autopilot system trained in one country to another, where the street views and road signs are different, one can easily collect unlabelled street images from a camera-equipped vehicle cruise.
Existing UDA literature widely adopts the following two assumptions on the domain gap (footnote 1: a few early works also consider target shift of the label distribution, which is now mainly discussed in other settings like long-tailed classification): 1) Covariate Shift [54, 3]: P_S(X) ≠ P_T(X), where X denotes the samples, e.g., real-world vs. clip-art images; and 2) Conditional Shift [53, 32]: P_S(Y|X) ≠ P_T(Y|X), where Y denotes the labels; e.g., in the clip-art domain, the "pure background" feature extracted from X is no longer a strong visual cue for "speaker" as it is in the real-world domain. To turn "≠" into "=", almost all existing UDA solutions rely on learning invariant (or common) features across the source and target domains [54, 19, 30]. Unfortunately, due to the lack of supervision in the target domain, it is challenging to capture such domain invariance.
For example, in Figure 1, in the source domain D_S, the distribution of "Background = pure" and "Background = cluttered" is already informative for "speaker" and "bed", so a classifier may recklessly downplay the "Shape" features (or attributes), as they are not as "invariant" as "Background"; however, this does not generalize to D_T, where all clip-art images have "Background = pure", which is no longer discriminative.
A popular approach to make up for the above loss of "Shape" is to impose unsupervised reconstruction: any feature loss hurts reconstruction. However, it is well known that discrimination and reconstruction are adversarial objectives, and finding their ever-elusive trade-off parameter remains an open problem [8, 15, 24].
To systematically understand how the domain gap causes the feature loss, we propose to use the transportability theory to replace the above "shift" assumptions, which overlook the explicit role of semantic attributes. Before we introduce the theory, we first abstract the DA problem in Figure 1 into the causal graph in Figure 2. We assume that the classification task X → Y is also affected by the unobserved semantic feature U (e.g., shape and background), where U → X denotes the generation of pixel-level image samples and U → Y denotes the definition process of the semantic class. Note that these causalities have already been shown valid in their respective areas; e.g., U → X and U → Y can be image and language generation, respectively. In particular, the introduction of the domain selection variable S reveals that the fundamental reasons for the covariate shift and the conditional shift are both due to S → U.
Note that the domain-aware U is the confounder that prevents the model from learning the domain-invariant causality X → Y. It has been theoretically proven that the confounding effect cannot be eliminated by statistical learning without causal intervention. Fortunately, the transportability theory offers a principled DA solution based on the causal intervention in Figure 2:

P(Y | do(X)) = Σ_u P(Y | X, u) P*(u),   (1)

where P*(u) denotes the prior of the confounder U in the target domain.
Note that the goal of DA is achieved by causal intervention using the do-operator. To appreciate the merits of the calculus on the right-hand side of Eq. (1), we need to understand the following two points. First, P(Y | X, u) is domain-agnostic, as Y is separated from the domain selection variable S given X and U; thus it generalizes to D_T in testing even if it is trained on D_S. Second, every stratum of U is fairly adjusted subject to the target-domain prior over U, e.g., forcing the model to respect "shape", as it is the only "invariance" that distinguishes between "speaker" and "bed" under controlled "cluttered" or "pure" backgrounds. Therefore, the semantic loss is eliminated.
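To make the adjustment concrete, the following is a minimal numerical sketch of the stratify-and-reweight computation in Eq. (1); all probabilities are invented for illustration, and the binary confounder is a toy stand-in for the strata of U:

```python
# Toy illustration of the adjustment P(Y|do(X)) = sum_u P(Y|X,u) * P*(u):
# the per-stratum conditional is reweighted by the domain prior over the
# confounder. All numbers below are made up for illustration.

# P(Y="speaker" | X, U=u) for two strata of the confounder (background type)
p_y_given_x_u = {"pure": 0.9, "cluttered": 0.2}

# The domain priors over U differ across domains (the S -> U link)
p_u_source = {"pure": 0.1, "cluttered": 0.9}
p_u_target = {"pure": 0.8, "cluttered": 0.2}

def adjusted_prediction(p_y_given_x_u, p_u):
    """Stratify over the confounder and reweight by the domain prior."""
    return sum(p_y_given_x_u[u] * p_u[u] for u in p_u)

p_source = adjusted_prediction(p_y_given_x_u, p_u_source)  # 0.9*0.1 + 0.2*0.9 = 0.27
p_target = adjusted_prediction(p_y_given_x_u, p_u_target)  # 0.9*0.8 + 0.2*0.2 = 0.76
```

The same conditional P(Y | X, u) is reused in both domains; only the prior over the confounder is transported, which is exactly what makes the first factor domain-agnostic.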
However, Eq. (1) is purely declarative and far from practical, because it is still unknown how to implement the stratification and representation of the unobserved U, not to mention transporting it across domains. To this end, we propose a novel approach to make Eq. (1) practical in UDA:
Identify the number of strata of U by disentangled causal mechanisms. In practice, the number of strata of U can be large due to the combinations of multiple attributes, making Eq. (1) computationally prohibitive (e.g., requiring many network passes). Fortunately, if the attributes are disentangled, we can stratify U according to the much smaller number of disentangled attributes, as the effect of each such feature is independent of the others, e.g., when "Shape" and "Background" are disentangled. As detailed in Section 3.1, this motivates us to discover a small number of Disentangled Causal Mechanisms (DCMs) in an unsupervised fashion, each of which corresponds to a feature-specific intervention between D_S and D_T, e.g., mapping a real-world "speaker" image to its clip-art counterpart by changing "Background" from "cluttered" to "pure".
Represent U by proxy variables. Yet, DCMs do not provide vector representations of U, leaving difficulties in implementing Eq. (1) as neural networks. In Section 3.2, we show that the transformed outputs of the DCMs taking X as input can be viewed as proxies of U, which provides a theoretical guarantee that we can replace the unobserved U with observed proxy variables to make Eq. (1) computable.
Overall, instead of transporting the abstract and unobserved U as in the original transportability theory of Eq. (1), we transport the concrete and learnable disentangled causal mechanisms, which generate the observed proxies of U. Therefore, we term the approach Transporting Causal Mechanisms (TCM) for UDA. Through extensive experiments, our approach achieves state-of-the-art performance on UDA benchmarks: 70.7% on Office-Home, 90.5% on ImageCLEF-DA and 75.8% on VisDA-2017. Specifically, we show that learning disentangled mechanisms and capturing the causal effect through proxy variables are the keys to improving the performance, which validates the effectiveness of our approach.
2 Related Work
Existing UDA works fall into three categories: 1) Sample Reweighing [54, 9]. This line adopts the covariate shift assumption. It first models the source and target sample distributions. When minimizing the classification loss, each sample in D_S is assigned an importance weight, e.g., a target-like sample has a larger weight, which effectively encourages the classifier to fit the target distribution. 2) Domain Mapping [19, 37]. This approach focuses on the conditional shift assumption. It first learns a mapping function that transforms samples from D_S to D_T through unpaired image-to-image translation techniques such as CycleGAN. Then, the transformed source domain samples are used to train a classifier. 3) Invariant Feature Learning. This is the most popular approach in recent literature [14, 31, 59]. It maps the samples in D_S and D_T to a common feature space, where the source and target domain samples are indistinguishable. Some works [38, 30] adopt the covariate shift assumption and aim to minimize the difference between the feature distributions using distance measures like Maximum Mean Discrepancy [56, 33] or adversarial training to fool a domain discriminator [14, 5]. Others use the conditional shift assumption and aim to align the class-conditional distributions [32, 31]. As the target domain samples have no labels, their labels are either estimated through clustering [21, 62] or with a classifier trained on the source domain samples [50, 63].
These methods all aim to make the source and target domains alike while learning a classifier with the labelled data in the source domain, leading to the semantic loss. In fact, there are existing works that attempt to Alleviate Semantic Loss by learning to capture U in an unsupervised fashion: one line exploits the fact that U generates X (via U → X) and learns a latent variable to reconstruct X [5, 15]; the other line [7, 10] exploits S → U and aims to learn a domain-specific representation for each domain. However, this leads to an ever-elusive trade-off between the adversarial discrimination and reconstruction objectives. Our approach aims to fundamentally eliminate the semantic loss by providing a practical implementation of the transportability theory based on causal intervention.
Though Eq. (1) provides a principled solution for DA, it is impractical due to the unobserved U. To this end, the proposed TCM is a two-stage approach that makes it practical. The first stage identifies the stratification of U by discovering Disentangled Causal Mechanisms (DCMs) (Section 3.1), and the second stage represents each stratum of U with the proxy variables generated from the discovered DCMs (Section 3.2). The training and inference procedures are summarized in Algorithms 1 and 2, respectively.
3.1 Disentangled Causal Mechanisms Discovery
Physicists believe that real-world observations are the outcome of the combination of independent physical laws. In causal inference [18, 55], we denote these laws as disentangled generative factors, such as shape, color, and position. For example, in our causal graph of Figure 2, if the semantic attribute U can be disentangled into generative causal factors, each one will independently contribute to the observation. Even though the number of strata of U is large, i.e., the summation in Eq. (1) is expensive, the disentanglement can still significantly reduce this number to a small number of factors.
However, learning a disentangled representation of U without supervision is challenging. Fortunately, we can observe the outcomes of U: the source domain samples and the target domain samples generated from U. This motivates us to discover pairs of end-to-end mapping functions in an unsupervised fashion, one of each pair mapping D_S to D_T and the other mapping D_T to D_S, as shown in Figure 3 (a). Each pair corresponds to a counterfactual mapping that transforms a sample to the counterpart domain by intervening on a disentangled factor, i.e., modifying the domain shift while fixing the values of the other factors. This is essentially the definition of independent causal mechanisms, each of which corresponds to a disentangled causal factor. Hence we refer to these mapping functions as DCMs.
We first make a strong assumption that the one-to-one correspondence between the mapping pairs and the disentangled factors has already been established; we then relax this assumption for the proposed practical solution. Without loss of generality, we consider a source domain sample. For each forward mapping, we obtain a transformed sample, corresponding to the interventional outcome of changing the associated factor to a value drawn from D_T. To ensure that the interventions are disentangled, we use the Counterfactual Faithfulness theorem [2, 60], which guarantees that a mapping is a disentangled intervention if and only if its output is faithful, i.e., indistinguishable from real target-domain samples (proof in Appendix). Hence, as shown in Figure 3 (b), we use the CycleGAN loss to make the transformed samples from each forward mapping indistinguishable from the real samples in D_T. Likewise, we apply the CycleGAN loss to each backward mapping such that its transformed samples are close to real samples in D_S.
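For reference, the cycle-consistency term of the CycleGAN objective used here can be sketched in a few lines of numpy; the brightness-shift stand-ins for the two mappings are purely illustrative, not learned DCMs:

```python
import numpy as np

def cycle_consistency_loss(x, F, G):
    """L1 cycle-consistency term of the CycleGAN objective:
    mapping a sample to the counterpart domain and back should
    recover the original, i.e., ||G(F(x)) - x||_1 should be small."""
    return np.abs(G(F(x)) - x).sum()

# Stub mappings for illustration: a brightness shift and its inverse.
F = lambda x: x + 0.2   # "source -> target" (e.g., brighten)
G = lambda x: x - 0.2   # "target -> source"

x = np.array([0.1, 0.5, 0.9])
loss = cycle_consistency_loss(x, F, G)  # ~0 for a perfect inverse pair
```

Together with the adversarial term that enforces faithfulness of the generated samples, this is what constrains each pair to implement a well-behaved, invertible intervention.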
Now we relax the above one-to-one correspondence assumption. We begin with a sufficient condition (proof in Appendix): if a mapping function intervenes on its corresponding factor, it outputs the counterfactually faithful generation, i.e., the one with the smallest loss. To "guess" the one-to-one correspondence, we adopt a practical method using the necessary conclusion: if a pair attains the smallest loss, then it corresponds to the intervention on its factor. Specifically, training samples are fed into all pairs in parallel to compute the loss for each pair. Only the winning pair with the smallest loss is updated. The objective is given by:
For simplicity, the functions in the optimization objective denote their parameters. Note that this necessary conclusion is not sufficient, hence our approach has limitations. Still, this practice has been empirically justified, and we leave a more theoretically complete approach as future work.
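The winner-take-all training step described by the objective above can be sketched as follows; the per-pair losses are stubbed with constants here rather than computed by real CycleGAN pairs:

```python
def winner_take_all_step(pair_losses):
    """Feed a sample through all DCM pairs and return the index of the
    winning pair, i.e., the one with the smallest loss. Only this pair's
    parameters would receive a gradient update on this sample."""
    return min(range(len(pair_losses)), key=lambda i: pair_losses[i])

# Example: 4 candidate DCM pairs; the third fits this sample best.
losses = [1.8, 0.9, 0.4, 2.2]
winner = winner_take_all_step(losses)  # -> 2
```

Over training, this competition is what pushes each pair to specialize on one disentangled factor; a pair that never wins simply stops being updated, which also explains the behavior reported later in the ablation with many DCMs.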
After training, we obtain the DCM pairs, where the i-th pair corresponds to the i-th disentangled factor. Hence we identify the number of strata of U as the number of DCMs. Figure 4 shows an example with 4 DCMs, corresponding to "Brightness", "Exposure", "Temperature" and "Sharpness", respectively (footnote 2: our DCM learning is orthogonal to the GAN models used; one can use more advanced GAN models to discover complex mechanisms such as shape or viewpoint changes).
3.2 Representing U by Proxy Variables
While the learned DCMs identify the number of strata of U, they do not provide a vector representation of U. Hence the terms involving U in Eq. (1) are still not computable. Fortunately, U generates two observed variables as shown in Figure 5, which are called proxy variables of U and make those terms estimable. We first explain the two proxies as well as their related causal links.
The first proxy represents the DCM outputs: for a sample in D_S, it takes values from the set of transformed samples in D_T, and for a sample in D_T, from the set of transformed samples in D_S. As detailed in the Appendix, each DCM generates counterfactuals in the other domain by fixing the other factors and intervening on one factor, i.e., this proxy is generated from U, which justifies the causal link from U to it. Moreover, as the counterfactuals resemble samples in the counterpart domain, the link from this proxy to Y denotes the predictive effect from the generated counterfactuals.
The second proxy is a latent variable encoded from X by a VAE. The link between X and this proxy holds because the latent variable is trained to reconstruct X. Furthermore, the latent variable of a VAE has been shown to capture some information of the underlying U, justifying the link from U to this proxy.
Note that the networks used to obtain the two proxies are trained in an unsupervised fashion. Hence the proxies are observed in both D_S and D_T. Under this causal graph, we have a theoretically grounded solution to Eq. (1) using the theorem below (proof in Appendix).
Notice that we can instantiate Eq. (3) with the labelled data in D_S to learn the proxy function (details given below). We prove in the Appendix that the proxy function is invariant across domains. This leads to the following inference strategy.
We detail the training of the proxy function, including the representation of its inputs, the function forms of the terms in Eq. (3), and the training objective.
Representation of the Inputs. When the inputs are images, directly evaluating Eq. (5) in image space can be computationally expensive, not to mention the tendency for severe over-fitting. Therefore, we follow the common practice in UDA [31, 21] and introduce a pre-trained feature extractor backbone (e.g., ResNet). Hereinafter, samples denote their respective backbone feature vectors. Note that the latent proxy is encoded from the feature form of the sample.
Function Forms. We adopt an isotropic Gaussian distribution for the latent proxy, and implement the conditional terms in Eq. (3) with the linear models in Eq. (6): one produces logits for the classes, and the other predicts the proxy. With these function forms, we can solve the proxy function from Eq. (3) in closed form as Eq. (7), via the Moore-Penrose pseudo-inverse (derivation in Appendix).
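Since the closed-form solution relies on a pseudo-inverse, the following numpy sketch shows the operation on an overdetermined linear system; the matrix and vectors are random stand-ins, not the paper's actual terms:

```python
import numpy as np

# Solve a (possibly non-square) linear system A h = b for h via the
# Moore-Penrose pseudo-inverse, giving the least-squares solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # more equations than unknowns
h_true = rng.standard_normal(4)
b = A @ h_true                    # consistent right-hand side

h = np.linalg.pinv(A) @ b         # recovers h_true when A has full column rank
```

When the system is inconsistent, the same expression returns the minimum-norm least-squares solution, which is why the pseudo-inverse is the standard choice for such linear identification steps.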
Overall Objective. The training objective combines: the classification term, i.e., the sum of the Cross-Entropy loss and the mean squared error loss that train the two linear models (see Eq. (6)); the VAE loss; and the proxy loss, balanced by a trade-off parameter. The learnable parameters are those of the linear functions, the backbone, the VAE, and the discriminators used in the proxy loss.
VAE Loss. The VAE loss trains the VAE, which contains an encoder and a decoder. Given a feature in D_S, the loss is the sum of the reconstruction error and a KL-divergence term, where the prior is set to the isotropic Gaussian. Note that we only need to learn a VAE in D_S, as the latent proxy is domain-agnostic.
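For reference, a VAE objective of this shape, i.e., reconstruction error plus a closed-form KL term against an isotropic Gaussian prior, can be sketched as follows; the diagonal-Gaussian posterior parameterization is the usual convention rather than a detail taken from the paper:

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar):
    """Reconstruction error plus KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    using the closed form of the KL-divergence between Gaussians."""
    recon = np.sum((x - x_recon) ** 2)
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon + kl

# Sanity check: a perfect reconstruction with a standard-normal posterior
# incurs zero loss.
x = np.array([0.5, -1.0, 2.0])
loss = vae_loss(x, x, np.zeros(3), np.zeros(3))  # -> 0.0
```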
Proxy Loss. In practice, the images generated by the DCMs may contain artifacts that contaminate the feature outputs of the backbone, so that the generated features no longer resemble the features in the counterpart domain. Hence we regularize the backbone with a proxy loss that enforces feature-level resemblance between the samples transformed by the DCMs and the samples in the counterpart domain. Given a feature in D_S, a feature in D_T, and their corresponding DCM output sets (as backbone features), the loss is an adversarial objective with discriminators that return a large value for features of real samples. Through min-max adversarial training, the generated sample features become similar to the sample features in the counterpart domain and fulfill their role as proxies.
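A minimal sketch of the two sides of such a min-max game, with scalar discriminator scores standing in for real networks; the binary cross-entropy form below is one common instantiation, and the paper's exact loss may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(score_real, score_fake):
    """The discriminator should score real counterpart-domain features
    high and DCM-generated features low (binary cross-entropy)."""
    return -(np.log(sigmoid(score_real)) + np.log(1.0 - sigmoid(score_fake)))

def feature_extractor_loss(score_fake):
    """The backbone side of the game: make generated features score
    like real ones, so that they can serve as proxies."""
    return -np.log(sigmoid(score_fake))

# Illustrative scores: this discriminator separates real from fake well,
# so its own loss is small while the feature extractor's loss is large.
d_loss = discriminator_loss(score_real=3.0, score_fake=-3.0)
g_loss = feature_extractor_loss(score_fake=-3.0)
```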
| Method | I→P | P→I | I→C | C→I | C→P | P→C | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | 74.8±0.3 | 83.9±0.1 | 91.5±0.3 | 78.0±0.2 | 65.5±0.3 | 91.2±0.3 | 80.7 |
| DAN | 74.5±0.4 | 82.2±0.2 | 92.8±0.2 | 86.3±0.4 | 69.2±0.4 | 89.8±0.4 | 82.5 |
| DANN | 75.0±0.6 | 86.0±0.3 | 96.2±0.4 | 87.0±0.5 | 74.3±0.5 | 91.5±0.6 | 85.0 |
| RevGrad | 75.0±0.6 | 86.0±0.3 | 96.2±0.4 | 87.0±0.5 | 74.3±0.5 | 91.5±0.6 | 85.0 |
| MADA | 75.0±0.3 | 87.9±0.2 | 96.0±0.3 | 88.8±0.3 | 75.2±0.2 | 92.2±0.3 | 85.8 |
| CDAN | 77.7±0.3 | 90.7±0.2 | 97.7±0.3 | 91.3±0.3 | 74.2±0.2 | 94.3±0.3 | 87.7 |
| ALP | 79.6±0.3 | 92.7±0.3 | 96.7±0.1 | 92.5±0.2 | 78.9±0.2 | 96.0±0.1 | 89.4 |
| SymNets | 80.2±0.3 | 93.6±0.2 | 97.0±0.3 | 93.4±0.3 | 78.7±0.3 | 96.4±0.1 | 89.9 |
| Baseline | 77.2±0.5 | 89.7±0.2 | 96.1±0.3 | 92.1±0.4 | 75.2±0.3 | 93.5±0.4 | 87.3 |
| TCM (Ours) | 79.9±0.4 | 94.2±0.2 | 97.8±0.3 | 93.8±0.4 | 79.9±0.4 | 96.9±0.4 | 90.5 |
Accuracy (%) and the standard deviation on the ImageCLEF-DA dataset with 6 UDA tasks, where all methods are fine-tuned from ResNet-50 pre-trained on ImageNet.
We validated TCM on three standard benchmarks for visual domain adaptation:
ImageCLEF-DA is a benchmark dataset for the ImageCLEF 2014 domain adaptation challenge, which contains three domains: 1) Caltech-256 (C), 2) ImageNet ILSVRC 2012 (I) and 3) Pascal VOC 2012 (P). Each domain has 12 categories with 50 images per category. We permuted all three domains and built six transfer tasks, i.e., I→P, P→I, I→C, C→I, C→P, P→C.
Office-Home  is a very challenging dataset for UDA with 4 significantly different domains: Artistic images (A), Clipart (C), Product images (P) and Real-World images (R). It contains 15,500 images from 65 categories of everyday objects in the office and home scenes. We evaluated TCM in all 12 permutations of domain adaptation tasks.
VisDA-2017 is a challenging simulation-to-real dataset that is significantly larger than the other two datasets, with 280k images in 12 categories. It has two domains: Synthetic, with renderings of 3D models from different angles and under different lighting conditions; and Real, with natural real-world images. We followed the common protocol [31, 11] to evaluate on the Synthetic→Real task.
Evaluation Protocol. We followed the common evaluation protocol for UDA [31, 63, 11], where all labeled source samples and unlabeled target samples are used to train the model, and the average classification accuracy over three random runs is reported for each dataset. Following [63, 62], we reported the standard deviation on ImageCLEF-DA. For fair comparison, our TCM and all comparative methods used the ResNet-50 backbone pre-trained on ImageNet.
Implementation Details. We implemented each mapping function in the DCMs as an encoder-decoder network, consisting of 2 down-sampling convolutional layers, followed by 2 ResNet blocks and 2 up-sampling convolutional layers. The loss for training the DCMs consists of the adversarial loss, the cycle consistency loss, and the identity loss, following the official code of CycleGAN. The encoder and decoder networks in the VAE were each implemented with 2 fully-connected layers. The min-max objective in Eq. (8) for the proxy loss was implemented using the gradient reversal layer. The number of DCMs is a hyperparameter; we used a fixed value for all Office-Home experiments and the VisDA-2017 experiment, and conducted an ablation on the number of DCMs with ImageCLEF-DA.
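The gradient reversal layer mentioned above is conceptually tiny: identity in the forward pass, negated and scaled gradient in the backward pass. A framework-agnostic numpy sketch (in an actual implementation this would be a custom autograd op):

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer: forward is the identity; backward negates
    (and scales) the incoming gradient, so minimizing the discriminator
    loss simultaneously trains the backbone adversarially."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output

grl = GradReverse(lam=0.5)
x = np.array([1.0, -2.0])
y = grl.forward(x)                      # unchanged activations
g = grl.backward(np.array([0.3, 0.7]))  # -> [-0.15, -0.35]
```

In PyTorch, the same behavior is typically obtained with a custom `torch.autograd.Function` whose `backward` returns the negated, scaled gradient.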
Baseline. As existing domain-mapping methods either focus on the image segmentation task or only evaluate on toy settings (e.g., digits), we implemented a domain-mapping baseline. Specifically, we trained a CycleGAN that transforms samples between D_S and D_T, whose network architecture is the same as each DCM in our proposed TCM. Then we learned a classifier on the transformed samples with the standard cross-entropy loss, while using the loss in Eq. (10) to align the features of the transformed samples with the target domain features. In inference, we directly used the learned classifier.
Overall Results. As shown in Tables 1, 2 and 3, our method achieves the state-of-the-art average classification accuracy on Office-Home, ImageCLEF-DA, and VisDA-2017. Specifically, 1) ImageCLEF-DA has the smallest domain gap, where the 3 domains correspond to real-world images from 3 datasets. Our TCM outperforms existing methods on 4 out of 6 UDA tasks. 2) Office-Home has a much larger domain gap, such as between the Artistic domain (A) and the Clipart domain (C). Besides improvements in average accuracy, our method significantly outperforms existing methods on the most difficult tasks, e.g., A→C and P→C. 3) VisDA-2017 is a large-scale dataset with 280k images, where TCM performs competitively. Note that the training complexity and convergence speed of TCM are comparable with existing methods (details in Appendix). Hence overall, our TCM is generally applicable to small- to large-scale UDA tasks with various sizes of domain gaps.
Comparison with Baseline. We have three observations: 1) From Tables 1, 2 and 3, we notice that the Baseline does not perform competitively on the 3 benchmarks. This shows that the improvements from TCM are not the result of superior image generation quality from CycleGAN. 2) The Baseline accuracy in Table 2 is much lower than that of our TCM with a single DCM in Table 4 on ImageCLEF-DA. The only difference between them is that the Baseline trains a classifier directly on the transformed samples, while TCM with a single DCM uses the proxy function in Eq. (5). Hence this validates that the proxy theory in Section 3.2 has some effect in removing the confounding effect, even without stratifying U. As shown later, this is because the proxies can make up for the lost semantics. 3) On all 3 benchmarks, when using multiple DCMs, TCM significantly outperforms the Baseline, which validates our practical implementation of Eq. (5) by combining DCMs and the proxy theory to identify the stratification and representation of U.
Number of DCMs. We performed an ablation on the number of DCMs on the ImageCLEF-DA dataset. The results are shown in Table 4 and plotted in Figure 6. We have three observations: 1) The largest difference occurs between 1 and 2 DCMs. This means that accounting for 2 disentangled factors can already explain much of the domain shift due to U in ImageCLEF-DA. 2) Across all tasks, the best performance is usually achieved with a small number of DCMs. This supports the effectiveness of stratifying U according to a small number of disentangled attributes. 3) A large number of DCMs (e.g., 10) leads to a slight performance drop. In practice, we observed that when the number is large, a few DCMs seldom win any sample from Eq. (2) and hence are poorly trained. This can impact their image generation quality, leading to reduced performance. We argue that this is because the chosen number is larger than the number of factors that can be disentangled from the dataset: once some DCMs establish correspondence with all existing factors, the remaining DCMs will never win any sample, as they can never generate images that are more counterfactually faithful than an intervention on each existing factor (see the Counterfactual Faithfulness theorem in the Appendix).
t-SNE Plot. In Figure 7, we show the t-SNE plots of the features in D_S and D_T on the Office-Home A→C task after training, for both the current SOTA GVB-GD and our TCM. GVB-GD is based on invariant feature learning and focuses on making the two domains alike, while our TCM does not explicitly align them. We indeed observe a better alignment between the features of D_S and D_T in GVB-GD. However, as shown in Table 1, our method outperforms GVB-GD on A→C. This shows that the common view on the solution to UDA, i.e., making domains alike while minimizing the source risk, does not necessarily lead to the best performance. In fact, this is in line with the conclusions of recent theoretical works. Our approach is orthogonal to existing approaches, and we demonstrate the practicality of a more principled solution to UDA, i.e., establishing an appropriate causal assumption on the domain shift and solving the corresponding transport formula (i.e., Eq. (1)).
Alleviating Semantic Loss. We use Figure 8 to reveal the semantic loss problem in existing methods, and how our method alleviates it. The figure shows the CAM responses on images in the source and target domains of a task in the Office-Home dataset. We compared with two lines of existing approaches, an invariant-feature-learning method (GVB-GD) and a domain-mapping method (our Baseline), and we generated a CAM for each of them, as shown in the red dotted box. For our TCM, the prediction is a linear combination of the effects from the input image and the DCM outputs (see Eq. (7)). As the locations of objects tend to remain the same in the DCM outputs as in the input (see Figure 4), we visualize the overall CAM of our TCM in the blue box by combining the CAMs from the input and the DCM outputs, weighted by their contributions towards the softmax prediction probability. On the left, we show the CAMs in the source domain. By looking at the CAMs of GVB-GD and the Baseline, we observe that they focus on the contextual object semantics (e.g., food that commonly appears together with "fork") to distinguish "marker" and "fork", where the effect from the object shape is mostly lost. In contrast, our TCM focuses on the marker and the fork, i.e., the shape semantic is preserved in training. On the right, in the target domain, the contextual object semantic is no longer discriminative for the two classes. Hence GVB-GD and the Baseline either focus on the wrong semantic (e.g., the plate) or become confused (focusing on a large area), leading to wrong predictions. Thanks to the preserved shape semantic, our TCM focuses on the object and makes the correct prediction. This provides an intuitive explanation of how our TCM alleviates the semantic loss (more examples in Appendix).
We presented a practical implementation of the transportability theory for UDA called Transporting Causal Mechanisms (TCM), which systematically eliminates the semantic loss caused by the confounding effect. In particular, we proposed to identify the confounder stratification by discovering disentangled causal mechanisms, and to represent the unknown confounder by proxy variables. Extensive results on UDA benchmarks showed that TCM is empirically effective. It is worth highlighting that TCM is orthogonal to existing UDA solutions: instead of making the two domains similar, we aim to find how to transport and what can be transported from a causality-theoretic viewpoint of the domain shift. We believe that the key bottleneck of TCM is the quality of the causal mechanism disentanglement. Therefore, we will seek more effective unsupervised feature disentanglement methods and investigate their causal mechanisms.
The authors would like to thank all reviewers and ACs for their constructive suggestions, and specially thank Alibaba City Brain Group for the donations of GPUs. This research is partly supported by the Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University (NTU), Singapore; A*STAR under its AME YIRG Grant (Project No. A20E6c0101); and the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2 grant.
This appendix is organized as follows:
For preliminaries on structural causal models and do-calculus, we refer readers to the standard references cited in the main text.
Section A.3 provides implementation details. Specifically, in Section A.3.1, we provide the network architecture of the DCMs, the implementation of the CycleGAN loss, and the DCM training details. In Section A.3.2, we show the network architectures of our backbone, the VAE and the discriminators, together with their training details. In Section A.3.3, we provide additional experimental details.
Section A.4 shows additional generated images from our DCMs and additional CAM results.
A.1 Proof and Derivation for Section 3
In this section, we will first derive the Counterfactual Faithfulness theorem. Then we will prove the sufficient condition in Section 3.1.
A.1.1 Counterfactual Faithfulness Theorem
We first provide a brief introduction to the concepts of counterfactuals and disentanglement. Causality allows us to compute how an outcome would have changed had some variables taken different values, which is referred to as a counterfactual. In Section 3.1, we refer to each DCM as a counterfactual mapping: each mapping conceptually follows the three steps of computing counterfactuals. Given a sample, 1) in abduction, the semantic attributes are inferred from the sample through the generative process; 2) in action, one attribute is intervened on by setting it to a value drawn from the counterpart domain, while the values of the other attributes are fixed; 3) in prediction, the modified attributes are fed to the generative process to obtain the output of the DCM. More details regarding counterfactuals can be found in the cited references.
Our definition of disentanglement is based on group theory. Let there be a set of (unknown) generative factors, such as shape and background, and a set of independent causal mechanisms generating images from them. Consider a group acting on the space of factors, e.g., a group element transforms one factor (say, changing the background from "cluttered" to "pure"). When there exists a direct product decomposition of the group and of the factor space such that each subgroup acts only on its corresponding subspace, we say that each subspace is the space of a disentangled factor. A causal mechanism is disentangled when its transformation in image space corresponds to the action of such a subgroup on its subspace.
We denote the generative process as a function g from the factor space to the image space. Note that we consider g an embedding, i.e., a continuous injective function with a continuous inverse, which generally holds for convolution-based networks. Without loss of generality, we consider the mapping from D_S to D_T for the analysis below, which can be easily extended to the reverse direction. Our definition of a disentangled intervention follows the intrinsic disentanglement definition, given by:

Definition (Disentangled Intervention). A counterfactual mapping M is a disentangled intervention with respect to a factor if there exists a transformation T affecting only that factor, such that for any sample x, M(x) = g(T(g^{-1}(x))).
Then we have the following theorem:
Theorem (Counterfactual Faithfulness). A counterfactual mapping is faithful if and only if it is a disentangled intervention with respect to its corresponding factor.
Note that by definition, if the counterfactual mapping is faithful, its outputs lie in the target domain. To prove the above theorem, one direction is trivial: if the mapping is a disentangled intervention, it is by definition an endomorphism of the data manifold, so the counterfactual mapping must be faithful. For the other direction, let us assume a faithful counterfactual mapping. Given that the generative process is an embedding, the counterfactual mapping can be decomposed as the composition of the inverse generative process, a transformation affecting only the intervened factor, and the generative process. For any sample, the counterfactual outcome can be decomposed in the same way. Since the transformation acts in the factor space and affects only the intervened factor, a faithful counterfactual mapping is a disentangled intervention with respect to that factor, which completes the proof.
With this theorem, faithfulness is equivalent to a disentangled intervention. In Section 3.1, we train with the CycleGAN loss to ensure faithfulness for every sample in D_S, hence encouraging each mapping to be a disentangled intervention. Note that the above analysis easily generalizes to the reverse mapping.
A.1.2 Sufficient Condition
We will prove the following sufficient condition: if a mapping function intervenes on its corresponding factor, it outputs the counterfactually faithful generation, i.e., the one with the smallest loss.
Without loss of generality, we prove the condition for the forward mapping, which can be extended to the reverse direction. For a sample in D_S, we modify its attributes by changing the intervened factor to a value drawn from D_T, and consider the sample carrying the modified attributes. Given that the mapping intervenes on this factor, its output corresponds to the counterfactual outcome when the factor is set to the new value through intervention. By the counterfactual consistency rule, the output equals the sample with the modified attributes. As that sample is faithful, by the Counterfactual Faithfulness theorem the output of the mapping function is also faithful, i.e., it attains the smallest loss.
A.2 Proof and Derivation for Section 4
In this section, we will first derive the Proxy Function theorem and the domain-agnostic nature of the proxy function, and then derive Eq. (7) under our chosen function forms in Section 4.
A.2.1 Proxy Function Theorem
We will derive the result for the general case where the proxy is any continuous variable. We will assume that the confounder satisfies the completeness condition, which accommodates most commonly used parametric and semi-parametric models, such as exponential families.
Given that the proxy function solves Eq. (3), we have:
From the law of total probability, we have:
which proves the Proxy Function theorem.
From Eq. (16), we have:
Hence, from the completeness condition, the equality holds, and we conclude that the proxy function is domain-agnostic.
Note that in Section 4, our proxy is a continuous random variable taking values from the DCM outputs for a sample in D_S, or from the DCM outputs for a sample in D_T. This is a special case of the analysis above, with the probability mass of the proxy centered around the set of its possible values.
A.2.2 Derivation of Eq. (7)
We derive Eq. (7) as a corollary to the proxy identification result above. The goal is to solve for the proxy function under the function form in Eq. (6) from the formula below:
For simplicity, we define a standard multivariate Gaussian function. Under Gaussian assumptions on the terms, we have:
Our function forms for the two conditional terms are stated with the variance terms omitted in the main text for brevity, as the final results only depend on the means. Specifically, the covariance matrix is symmetric, with an eigendecomposition into a full-rank matrix of eigenvectors and a diagonal matrix of eigenvalues. Using the corresponding change of variables, we can rewrite the Gaussian term with auxiliary variables defined by this decomposition.
Now we can solve for the proxy function from the resulting equation. Specifically, we introduce the Fourier transforms of the two functions above, respectively: