Transporting Causal Mechanisms for Unsupervised Domain Adaptation

07/23/2021 · Zhongqi Yue, et al. · Nanyang Technological University, Singapore Management University

Existing Unsupervised Domain Adaptation (UDA) literature adopts the covariate shift and conditional shift assumptions, which essentially encourage models to learn common features across domains. However, due to the lack of supervision in the target domain, they suffer from the semantic loss: the features inevitably lose the semantics that are non-discriminative in the source domain but discriminative in the target domain. We use a causal view, the transportability theory, to identify that such loss is in fact a confounding effect, which can only be removed by causal intervention. However, the theoretical solution provided by transportability is far from practical for UDA, because it requires the stratification and representation of the unobserved confounder that is the cause of the domain gap. To this end, we propose a practical solution: Transporting Causal Mechanisms (TCM), to identify the confounder strata and representations by using the domain-invariant disentangled causal mechanisms, which are discovered in an unsupervised fashion. Our TCM is both theoretically and empirically grounded. Extensive experiments show that TCM achieves state-of-the-art performance on three challenging UDA benchmarks: ImageCLEF-DA, Office-Home, and VisDA-2017. Code is available in the Appendix.


1 Introduction

Machine learning is always challenged when transporting training knowledge to testing deployment, which is generally known as Domain Adaptation (DA) [38]. As shown in Figure 1, when the target domain $\mathcal{T}$ is drastically different (e.g., in image style) from the source domain $\mathcal{S}$, deploying a classifier trained in $\mathcal{S}$ results in poor performance due to the large domain gap [4]. To narrow the gap, conventional supervised DA requires a small set of labeled data in $\mathcal{T}$ [12, 49], which is expensive and sometimes impractical. Therefore, we are interested in a more practical setting: Unsupervised DA (UDA), where we can leverage the abundant unlabelled data in $\mathcal{T}$ [38, 30]. For example, when adapting an autopilot system trained in one country to another, where the street views and road signs are different, one can easily collect unlabelled street images from a camera-equipped vehicle cruise.

Figure 1: A DA example from source “Real World” to target “Clipart” domain in the Office-Home benchmark [57].

Existing UDA literature widely adopts the following two assumptions on the domain gap¹: 1) Covariate Shift [54, 3]: $P_\mathcal{S}(X) \neq P_\mathcal{T}(X)$, where $X$ denotes the samples, e.g., real-world vs. clip-art images; and 2) Conditional Shift [53, 32]: $P_\mathcal{S}(Y \mid X) \neq P_\mathcal{T}(Y \mid X)$, where $Y$ denotes the labels, e.g., in the clip-art domain, the "pure background" feature extracted from $X$ is no longer a strong visual cue for "speaker" as it is in the real-world domain. To turn "≠" into "=", almost all existing UDA solutions rely on learning invariant (or common) features in both source and target domains [54, 19, 30]. Unfortunately, due to the lack of supervision in the target domain, it is challenging to capture such domain invariance.

¹A few early works [61] also consider the target shift $P_\mathcal{S}(Y) \neq P_\mathcal{T}(Y)$, which is now mainly discussed in other settings like long-tailed classification [28].

For example, in Figure 1, when training in $\mathcal{S}$, where the distribution of "Background = pure" vs. "Background = cluttered" is already informative for "speaker" and "bed", a classifier may recklessly tend to downplay the "Shape" features (or attributes), as they are not as "invariant" as "Background"; but this does not generalize to $\mathcal{T}$, where all clip-art images have "Background = pure", which is no longer discriminative.

Figure 2: Causal graph of domain adaptation.

A popular approach to make up for the above loss of "Shape" is to impose unsupervised reconstruction [5]: any feature loss hurts reconstruction. However, it is well known that discrimination and reconstruction are adversarial objectives, leaving the search for their ever-elusive trade-off parameter an open problem [8, 15, 24].

To systematically understand how the domain gap causes the feature loss, we propose to use the transportability theory [41] to replace the above "shift" assumptions, which overlook the explicit role of semantic attributes. Before we introduce the theory, we first abstract the DA problem in Figure 1 into the causal graph in Figure 2. We assume that the classification task is also affected by the unobserved semantic feature $U$ (e.g., shape and background), where $U \to X$ denotes the generation of pixel-level image samples and $U \to Y$ denotes the definition process of the semantic class. Note that these causalities have already been shown valid in their respective areas, e.g., $U \to X$ and $U \to Y$ can be image generation [16] and language generation [6], respectively. In particular, the introduction of the domain selection variable $S$ reveals that the fundamental reasons for both the covariate shift and the conditional shift are the domain-dependent distribution of $U$, i.e., $P_\mathcal{S}(U) \neq P_\mathcal{T}(U)$.

Note that the domain-aware $U$ is the confounder that prevents the model from learning the domain-invariant causality $X \to Y$. It has been theoretically proven that the confounding effect cannot be eliminated by statistical learning without causal intervention [43]. Fortunately, the transportability theory offers a principled DA solution based on the causal intervention in Figure 2:

$P_\mathcal{T}\big(Y \mid do(X)\big) = \sum_{u} P\big(Y \mid X, U=u\big)\, P_\mathcal{T}(U=u) \qquad (1)$

Note that the goal of DA is achieved by causal intervention using the do-operator [40]. To appreciate the merits of the calculus on the right-hand side of Eq. (1), we need to understand the following two points. First, $P(Y \mid X, U=u)$ is domain-agnostic, as the domain selector $S$ is separated from $Y$ given $X$ and $U$ [43]; thus it generalizes to $\mathcal{T}$ in testing even if it is trained on $\mathcal{S}$. Second, every stratum of $U$ is fairly adjusted subject to the domain prior $P_\mathcal{T}(U=u)$, e.g., forcing the model to respect "shape", as it is the only "invariance" that distinguishes between "speaker" and "bed" under controlled "cluttered" or "pure" backgrounds. Therefore, the semantic loss is eliminated.
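To make the adjustment in Eq. (1) concrete, the following minimal sketch evaluates it with a single binary confounder stratum; all probability values are hypothetical and chosen only to mirror the speaker/bed example above.

```python
import numpy as np

# Hypothetical setup: binary confounder U (background: cluttered=0, pure=1),
# binary label Y (speaker=0, bed=1), and a single fixed input X.
# P(Y | X, U=u): the domain-agnostic conditional, learnable from the source domain.
p_y_given_x_u = np.array([[0.9, 0.1],   # u=0 (cluttered): strong cue for "speaker"
                          [0.5, 0.5]])  # u=1 (pure): background uninformative

p_u_source = np.array([0.8, 0.2])  # P_S(U): cluttered backgrounds dominate
p_u_target = np.array([0.0, 1.0])  # P_T(U): clip-art images are all "pure"

# A naive source conditional P_S(Y|X) marginalizes over the SOURCE prior...
p_y_source = p_u_source @ p_y_given_x_u
# ...while the transport formula (Eq. (1)) re-weights strata by the TARGET prior.
p_y_do_target = p_u_target @ p_y_given_x_u

print(p_y_source)     # [0.82 0.18] -> overconfident "speaker" in the target
print(p_y_do_target)  # [0.5 0.5]   -> background correctly carries no signal
```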

However, Eq. (1) is purely declarative and far from practical, because it is still unknown how to implement the stratification and representation of the unobserved $U$, not to mention how to transport it across domains. To this end, we propose a novel approach to make Eq. (1) practical in UDA:

Identify the number of strata of $U$ by disentangled causal mechanisms. In practice, the number of strata can be large due to the combinations of multiple attributes, making Eq. (1) computationally prohibitive (e.g., requiring many network passes). Fortunately, if the attributes are disentangled, we can stratify $U$ according to the much smaller number $N$ of disentangled attributes, as the effect of each attribute is independent of the others [18], e.g., the effects of $u^{(i)}$ and $u^{(j)}$ factorize if they are disentangled. As detailed in Section 3.1, this motivates us to discover a small number of Disentangled Causal Mechanisms (DCMs) in an unsupervised fashion [39], each of which corresponds to an attribute-specific intervention between $\mathcal{S}$ and $\mathcal{T}$, e.g., mapping a real-world "speaker" image to its clip-art counterpart by changing "Background" from "cluttered" to "pure".

Represent $u$ by proxy variables. Yet, DCMs do not provide vector representations of $u$, leaving difficulties in implementing Eq. (1) as neural networks. In Section 3.2, we show that the transformed output of each DCM, taking $x$ as input, can be viewed as a proxy of $u$ [36], which provides a theoretical guarantee that we can replace the unobserved $u$ with the observed proxy to make Eq. (1) computable.

Overall, instead of transporting the abstract and unobserved confounder in the original transportability theory of Eq. (1), we transport the concrete and learnable disentangled causal mechanisms, which generate the observed proxy of $U$. Therefore, we term the approach Transporting Causal Mechanisms (TCM) for UDA. Through extensive experiments, our approach achieves state-of-the-art performance on UDA benchmarks: 70.7% on Office-Home [57], 90.5% on ImageCLEF-DA [20] and 75.8% on VisDA-2017 [45]. Specifically, we show that learning disentangled mechanisms and capturing the causal effect through proxy variables are the keys to improving the performance, which validates the effectiveness of our approach.

2 Related Work

Existing UDA works fall into three categories: 1) Sample Reweighing [54, 9]. This line adopts the covariate shift assumption. It first models $P_\mathcal{S}(X)$ and $P_\mathcal{T}(X)$. When minimizing the classification loss, each sample in $\mathcal{S}$ is assigned an importance weight, e.g., a target-like sample has a larger weight, which effectively simulates training on the target distribution. 2) Domain Mapping [19, 37]. This approach focuses on the conditional shift assumption. It first learns a mapping function that transforms samples from $\mathcal{S}$ to $\mathcal{T}$ through unpaired image-to-image translation techniques such as CycleGAN [67]. Then, the transformed source domain samples are used to train a classifier. 3) Invariant Feature Learning. This is the most popular approach in recent literature [14, 31, 59]. It maps the samples in $\mathcal{S}$ and $\mathcal{T}$ to a common feature space where source and target domain samples are indistinguishable. Some works [38, 30] adopted the covariate shift assumption and aim to minimize the difference between the feature distributions, using distance measures like Maximum Mean Discrepancy [56, 33] or adversarial training to fool a domain discriminator [14, 5]. Others used the conditional shift assumption and aim to align the class-conditional distributions [32, 31]. As the target domain samples have no labels, their class assignments are either estimated through clustering [21, 62] or with a classifier trained on the source domain samples [50, 63].

All of these methods aim to make the source and target domains alike while learning a classifier with the labelled data in the source domain, leading to the semantic loss. In fact, there are existing works that attempt to Alleviate Semantic Loss by learning to capture the semantic feature $U$ in an unsupervised fashion: one line exploits the fact that $U$ generates $X$ and learns a latent variable to reconstruct $X$ [5, 15]; the other line [7, 10] exploits the domain-specific nature of $U$ and aims to learn a domain-specific representation for each domain. However, this leads to an ever-elusive trade-off between the adversarial discrimination and reconstruction objectives. Our approach aims to fundamentally eliminate the semantic loss by providing a practical implementation of the transportability theory [41] based on causal intervention [40].

3 Approach

Though Eq. (1) provides a principled solution for DA, it is impractical due to the unobserved $U$. To this end, the proposed TCM is a two-stage approach to make it practical. The first stage identifies the stratification of $U$ by discovering Disentangled Causal Mechanisms (DCMs) (Section 3.1), and the second stage represents each stratum of $U$ with the proxy variable generated from the discovered DCMs (Section 3.2). The training and inference procedures are summarized in Algorithms 1 and 2, respectively.

1:  Input: labelled domain $\mathcal{S}$, unlabelled domain $\mathcal{T}$, pre-trained backbone parameters $\theta$
2:  Output: DCMs $\{(F_i, G_i)\}_{i=1}^N$, fine-tuned backbone parameters $\theta$ and linear function parameters $\Theta$
3:  Randomly initialize $\{(F_i, G_i)\}_{i=1}^N$, VAE parameters $\phi$, discriminator parameters $\psi$ (see Eq. (8))
4:  repeat
5:     // See Section 3.1 for details
6:     Sample $x$ randomly from $\mathcal{S}$ or $\mathcal{T}$
7:     Calculate $\mathcal{L}_{cyc}(F_i, G_i)$ for $i = 1, \ldots, N$
8:     Update the winning pair with Eq. (2)
9:  until convergence
10:  repeat
11:     // See Section 3.2 for details
12:     Sample $x$ from $\mathcal{S}$, $x'$ from $\mathcal{T}$
13:     Obtain $\tilde{X}(x) = \{F_i(x)\}_{i=1}^N$, $\tilde{X}(x') = \{G_i(x')\}_{i=1}^N$
14:     Calculate $h$ with Eq. (7)
15:     Obtain $P(y \mid do(x))$ with Eq. (5)
16:     Update $\Theta, \theta, \phi, \psi$ with Eq. (8)
17:  until convergence
Algorithm 1 Two-stage Training of TCM
1:  Input: test sample $x$ from $\mathcal{T}$, DCMs $\{(F_i, G_i)\}_{i=1}^N$, backbone parameters $\theta$ and linear function parameters $\Theta$
2:  Output: predicted label $\hat{y}$
3:  Obtain $\tilde{X}(x) = \{G_i(x)\}_{i=1}^N$
4:  Calculate $h$ with Eq. (7)
5:  $\hat{y} = \arg\max_y P(y \mid do(x))$ with Eq. (5)
Algorithm 2 Inference with TCM

3.1 Disentangled Causal Mechanisms Discovery

Physicists believe that real-world observations are the outcome of the combination of independent physical laws. In causal inference [18, 55], we denote these laws as disentangled generative factors, such as shape, color, and position. For example, in our causal graph of Figure 2, if the semantic attribute $U$ can be disentangled into generative causal factors $u^{(1)}, \ldots, u^{(N)}$, each one will independently contribute to the observation $X$. Even though the number of strata of $U$ is large, i.e., the sum in Eq. (1) is expensive, the disentanglement can still significantly reduce that number to a small $N$.
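To illustrate the reduction (a hypothetical simplification in our own notation, with binary factors):

```latex
% Illustrative stratum counting (hypothetical binary factors, our notation).
% If U = (u^{(1)}, ..., u^{(N)}) with each u^{(i)} \in \{0, 1\}, a naive
% evaluation of Eq. (1) sums over every joint stratum:
\[
P_{\mathcal{T}}\big(Y \mid do(X)\big)
  = \sum_{u^{(1)}=0}^{1} \cdots \sum_{u^{(N)}=0}^{1}
    P\big(Y \mid X, u^{(1)}, \ldots, u^{(N)}\big)\,
    P_{\mathcal{T}}\big(u^{(1)}, \ldots, u^{(N)}\big),
\]
% i.e., 2^N strata (network passes). With disentangled factors, each u^{(i)}
% can be intervened on separately, so only N mechanisms are required.
```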

Figure 3: (a) Our DCMs $\{(F_i, G_i)\}_{i=1}^N$, where each $F_i$ and $G_i$ correspond to interventions on the disentangled attribute $u^{(i)}$. (b) CycleGAN [67] loss on each $(F_i, G_i)$.

However, learning a disentangled representation of $U$ without supervision is challenging [29]. Fortunately, we can observe the outcomes of $U$: the source domain samples and the target domain samples generated from it. This motivates us to discover $N$ pairs of end-to-end functions $\{(F_i, G_i)\}_{i=1}^N$ in an unsupervised fashion, where $F_i: \mathcal{S} \to \mathcal{T}$ and $G_i: \mathcal{T} \to \mathcal{S}$, as shown in Figure 3 (a). Each pair corresponds to a counterfactual mapping that transforms a sample to the counterpart domain by intervening on a disentangled factor $u^{(i)}$, i.e., modifying that attribute while fixing the values of the other factors; this is essentially the definition of independent causal mechanisms [39], each of which corresponds to a disentangled causal factor. Hence we refer to these mapping functions as DCMs.

We first make a strong assumption that the one-to-one correspondence between the DCMs and the disentangled factors has already been established; we then relax this assumption for the proposed practical solution. Without loss of generality, we consider a source domain sample $x \in \mathcal{S}$. For each $F_i$, we obtain a transformed sample $F_i(x)$, corresponding to the interventional outcome of setting $u^{(i)}$ to a value drawn from $\mathcal{T}$. To ensure that the interventions are disentangled, we use the Counterfactual Faithfulness theorem [2, 60], which guarantees that $F_i$ is a disentangled intervention if and only if $F_i(x)$ is faithful, i.e., indistinguishable from real samples in $\mathcal{T}$ (proof in Appendix). Hence, as shown in Figure 3 (b), we use the CycleGAN loss [67], denoted as $\mathcal{L}_{cyc}(F_i, G_i)$, to make the transformed samples from each $F_i$ indistinguishable from the real samples in $\mathcal{T}$. Likewise, we apply the CycleGAN loss to each $G_i$ such that the transformed samples are close to the real samples in $\mathcal{S}$.
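For reference, a minimal sketch of the per-pair CycleGAN-style objective is given below; the least-squares adversarial form, the weight lambda_cyc, and the module names are our assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F_nn

def cyclegan_loss(F_i, G_i, D_S, D_T, x_s, x_t, lambda_cyc=10.0):
    """CycleGAN-style loss for one DCM pair (F_i: S->T, G_i: T->S).

    D_S / D_T are domain discriminators; x_s / x_t are batches from S / T.
    Returns the generator-side loss used to rank and update the winning pair.
    """
    fake_t, fake_s = F_i(x_s), G_i(x_t)

    # Adversarial terms: transformed samples should look real to the
    # counterpart-domain discriminator (counterfactual faithfulness).
    adv = F_nn.mse_loss(D_T(fake_t), torch.ones_like(D_T(fake_t))) + \
          F_nn.mse_loss(D_S(fake_s), torch.ones_like(D_S(fake_s)))

    # Cycle-consistency: mapping to the counterpart domain and back should
    # recover the input, i.e., only the intervened attribute is modified.
    cyc = F_nn.l1_loss(G_i(fake_t), x_s) + F_nn.l1_loss(F_i(fake_s), x_t)

    return adv + lambda_cyc * cyc
```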

Figure 4: Transformation between the "Art" and "Clipart" domains with 4 trained DCMs $\{(F_i, G_i)\}_{i=1}^4$, where each pair of mapping functions specializes in brightness, exposure, color temperature and sharpness, respectively.

Now we relax the above one-to-one correspondence assumption. We begin with a sufficient condition (proof in Appendix): if $F_i$ intervenes on $u^{(i)}$, the $i$-th mapping function outputs the counterfactually faithful generation, i.e., the one with the smallest CycleGAN loss. To "guess" the one-to-one correspondence, we adopt a practical method by using the necessary conclusion: if the $i$-th pair attains the smallest loss, then it corresponds to the intervention. Specifically, training samples are fed into all $N$ DCMs in parallel to compute $\mathcal{L}_{cyc}$ for each pair. Only the winning pair with the smallest loss is updated. The objective is given by:

$\min_{F_{i^*},\, G_{i^*}} \mathcal{L}_{cyc}(F_{i^*}, G_{i^*}), \quad \text{where}\;\; i^* = \arg\min_{i} \mathcal{L}_{cyc}(F_i, G_i) \qquad (2)$

The functions ($F_i$, $G_i$) in the optimization objective denote their parameters for simplicity. Note that this necessary condition is not sufficient; hence our approach has limitations. Still, this practice has been empirically justified in [39], and we leave a more theoretically complete approach as future work.
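Building on the cyclegan_loss sketch above, a condensed sketch of the winner-take-all update in Eq. (2); the per-pair optimizers are an implementation assumption.

```python
import torch

def dcm_training_step(dcm_pairs, discriminators, optimizers, x_s, x_t):
    """One competitive update: all N pairs score the batch, only the winner trains.

    dcm_pairs: list of (F_i, G_i); optimizers: one optimizer per pair.
    """
    with torch.no_grad():  # scoring pass: rank pairs without building graphs
        losses = [cyclegan_loss(F_i, G_i, *discriminators, x_s, x_t)
                  for F_i, G_i in dcm_pairs]
    winner = min(range(len(dcm_pairs)), key=lambda i: losses[i].item())

    # Only the winning pair receives gradients (Eq. (2)); over training, each
    # pair specializes in one disentangled attribute of the domain shift.
    F_w, G_w = dcm_pairs[winner]
    loss = cyclegan_loss(F_w, G_w, *discriminators, x_s, x_t)
    optimizers[winner].zero_grad()
    loss.backward()
    optimizers[winner].step()
    return winner, loss.item()
```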

After training, we obtain $N$ DCMs $\{(F_i, G_i)\}_{i=1}^N$, where the $i$-th pair corresponds to $u^{(i)}$. Hence we identify the number of strata of $U$ as $N$. Figure 4 shows an example with $N=4$, where the 4 DCMs correspond to "Brightness", "Exposure", "Temperature" and "Sharpness", respectively. Note that our DCM learning is orthogonal to the GAN models used: one can use more advanced GAN models to discover complex mechanisms such as shape or viewpoint changes.

3.2 Representing $u$ by Proxy Variables

Figure 5: Causal graph with the proxies $\tilde{X}$ and $W$.

While the learned DCMs identify the number of strata, they do not provide a vector representation of $u$. Hence the terms in Eq. (1) are still not computable. Fortunately, $U$ generates two observed variables $\tilde{X}$ and $W$, as shown in Figure 5, which are called proxy variables [36] of $U$ and make Eq. (1) estimable. We first explain $\tilde{X}$ and $W$ as well as their related causal links.

$\tilde{X}$. $\tilde{X}$ represents the DCM outputs, i.e., for a sample $x$ in $\mathcal{S}$, $\tilde{X}$ takes values from $\{F_i(x)\}_{i=1}^N$; and for a sample $x$ in $\mathcal{T}$, $\tilde{X}$ takes values from $\{G_i(x)\}_{i=1}^N$. As detailed in the Appendix, each DCM generates counterfactuals in the other domain by fixing the other attributes and intervening on $u^{(i)}$, i.e., $\tilde{X}$ is generated from $U$. Hence the link $U \to \tilde{X}$ is justified. Moreover, as the counterpart domain is also labelled (the counterfactual shares the label of its original sample), the link $\tilde{X} \to Y$ denotes the predictive effects from the generated counterfactuals.

$W$. $W$ is a latent variable encoded from $X$ by a VAE [23]. The link between $W$ and $X$ holds because the latent variable is trained to reconstruct $X$. Furthermore, the latent variable of a VAE has been shown to capture some information of the underlying $U$ [55], justifying the link between $U$ and $W$.

Note that the networks used to obtain $\tilde{X}$ and $W$ are trained in an unsupervised fashion. Hence $\tilde{X}$ and $W$ are observed in both $\mathcal{S}$ and $\mathcal{T}$. Under this causal graph, we have a theoretically grounded solution to Eq. (1) using the theorem below (proof in the Appendix as a corollary to [36]).

Theorem (Proxy Function). Under the causal graph in Figure 5, any solution $h$ to Eq. (3) satisfies Eq. (4).

$P(y \mid w, x) = \sum_{\tilde{x}} h(y, \tilde{x}, x)\, P(\tilde{x} \mid w, x) \qquad (3)$
$P\big(y \mid do(x)\big) = \sum_{\tilde{x}} h(y, \tilde{x}, x)\, P(\tilde{x}) \qquad (4)$

Notice that we can fit both sides of Eq. (3) on the labelled data in $\mathcal{S}$ to learn $h$ (details given below). We prove in the Appendix that $h$ is invariant across domains. This leads to the following inference strategy.

3.2.1 Inference

Taking the expectation over the observed values of $\tilde{X}$ on both sides of Eq. (4) leads to a practical solution to Eq. (1):

$P\big(y \mid do(x)\big) \approx \frac{1}{N} \sum_{i=1}^{N} h(y, \tilde{x}_i, x), \quad \tilde{x}_i \in \tilde{X}(x) \qquad (5)$

where the expectation is estimable as $\tilde{X}$ is observed in $\mathcal{T}$, and $h$ is trained in $\mathcal{S}$ as given below.
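A minimal inference sketch following Eq. (5) and Algorithm 2; h_fn, dcms_t2s, and the feature shapes are placeholders for the learned components.

```python
def tcm_predict(x_t, dcms_t2s, h_fn):
    """Predict a target-domain label via Eq. (5).

    x_t: backbone feature of a target sample; dcms_t2s: the N trained T->S
    mappings G_i; h_fn(x_tilde, x) -> class logits: the learned proxy function h.
    """
    # Each G_i(x) is one observed value of the proxy X~ for this sample.
    proxies = [G_i(x_t) for G_i in dcms_t2s]
    # Average h over the proxy values: the empirical expectation in Eq. (5).
    logits = sum(h_fn(x_tilde, x_t) for x_tilde in proxies) / len(proxies)
    return logits.argmax(dim=-1)
```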

3.2.2 Learning

We detail the training of $h$, including the representation of its inputs $x$ and $\tilde{x}$, the function forms of the terms in Eq. (3), and the training objective.

Representation of $x$ and $\tilde{x}$. When $x$ and $\tilde{x}$ are images, directly evaluating Eq. (5) in image space can be computationally expensive, not to mention the tendency for severe over-fitting. Therefore, we follow the common practice in UDA [31, 21] to introduce a pre-trained feature extractor backbone (e.g., ResNet [17]). Hereinafter, $x$ and $\tilde{x}$ denote their respective backbone features. Note that $w$ is encoded from the feature form of $x$.

Function Forms. We adopt an isotropic Gaussian distribution for $P(\tilde{x} \mid w, x)$. We implement $P(y \mid w, x)$ and the mean of $P(\tilde{x} \mid w, x)$ in Eq. (3) with the following linear models $f_1$ and $f_2$, respectively:

$f_1(w, x) = A_1 w + A_2 x + b_1, \qquad f_2(w, x) = A_3 w + A_4 x + b_2 \qquad (6)$

where $f_1$ produces logits for the classes and $f_2$ predicts $\tilde{x}$, with the dimensions of $A_1, \ldots, A_4$ and $b_1, b_2$ set accordingly. With this function form, we can solve $h$ from Eq. (3) as (derivation in Appendix):

$h(\cdot, \tilde{x}, x) = A_1 A_3^{+}\,(\tilde{x} - A_4 x - b_2) + A_2 x + b_1 \qquad (7)$

where $(\cdot)^{+}$ denotes the pseudo-inverse.
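A sketch of this closed-form solve; the matrix names follow the linear forms reconstructed in Eq. (6) above and are our own notation.

```python
import torch

def solve_h(A1, A2, b1, A3, A4, b2):
    """Closed-form proxy function h from the linear forms in Eq. (6).

    f1(w, x) = A1 w + A2 x + b1 (class logits), f2(w, x) = A3 w + A4 x + b2
    (predicted proxy mean). Matching coefficients in Eq. (3) yields Eq. (7).
    """
    H1 = A1 @ torch.linalg.pinv(A3)  # the pseudo-inverse in Eq. (7)

    def h(x_tilde, x):
        # Logits of h(., x_tilde, x): A1 A3^+ (x_tilde - A4 x - b2) + A2 x + b1
        return (x_tilde - x @ A4.T - b2) @ H1.T + x @ A2.T + b1

    return h
```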

Overall Objective. The training objective is given by:

$\min_{\Theta,\, \theta,\, \phi}\; \max_{\psi}\;\; \mathcal{L}_{cls} + \mathcal{L}_{VAE} + \lambda\, \mathcal{L}_{proxy} \qquad (8)$

where $\Theta$ denotes the parameters of the linear functions $f_1$ and $f_2$, $\theta$ denotes the parameters of the backbone, and $\phi$ denotes the parameters of the VAE. $\mathcal{L}_{cls}$ denotes the sum of the cross-entropy loss to train $f_1$ and the mean-squared-error loss to train $f_2$ (see Eq. (6)), $\mathcal{L}_{VAE}$ is the VAE loss, $\mathcal{L}_{proxy}$ is the proxy loss, $\psi$ denotes the parameters of the discriminators used in $\mathcal{L}_{proxy}$, and $\lambda$ is a trade-off parameter.
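Putting the pieces together, a condensed sketch of one minimization step of Eq. (8); the model's loss helpers correspond to the sketches in this section, and the λ value and optimizer setup are placeholders.

```python
def tcm_training_step(batch_s, batch_t, model, optimizer, lam=1.0):
    """One step of the min-max objective in Eq. (8). With the gradient
    reversal layer [13], the inner max over the discriminator parameters
    is realized in the same backward pass (see the proxy loss below)."""
    x_s, y_s = batch_s                    # labelled source features
    x_t = batch_t                         # unlabelled target features

    l_cls = model.cls_loss(x_s, y_s)      # CE for f1 + MSE for f2 (Eq. (6))
    l_vae = model.vae_loss(x_s)           # Eq. (9), source domain only
    l_proxy = model.proxy_loss(x_s, x_t)  # Eq. (10), via gradient reversal

    loss = l_cls + l_vae + lam * l_proxy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```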

VAE Loss. $\mathcal{L}_{VAE}$ is used to train the VAE, which contains an encoder $q_\phi(w \mid x)$ and a decoder $p_\phi(x \mid w)$. Given a feature $x$ in $\mathcal{S}$, $\mathcal{L}_{VAE}$ is given by:

$\mathcal{L}_{VAE} = -\,\mathbb{E}_{q_\phi(w \mid x)}\big[\log p_\phi(x \mid w)\big] + D_{KL}\big(q_\phi(w \mid x)\,\|\,p(w)\big) \qquad (9)$

where $D_{KL}$ denotes the KL-divergence and the prior $p(w)$ is set to an isotropic Gaussian. Note that we only need to learn a VAE in $\mathcal{S}$, as $h$ is domain-agnostic.
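A minimal sketch of Eq. (9) over backbone features; the two-layer encoder/decoder follows Section 4.2, while the layer widths and the Gaussian decoder are assumptions.

```python
import torch
import torch.nn as nn

class FeatureVAE(nn.Module):
    """Two-layer encoder/decoder VAE over backbone features (Eq. (9))."""
    def __init__(self, x_dim=2048, w_dim=128, hidden=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * w_dim))  # mean & log-var
        self.dec = nn.Sequential(nn.Linear(w_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def loss(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        w = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterize
        recon = ((self.dec(w) - x) ** 2).sum(-1)                  # -log p(x|w), Gaussian
        kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(-1)  # KL to N(0, I)
        return (recon + kl).mean()
```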

Proxy Loss. In practice, the generated images from the DCMs may contain artifacts that contaminate the feature outputs of the backbone [34], so that $\tilde{x}$, as a feature, no longer resembles the features in the counterpart domain. Hence we regularize the backbone with a proxy loss that enforces feature-level resemblance between the samples transformed by the DCMs and the samples in the counterpart domain. Given a feature $x$ in $\mathcal{S}$, a feature $x'$ in $\mathcal{T}$ and their corresponding DCM output sets (as backbone features), the loss is given by:

$\mathcal{L}_{proxy} = \log D_{\mathcal{S}}(x) + \log D_{\mathcal{T}}(x') + \frac{1}{N} \sum_{i=1}^{N} \Big[ \log\big(1 - D_{\mathcal{T}}(F_i(x))\big) + \log\big(1 - D_{\mathcal{S}}(G_i(x'))\big) \Big] \qquad (10)$

where $D_{\mathcal{S}}$ and $D_{\mathcal{T}}$ are discriminators, parameterized by $\psi$, that return a large value for the features of real samples. Through min-max adversarial training, the generated sample features become similar to the sample features in the counterpart domain and fulfill the role of a proxy.
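A sketch of the feature-level adversarial loss with the gradient reversal layer [13] (Section 4.2); the discriminator heads and the binary cross-entropy form are assumptions.

```python
import torch
import torch.nn.functional as F_nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer [13]: identity forward, negated gradient
    backward, folding the max over the discriminators into one backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad

def proxy_loss(D_S, D_T, x_s, x_t, tilde_s2t, tilde_t2s):
    """Feature-level adversarial loss of Eq. (10).

    tilde_s2t: features of DCM outputs F_i(x_s) (should look target-like);
    tilde_t2s: features of DCM outputs G_i(x_t) (should look source-like).
    """
    real = torch.cat([D_T(x_t), D_S(x_s)])
    fake = torch.cat([D_T(GradReverse.apply(f)) for f in tilde_s2t] +
                     [D_S(GradReverse.apply(f)) for f in tilde_t2s])
    return F_nn.binary_cross_entropy_with_logits(real, torch.ones_like(real)) + \
           F_nn.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake))
```

Through the reversal, minimizing this loss trains the discriminators while pushing the backbone to make the DCM-transformed features resemble the counterpart domain.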

4 Experiment

4.1 Datasets

Method A→C A→P A→R C→A C→P C→R P→A P→C P→R R→A R→C R→P Avg
ResNet-50 [17] 34.9 50.0 58.0 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1
DAN [30] 43.6 57.0 67.9 45.8 56.5 60.4 44.0 43.6 67.7 63.1 51.5 74.3 56.3
DANN [14] 45.6 59.3 70.1 47.0 58.5 60.9 46.1 43.7 68.5 63.2 51.8 76.8 57.6
MCD [51] 48.9 68.3 74.6 61.3 67.6 68.8 57.0 47.1 75.1 69.1 52.2 79.6 64.1
MDD [64] 54.9 73.7 77.8 60.0 71.4 71.8 61.2 53.6 78.1 72.5 60.2 82.3 68.1
CDAN [31] 50.7 70.6 76.0 57.6 70.0 70.0 57.4 50.9 77.3 70.9 56.7 81.6 65.8
SymNets [63] 47.7 72.9 78.5 64.2 71.3 74.2 63.6 47.6 79.4 73.8 50.8 82.6 67.2
CADA [25] 56.9 76.4 80.7 61.3 75.2 75.2 63.2 54.5 80.7 73.9 61.5 84.1 70.2
ETD [26] 51.3 71.9 85.7 57.6 69.2 73.7 57.8 51.2 79.3 70.2 57.5 82.1 67.3
GVB-GD [11] 57.0 74.7 79.8 64.6 74.1 74.6 65.2 55.1 81.0 74.6 59.7 84.3 70.4
Baseline 54.3 70.1 75.2 60.4 71.5 72.3 60.1 52.1 74.2 71.5 56.2 81.3 66.6
TCM (Ours) 58.6 74.4 79.6 64.5 74.0 75.1 64.6 56.2 80.9 74.6 60.7 84.7 70.7
Table 1: Accuracy (%) on the Office-Home dataset [57] with 12 UDA tasks, where all methods are fine-tuned from ResNet-50 [17] pre-trained on ImageNet [48].
Method I→P P→I I→C C→I C→P P→C Avg
ResNet-50 [17] 74.8±0.3 83.9±0.1 91.5±0.3 78.0±0.2 65.5±0.3 91.2±0.3 80.7
DAN [30] 74.5±0.4 82.2±0.2 92.8±0.2 86.3±0.4 69.2±0.4 89.8±0.4 82.5
DANN [14] 75.0±0.6 86.0±0.3 96.2±0.4 87.0±0.5 74.3±0.5 91.5±0.6 85.0
RevGrad [13] 75.0±0.6 86.0±0.3 96.2±0.4 87.0±0.5 74.3±0.5 91.5±0.6 85.0
MADA [44] 75.0±0.3 87.9±0.2 96.0±0.3 88.8±0.3 75.2±0.2 92.2±0.3 85.8
CDAN [31] 77.7±0.3 90.7±0.2 97.7±0.3 91.3±0.3 74.2±0.2 94.3±0.3 87.7
ALP [62] 79.6±0.3 92.7±0.3 96.7±0.1 92.5±0.2 78.9±0.2 96.0±0.1 89.4
SymNets [63] 80.2±0.3 93.6±0.2 97.0±0.3 93.4±0.3 78.7±0.3 96.4±0.1 89.9
ETD [26] 81.0 91.7 97.9 93.3 79.5 95.0 89.7
Baseline 77.2±0.5 89.7±0.2 96.1±0.3 92.1±0.4 75.2±0.3 93.5±0.4 87.3
TCM (Ours) 79.9±0.4 94.2±0.2 97.8±0.3 93.8±0.4 79.9±0.4 96.9±0.4 90.5
Table 2: Accuracy (%) and standard deviation on the ImageCLEF-DA dataset [20] with 6 UDA tasks, where all methods are fine-tuned from ResNet-50 [17] pre-trained on ImageNet [48].

We validated TCM on three standard benchmarks for visual domain adaptation:

ImageCLEF-DA [20] is a benchmark dataset for the ImageCLEF 2014 domain adaptation challenge, which contains three domains: 1) Caltech-256 (C), 2) ImageNet ILSVRC 2012 (I) and 3) Pascal VOC 2012 (P). Each domain has 12 categories with 50 images per category. We permuted all three domains and built six transfer tasks, i.e., I→P, P→I, I→C, C→I, C→P, P→C.

Office-Home [57] is a very challenging dataset for UDA with 4 significantly different domains: Artistic images (A), Clipart (C), Product images (P) and Real-World images (R). It contains 15,500 images from 65 categories of everyday objects in the office and home scenes. We evaluated TCM in all 12 permutations of domain adaptation tasks.

VisDA-2017 [45] is a challenging simulation-to-real dataset that is significantly larger than the other two datasets, with 280k images in 12 categories. It has two domains: Synthetic, with renderings of 3D models from different angles and under different lighting conditions; and Real, with natural real-world images. We followed the common protocol [31, 11] to evaluate on the Synthetic→Real task.

4.2 Setup

Evaluation Protocol. We followed the common evaluation protocol for UDA [31, 63, 11], where all labeled source samples and unlabeled target samples are used to train the model, and the average classification accuracy is compared in each dataset based on three random experiments. Following [63, 62], we reported the standard deviation on ImageCLEF-DA. For fair comparison, our TCM and all comparative methods used the backbone ResNet-50 [17] pre-trained on ImageNet [48].

Implementation Details. We implemented each mapping function in the DCMs as an encoder-decoder network, consisting of 2 down-sampling convolutional layers, followed by 2 ResNet blocks and 2 up-sampling convolutional layers. The loss for training the DCMs consists of the adversarial loss, the cycle-consistency loss and the identity loss, following the official code of CycleGAN. The encoder and decoder networks in the VAE were each implemented with 2 fully-connected layers. The min-max objective in Eq. (8) for the proxy loss was implemented using the gradient reversal layer [13]. The number of DCMs $N$ is a hyperparameter: we used a fixed $N$ for all Office-Home experiments and the VisDA-2017 experiment, and conducted an ablation on $N$ with ImageCLEF-DA.
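A sketch of the described generator; the paper specifies only the layer counts, so channel widths, kernel sizes, and normalization are assumptions.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)  # residual connection

def make_dcm_generator(ch=64):
    """Encoder-decoder DCM generator: 2 down-sampling conv layers,
    2 ResNet blocks, 2 up-sampling conv layers (Section 4.2)."""
    return nn.Sequential(
        nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.ReLU(),               # down x2
        nn.Conv2d(ch, 2 * ch, 4, stride=2, padding=1), nn.ReLU(),          # down x2
        ResBlock(2 * ch), ResBlock(2 * ch),                                # 2 ResNet blocks
        nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(), # up x2
        nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1), nn.Tanh())      # up x2
```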

Baseline. As existing domain-mapping methods either focus on the image segmentation task [27] or only evaluate in toy settings (e.g., digits) [19], we implemented a domain-mapping baseline. Specifically, we trained a CycleGAN that transforms samples between $\mathcal{S}$ and $\mathcal{T}$, with the same network architecture as each DCM in our proposed TCM. Then we learned a classifier on the transformed samples with the standard cross-entropy loss, while using the loss in Eq. (10) to align the features of the transformed samples with the target domain features. At inference, we directly used the learned classifier.

4.3 Results

Overall Results. As shown in Tables 1, 2 and 3, our method achieves state-of-the-art average classification accuracy on Office-Home [57], ImageCLEF-DA [20], and VisDA-2017 [45]. Specifically: 1) ImageCLEF-DA has the smallest domain gap, as its 3 domains all consist of real-world images from 3 datasets. Our TCM outperforms existing methods on 4 out of 6 UDA tasks. 2) Office-Home has a much larger domain gap, e.g., between the Artistic (A) and Clipart (C) domains. Besides improvements in average accuracy, our method significantly outperforms existing methods on the most difficult tasks, e.g., A→C and P→C. 3) VisDA-2017 is a large-scale dataset with 280k images, where TCM performs competitively. Note that the training complexity and convergence speed of TCM are comparable with existing methods (details in Appendix). Hence, overall, our TCM is generally applicable to small- to large-scale UDA tasks with various sizes of domain gaps.

Comparison with Baseline. We have three observations: 1) From Tables 1, 2 and 3, we notice that the Baseline does not perform competitively on the 3 benchmarks. This shows that the improvements from TCM are not the result of superior image generation quality from CycleGAN. 2) The Baseline accuracy in Table 2 is much lower than that of our TCM with a single DCM ($N=1$) in Table 4 on ImageCLEF-DA. The only difference between them is that the Baseline trains a classifier directly on the transformed samples, while TCM ($N=1$) uses the proxy function in Eq. (5). Hence this validates that the proxy theory in Section 3.2 has some effect in removing the confounding effect, even without stratifying $U$. As shown later, this is because the proxy can make up for the semantics lost in the original feature. 3) On all 3 benchmarks, when using multiple DCMs, TCM significantly outperforms the Baseline, which validates our practical implementation of Eq. (5) by combining DCMs and the proxy theory to identify the stratification and representation of $U$.

Method Acc.
DAN [30] 61.6
DANN [14] 57.4
GTA [52] 69.5
MDD [64] 74.6
CDAN [31] 70.0
GVB-GD [11] 75.3
DMRL [58] 75.5
Baseline 72.8
TCM (Ours) 75.8
Table 3: Accuracy (%) of the Synthetic→Real task on the VisDA-2017 dataset [45].
Figure 6: Plot of the number of DCMs $N$ against accuracy (%) on the 6 UDA tasks from ImageCLEF-DA [20].
N I→P P→I I→C C→I C→P P→C Avg
1 77.6 91.6 96.7 92.3 78.4 95.3 88.7
2 79.0 93.6 97.5 93.3 79.2 96.6 89.9
3 79.9 94.2 97.6 93.7 79.6 96.5 90.3
5 79.7 93.9 97.8 93.8 79.8 96.9 90.3
7 79.5 94.1 97.7 93.6 79.9 96.6 90.2
10 79.2 93.6 97.5 93.5 79.6 96.4 90.0
Table 4: Ablation on the number of DCMs $N$ with accuracy (%) on the 6 UDA tasks from ImageCLEF-DA [20].
Figure 7: t-SNE [35] plot of the features in $\mathcal{S}$ (red) and in $\mathcal{T}$ (blue) after training with: (a) GVB-GD [11], (b) our TCM. Only samples from the first 15 classes are plotted to avoid clutter.
Figure 8: Class Activation Maps (CAMs) [66] of GVB-GD [11], the Baseline and TCM on the Artistic→Clipart task in the Office-Home dataset [57]. The left column shows two samples in each domain, whose class name is indicated on the left. In $\mathcal{T}$, we show two samples predicted wrongly by GVB-GD and the Baseline, but correctly by TCM.

Number of DCMs. We performed an ablation on the number of DCMs $N$ on the ImageCLEF-DA dataset. The results are shown in Table 4 and plotted in Figure 6. We have three observations: 1) Among each step of increasing $N$, the largest difference occurs between $N=1$ and $N=2$. This means that accounting for 2 disentangled factors can already explain much of the domain shift due to $U$ in ImageCLEF-DA. 2) Across all tasks, the best performance is usually achieved with a small $N$ (3 or 5 in Table 4). This supports the effectiveness of stratifying $U$ according to a small number of disentangled attributes. 3) A large $N$ (e.g., 10) leads to a slight performance drop. In practice, we observed that when $N$ is large, a few DCMs seldom win any sample in Eq. (2) and hence are poorly trained. This can impact their image generation quality, leading to reduced performance. We argue that this is because the chosen $N$ is larger than the number of factors that can be disentangled from the dataset. For example, if there exist fewer factors than DCMs, once some DCMs establish correspondence with the existing factors, the remaining DCMs will never win any sample, as they can never generate images that are more counterfactually faithful than interventions on each existing factor (see the Counterfactual Faithfulness theorem in the Appendix).

t-SNE Plot. In Figure 7, we show the t-SNE [35] plots of the features in $\mathcal{S}$ and $\mathcal{T}$ on the Office-Home A→C task after training, for both the current SOTA GVB-GD [11] and our TCM. GVB-GD is based on invariant feature learning and focuses on making the two feature distributions alike, while our TCM does not explicitly align them. We indeed observe a better alignment between the features of $\mathcal{S}$ and $\mathcal{T}$ in GVB-GD. However, as shown in Table 1, our method outperforms GVB-GD on A→C. This shows that the current common view on the solution to UDA based on [1], i.e., making domains alike while minimizing the source risk, does not necessarily lead to the best performance. In fact, this is in line with the conclusions of recent theoretical works [65]. Our approach is orthogonal to existing approaches, and we demonstrate the practicality of a more principled solution to UDA, i.e., establishing an appropriate causal assumption on the domain shift and solving the corresponding transport formula (e.g., Eq. (1)).

Alleviating Semantic Loss. We use Figure 8 to reveal the semantic loss problem in existing methods and how our method alleviates it. The figure shows the CAM [66] responses on images in $\mathcal{S}$ and $\mathcal{T}$ from the Artistic→Clipart task in the Office-Home dataset. We compared with two lines of existing approaches: an invariant-feature-learning method (GVB-GD [11]) and a domain-mapping method (our Baseline), and we generated the CAM for each of them as shown in the red dotted box. For our TCM, the prediction is a linear combination of the effects from the two inputs: $x$ and the DCM outputs $\tilde{x}$ (see Eq. (7)). As the locations of objects tend to remain the same in $\tilde{x}$ as in $x$ (see Figure 4), we visualize the overall CAM of our TCM in the blue box by combining the CAMs from $x$ and $\tilde{x}$, weighted by their contributions towards the softmax prediction probability. On the left, we show the CAMs in $\mathcal{S}$. Looking at the CAMs of GVB-GD and the Baseline, we observe that they focus on the contextual object semantics (e.g., food that commonly appears together with a "fork") to distinguish "marker" and "fork", where the effect from the object shape is mostly lost. In contrast, our TCM focuses on the marker and the fork, i.e., the shape semantic is preserved in training. On the right, in $\mathcal{T}$, the contextual object semantic is no longer discriminative for the two classes. Hence GVB-GD and the Baseline either focus on the wrong semantic (e.g., plate) or become confused (focusing on a large area), leading to wrong predictions. Thanks to the preserved shape semantic, our TCM focuses on the object and makes the correct prediction. This provides an intuitive explanation of how our TCM alleviates the semantic loss (more examples in Appendix).

5 Conclusion

We presented a practical implementation of the transportability theory for UDA, called Transporting Causal Mechanisms (TCM). It systematically eliminates the semantic loss caused by the confounding effect. In particular, we proposed to identify the confounder stratification by discovering disentangled causal mechanisms, and to represent the unobserved confounder by proxy variables. Extensive results on UDA benchmarks showed that TCM is empirically effective. It is worth highlighting that TCM is orthogonal to existing UDA solutions: instead of making the two domains similar, we aim to find how to transport and what can be transported with a causality-theoretic viewpoint of the domain shift. We believe that the key bottleneck of TCM is the quality of causal mechanism disentanglement. Therefore, we will seek more effective unsupervised feature disentanglement methods and investigate their causal mechanisms.

6 Acknowledgements

The authors would like to thank all reviewers and ACs for their constructive suggestions, and specially thank Alibaba City Brain Group for the donations of GPUs. This research is partly supported by the Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University (NTU), Singapore; A*STAR under its AME YIRG Grant (Project No. A20E6c0101); and the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2 grant.

Appendix

This appendix is organized as follows:

  • For preliminaries on structural causal model and do-calculus, we refer readers to Section 2 of [41].

  • Section A.1 gives the proofs and derivations for Section 3, where we first prove the Counterfactual Faithfulness theorem in Section A.1.1, and then prove the sufficient condition used to establish the correspondence between DCMs and disentangled generative causal factors in Section A.1.2.

  • Section A.2 gives the proofs and derivations for Section 3.2, where we prove the Proxy Function theorem and its corollary in Section A.2.1, and then derive Eq. (7) in Section A.2.2.

  • Section A.3 provides implementation details. Specifically, in Section A.3.1, we provide the network architecture of DCMs, the implementation of CycleGAN loss and DCM training details. In Section A.3.2, we show the network architectures of our backbone, the VAE and discriminators, together with their training details. In Section A.3.3, we attend to some details in the experiment.

  • Section A.4 shows additional generated images from our DCMs and additional CAM results.

A.1 Proof and Derivation for Section 3

In this section, we will first derive the Counterfactual Faithfulness theorem. Then we will prove the sufficient condition in Section 3.1.

A.1.1 Counterfactual Faithfulness Theorem

We will first provide a brief introduction to the concepts of counterfactual and disentanglement. Causality allows one to compute how an outcome would have changed had some variables taken different values, which is referred to as a counterfactual. In Section 3.1, we refer to each DCM as a counterfactual mapping, where each $F_i$ (or $G_i$) essentially follows the three steps of computing counterfactuals [42] (conceptually): given a sample $x$, 1) in abduction, $u$ is inferred from $x$ through the generative process; 2) in action, the attribute $u^{(i)}$ is intervened on by setting it to a value drawn from $\mathcal{T}$ (or $\mathcal{S}$), while the values of the other attributes are fixed; 3) in prediction, the modified $u$ is fed to the generative process to obtain the output of the DCM $F_i(x)$ (or $G_i(x)$). More details regarding counterfactuals can be found in [40].

Our definition of disentanglement is based on the group-theoretic definition of [18]. Let $\mathcal{U}$ be the set of (unknown) generative factors, e.g., shape and background. There is a set of independent causal mechanisms generating observations from $\mathcal{U}$. Let $G$ be a group acting on $\mathcal{U}$, whose elements transform the factors (e.g., changing the background from "cluttered" to "pure"). When there exist direct product decompositions $G = G_1 \times \cdots \times G_N$ and $\mathcal{U} = \mathcal{U}_1 \times \cdots \times \mathcal{U}_N$ such that each $G_i$ acts only on $\mathcal{U}_i$, we say that each $\mathcal{U}_i$ is the space of a disentangled factor. A causal mechanism is disentangled when its transformation of the observation corresponds to the action of some $G_i$ on $\mathcal{U}_i$.

We use $\mathcal{X}$ and $\mathcal{U}$ to denote the vector spaces of the observations and the generative factors, respectively. We denote the generative process ($U \to X$) as a function $g: \mathcal{U} \to \mathcal{X}$. Note that we consider $g$ to be an embedded function [2], i.e., a continuous injective function with continuous inversion, which generally holds for convolution-based networks as shown in [46]. Without loss of generality, we will consider the mapping $F_i: \mathcal{S} \to \mathcal{T}$ in the analysis below, which can be easily extended to $G_i$. Our definition of disentangled intervention follows the intrinsic disentanglement definition in [2], given by:

Definition (Disentangled Intervention). A counterfactual mapping $F_i$ is a disentangled intervention with respect to $u^{(i)}$ if there exists a transformation $T: \mathcal{U} \to \mathcal{U}$ affecting only $u^{(i)}$, such that for any $x$,

$F_i(x) = g\big(T(g^{-1}(x))\big) \qquad (11)$

Then we have the following theorem:

Theorem (Counterfactual Faithfulness Theorem). The counterfactual mapping $F_i$ is faithful if and only if $F_i$ is a disentangled intervention with respect to $u^{(i)}$.

Note that by definition, if $F_i$ is faithful, then $F_i(x) \in g(\mathcal{U})$. To prove the above theorem, one direction is trivial: if $F_i$ is a disentangled intervention, it is by definition an endomorphism of $g(\mathcal{U})$, so the counterfactual mapping must be faithful. For the other direction, let us assume a faithful counterfactual mapping $F_i$. Given that $g$ is embedded, the counterfactual mapping can be decomposed as:

$F_i = g \circ T \circ g^{-1}, \qquad (12)$

where $\circ$ denotes function composition and $T$ is a transformation affecting only $u^{(i)}$. Now, for any $x$, the quantity $F_i(x)$ can be similarly decomposed as:

$F_i(x) = g\big(T(g^{-1}(x))\big). \qquad (13)$

Since $T$ is a transformation in $\mathcal{U}$ that only affects $u^{(i)}$, we have shown that the faithful counterfactual mapping is a disentangled intervention with respect to $u^{(i)}$, hence completing the proof.

With this theorem, faithfulness is equivalent to disentangled intervention. In Section 3.1, we train $F_i$ such that $F_i(x)$ is indistinguishable from real samples in $\mathcal{T}$ (faithfulness) for every sample in $\mathcal{S}$, hence encouraging $F_i$ to be a disentangled intervention. Note that the above analysis easily generalizes to $G_i$.

A.1.2 Sufficient Condition

We will prove the following sufficient condition: if $F_i$ intervenes on $u^{(i)}$, the $i$-th mapping function outputs the counterfactually faithful generation, i.e., the one with the smallest CycleGAN loss.

Without loss of generality, we will prove the condition for the mapping $F_i: \mathcal{S} \to \mathcal{T}$, which can be extended to $G_i$. For a sample $x$ in $\mathcal{S}$, let $u = g^{-1}(x)$. We modify $u$ by changing $u^{(i)}$ to a value drawn from the target domain; denote the modified attribute vector as $u'$, and the sample with attributes $u'$ as $x' = g(u')$. Given that $F_i$ intervenes on $u^{(i)}$, $F_i(x)$ corresponds to the counterfactual outcome when $u^{(i)}$ is set to its modified value through intervention. Now, as $u$ and $u'$ agree on all other attributes, using the counterfactual consistency rule [43], we have $F_i(x) = x'$. As $x'$ is faithful, by the Counterfactual Faithfulness theorem we prove that $F_i(x)$, i.e., the output of the $i$-th mapping function, is also faithful, i.e., attains the smallest loss.

A.2 Proof and Derivation for Section 3.2

In this section, we will first derive the Proxy Function theorem and the domain-agnostic nature of the proxy function, and then derive Eq. (7) under our chosen function forms in Section 3.2.

A.2.1 Proxy Function Theorem

We will derive the result for the general case where the proxy is any continuous variable. We will assume that the confounder $U$ satisfies the completeness condition in [36], which accommodates most commonly-used parametric and semi-parametric models such as exponential families.

Given that $h$ solves Eq. (3), we have:

(14)

From the law of total probability, we have:

(15)

Combining Eq. (14) and Eq. (15) with the completeness condition, we have:

(16)

which proves the Proxy Function theorem.

From Eq. (16), we have:

(17)

Hence, from the completeness condition [36], we prove that $h$ is domain-agnostic.

Note that in Section 3.2, our proxy $\tilde{X}$ is a continuous random variable taking values from $\{F_i(x)\}_{i=1}^N$ for a sample in $\mathcal{S}$, or from $\{G_i(x)\}_{i=1}^N$ for a sample in $\mathcal{T}$. This is a special case of the analysis above, with the probability mass of $\tilde{X}$ centered on the set of its possible values.

A.2.2 Derivation of Eq. (7)

We derive Eq. (7) as a corollary to [36]. The goal is to solve for $h$ under the function forms in Eq. (6) from the formula below:

(18)

For simplicity, we define a standard multivariate Gaussian function $\mathcal{N}(\cdot)$. With the Gaussian function forms above, we have:

(19)

Our function form for $P(y \mid w, x)$ is given by $f_1$ and for $P(\tilde{x} \mid w, x)$ by $f_2$, where the variance terms are omitted in the main text for brevity, as the final results only depend on the means. Specifically, the covariance matrix is symmetric with eigendecomposition $\Sigma = Q \Lambda Q^{\top}$, where $Q$ is a full-rank matrix containing the eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues. With a change of variables based on this decomposition, we can rewrite the density as:

(20)

With the corresponding definitions of the transformed variables, we define

(21)

Now we can solve for $h$ from

(22)

Specifically, taking the Fourier transforms of the two sides of Eq. (22), respectively, we have:

(23)