Learning Domain Invariant Representations by Joint Wasserstein Distance Minimization

06/09/2021 · by Léo Andéol, et al.

Domain shifts in the training data are common in practical applications of machine learning; they occur, for instance, when the data comes from different sources. Ideally, an ML model should work well independently of these shifts, for example by learning a domain-invariant representation. Moreover, privacy concerns regarding the data source may also require a domain-invariant representation. In this work, we provide theoretical results that link domain-invariant representations, measured by the Wasserstein distance on the joint distributions, to a practical semi-supervised learning objective based on a cross-entropy classifier and a novel domain critic. Quantitative experiments demonstrate that the proposed approach is indeed able to practically learn such an invariant representation (between two domains), and that the latter also supports models with higher predictive accuracy on both domains, comparing favorably to existing techniques.


1 Introduction

Learning from data that originates from different sources representing the same physical observations occurs rather commonly, but it is nevertheless a highly challenging endeavor. These multiple data sources may originate from different users, acquisition devices, or geographical locations; they may reflect batch effects in biology; or they may come from measurement devices that are each calibrated differently. Because the source of the data itself is typically not task-relevant, a learned model is required to be invariant across domains. A valid strategy for achieving this is to learn an invariant intermediate representation (illustrated in Figure 1). Furthermore, in certain applications, privacy requirements such as anonymity dictate that the source should not be recoverable from the representation. Hence, building a domain invariant representation can also be a desideratum by itself.

Domain invariance, in some contexts referred to as sub-population shift [26] or distributional shift [1, 11, 44, 48, 14, 13, 18], can be contrasted with two related and well-researched areas, domain adaptation [53, 54, 5, 9] and domain generalization [6, 40, 30, 31, 12, 57, 58]. In domain adaptation, we are mainly concerned with the model performance on the (unlabeled) target domain, often at the expense of incurring more errors on the (labeled) source domain. Domain generalization aims to build a model that generalizes across all domains, including unseen ones, but assumes that all samples are fully labeled. Few works in domain adaptation [8, 33, 49, 21, 24] and domain generalization [51] have tackled the more realistic and flexible setting where all domains are partially labeled, and none of them, to our knowledge, has a rigorous theoretical grounding. This is due to the inherent difficulty of studying partially labeled probability distributions. Furthermore, the generality of these settings imposes additional constraints on the solution, which can hamper the careful enforcement of invariance on the domains at hand. Hence, domain invariance, our focus in this paper, addresses a singular and important problem which has so far received little attention, especially in the context of deep learning models, and we are not aware of any work that considers partially labeled domains.

Figure 1: Illustration of the problem of domain invariance. We would like to learn a function that maps the data to a representation where the domains cannot be differentiated, and from which a domain-invariant classifier can be built. The invariant representation induced by this model can serve further purposes such as domain privacy or extraction of domain-related insights.

In order to address domain invariance, we consider the Wasserstein distance [43, 55] in the present work, as it characterizes the weak convergence of measures and displays several advantages, as discussed in [2]. We contribute several bounds relating the Wasserstein distance between the joint distributions of the (here two) domains to practical learning objectives, and upon those bounds we construct an objective function for practical domain-invariant networks based on standard cross-entropy classifiers and a newly proposed, theory-based domain discriminator. (Anecdotally, the use of a domain discriminator relates our method to works on domain adaptation such as DANN [16] and WDGRL [52].) A significant part of the novelty of our work lies in the different formalism and problem setting that we have adopted, which allows us to establish theory for formally studying partially labeled distributions. In our framework, the classifiers can be trained either in a fully supervised or in a semi-supervised manner. With the proposed theoretical grounding, the studied classifiers can not only achieve high classification accuracy on the task of interest; their successful domain-invariant training mechanistically lowers the Wasserstein distance between the two domains, thereby inducing additional desirable side-effects such as improving domain privacy or exposing the joint domain subspace, which can subsequently be mined for new insights.

Our proposed model is analyzed on several domain invariance benchmark tasks (MNIST vs. SVHN, and the multi-domain PACS dataset). In all cases, our novel framework is able to simultaneously achieve high classification accuracy on all domains and high invariance of the representation. Lastly, we conduct an inspection of the learned invariant representation using UMAP embeddings [35] and ‘explainable AI’ (cf. [47, 50]). Our inspection highlights, across the whole dataset, the overall qualities of the learned invariant representation. We also explore whether building such an invariant representation requires looking only at the intersection of the features of the two domains [32]. Interestingly, we find that recognizing and exploiting domain-specific features remains an integral part of the neural network's strategy to arrive at the desired invariant representation.

2 Domain Invariance and Optimal Transport

Domain invariance can be described as the property of a representation to be indistinguishable with regard to its domain of origin; in particular, the multiple data distributions projected into the representation space should look the same (i.e. have low distance). A recently popular way of measuring the distance between two distributions is the Wasserstein distance. It can be interpreted as the cost of transporting the probability mass of one distribution to the other when following the optimal transport plan, and it can be formally defined as follows:

Definition 1.

Let $\mu$ and $\nu$ be two arbitrary probability distributions defined over two measurable metric spaces $\Omega_1$ and $\Omega_2$. Let $c : \Omega_1 \times \Omega_2 \to \mathbb{R}_+$ be a cost function. Their Wasserstein distance is:

$W(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Omega_1 \times \Omega_2} c(\omega_1, \omega_2)\, d\pi(\omega_1, \omega_2)$ (1)

with $\Pi(\mu, \nu) = \{\pi \in \mathcal{P}(\Omega_1 \times \Omega_2) : p_{1\#}\pi = \mu,\ p_{2\#}\pi = \nu\}$, where $p_{1\#}\pi$ and $p_{2\#}\pi$ are the push-forwards of $\pi$ under the projections onto $\Omega_1$ and $\Omega_2$. This can be loosely interpreted as being the set of joint distributions that have marginals $\mu$ and $\nu$.

Hence, we measure the invariance of representations by how low the Wasserstein distance is between the two distributions associated with the two domains. In comparison to other common alternatives such as the Kullback-Leibler divergence, the Jensen-Shannon divergence, or the Total Variation distance, the Wasserstein distance has the advantage of taking into account the metric of the representation space (via the cost function $c$), instead of looking at pure distributional overlap, and this typically leads to better ML models [39, 2]. Computing the Wasserstein distance with Eq. (1) is expensive. Luckily, if we use the metric of our space as the cost function, for example the Euclidean distance $c(x, x') = \|x - x'\|_2$, we can derive a dual formulation of the 1-Wasserstein distance as follows:

$W_1(\mu, \nu) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{x' \sim \nu}[f(x')]$ (2)

This formulation replaces the explicit computation of a transport plan by a function $f$ to estimate, a task particularly well suited to neural networks. Recently, several methods have used this approach to learn distributions [39], specifically in the context of Generative Adversarial Networks [2, 52]. The main constraint lies in the necessity for the function $f$, which we will call the discriminator, to be 1-Lipschitz. A few approaches have been proposed to tackle this problem, such as gradient clipping [2], gradient penalty [20], and more recently spectral normalization [36]. It is however important to note that in practice the set of possible discriminators will be a subset of the 1-Lipschitz continuous functions.
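To make the discussion concrete, a minimal PyTorch sketch of such a discriminator-based estimate of Eq. (2) could look as follows; layer sizes and names are our own assumptions, and spectral normalization is used as one of the mentioned options to approximately enforce the 1-Lipschitz constraint.

import torch.nn as nn
from torch.nn.utils import spectral_norm

# Critic approximating the 1-Lipschitz function f of Eq. (2); spectral
# normalization constrains the Lipschitz constant of each linear layer.
critic = nn.Sequential(
    spectral_norm(nn.Linear(128, 256)), nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 256)), nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),
)

def w1_estimate(z_a, z_b):
    """Lower bound on W1 between two batches of 128-dimensional representations."""
    return critic(z_a).mean() - critic(z_b).mean()

# In an adversarial setup, the critic is trained to maximize this estimate,
# while the feature extractor is trained to minimize it.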

3 Learning Domain Invariant Representations

This section focuses on the core question of our work, which is how to build a domain invariant representation. We start with some notation: we denote by $\mathcal{X}$ our representation (feature) space and by $\mathcal{Y}$ our label (target) space, both assumed to be compact measurable spaces, and by $\mathcal{P}(\mathcal{X} \times \mathcal{Y})$ the set of probability distributions defined on their product space. Consider two true probability distributions $P_S, P_T \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$, one associated to each domain. When necessary, we add a subscript to these distributions to specify their support. We now want to align these distributions, i.e. we would like our classifier to be invariant to them. Similarly to previous works, we consider the Wasserstein distance of samples embedded in feature space, and we will therefore want to learn a mapping which takes the form of a feature extraction function $\varphi$ followed by a classifier $h: \mathcal{X} \to \mathcal{Y}$.
We note that an invariant representation is by itself very simple to obtain if we do not require it to be inductive and to solve an actual machine learning task (e.g. classification). Our strategy for extracting the desired representation consists of starting with the Wasserstein distance $W_1(P_S, P_T)$ and upper-bounding it in multiple stages, in such a way that terms of the desired machine learning objective progressively appear, e.g. a classification function that can be optimized for maximum accuracy. From a theoretical perspective, this allows us to demonstrate how flavors of the data, such as the availability of both supervised and unsupervised data, can be iteratively and flexibly worked into the upper bound.
The individual steps are presented in Sections 3.1–3.3 (see Figure 2 for an overview of our theoretical framework). Our approach draws some inspiration from [9] and is based on measure theory, in order to formalize partially labeled distributions and therefore our problem of aligning multiple joint distributions.

Figure 2: Illustration of the upper-bound relation between DNN classifier objective (red) and Wasserstein distance between the joint distributions of each domain (our object of interest). Semi-supervised empirical distributions are accessed from the true distribution via the triangle inequality. The expanded terms are upper-bounded by empirical components of a DNN that can be optimized.

3.1 Incorporating Semi-Supervised Data

Computation of the true Wasserstein distance would require knowledge of the true distributions $P_S$ and $P_T$. In practice, we only have a finite sample of these distributions, and the quality with which the Wasserstein distance can be approximated largely depends on the amount of labeled data available. (For high-dimensional tasks, the necessary amount of labels would be overwhelming.) However, in practice, it is common that unlabeled data is available in much larger quantity than labeled data. We consider this semi-labeled scenario, where only a fraction $\alpha_S$ of the samples is really drawn from $P_S$ (i.e. labeled). The remaining samples are unlabeled and obtained from the marginal $P_{S,\mathcal{X}}$; by using a function that outputs labels given features, say a DNN classifier, we obtain an estimate $P_S^h$ of the true joint distribution. Therefore the distribution we observe samples from is a mixture $\hat{P}_S = \alpha_S P_S + (1 - \alpha_S) P_S^h$. Identically for the second domain, $\hat{P}_T = \alpha_T P_T + (1 - \alpha_T) P_T^h$. Note however that the two mixtures need not be constructed identically: $\alpha_S$, the proportion of labeled samples in the first domain, may differ from $\alpha_T$. We start by simply applying the triangle inequality in order to make the distance between the partially labeled distributions appear.
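As an illustration, a minimal sketch of how such observed joint samples could be assembled in PyTorch follows; the convention that unlabeled samples carry the placeholder label -1 and the helper's name are our own assumptions.

import torch
import torch.nn.functional as F

def observed_joint(z, y, num_classes, classifier):
    """Return (feature, label-distribution) pairs from the observed mixture:
    true one-hot labels where available (y >= 0), and the classifier's
    softmax prediction for the unlabeled part (y == -1)."""
    with torch.no_grad():
        pseudo = F.softmax(classifier(z), dim=1)
    onehot = F.one_hot(y.clamp(min=0), num_classes).float()
    labeled = (y >= 0).float().unsqueeze(1)
    return z, labeled * onehot + (1.0 - labeled) * pseudo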

Proposition 1.

Let the cost function $c$ be the metric on the space $\mathcal{X} \times \mathcal{Y}$; we then have

$W_1(P_S, P_T) \le W_1(P_S, \hat{P}_S) + W_1(\hat{P}_S, \hat{P}_T) + W_1(\hat{P}_T, P_T)$ (3)
Proof.

The 1-Wasserstein distance being a metric, we can apply the triangle inequality, and obtain the proposition by applying it twice. ∎

We observe that the upper bound consists of multiple terms. First, $W_1(P_S, \hat{P}_S)$ and $W_1(\hat{P}_T, P_T)$, the distances between the observed and true joint distributions of each domain. Second, $W_1(\hat{P}_S, \hat{P}_T)$, the distance between the two observed distributions. We analyze these quantities subsequently.

3.2 Incorporating a Cross-Entropy Classifier

Let us first analyse the distance between observed and true joint distributions. We consider here the case of $W_1(P_S, \hat{P}_S)$ (analogously for the second domain). Intuitively, since $P_S$ is a component of the mixture $\hat{P}_S$, by computing the distance between the two, we partly compute the distance of $P_S$ to itself. Indeed, the decomposition of the Wasserstein distance with respect to the mixture is an upper bound of our original distance, obtained using the Jensen inequality on the supremum (the proof can be found in the Supplement). The result is summarized in the following lemma.

Lemma 1.

Let $\nu$ be an arbitrary probability distribution, and let $\mu_1, \mu_2$ be probability distributions mixed with weights $\alpha$ and $1 - \alpha$; we then have

$W_1(\alpha \mu_1 + (1 - \alpha)\mu_2, \nu) \le \alpha W_1(\mu_1, \nu) + (1 - \alpha) W_1(\mu_2, \nu)$ (4)

Therefore, since the Wasserstein distance is symmetric, we can upper bound our distance as $W_1(P_S, \hat{P}_S) \le (1 - \alpha_S)\, W_1(P_S, P_S^h)$, since $W_1(P_S, P_S) = 0$. This new quantity, the distance between the true and the estimated distribution, intuitively corresponds to a classification task. We therefore upper bound this distance using tools more common in the machine learning community, namely the Kullback-Leibler divergence, which is equivalent to the cross-entropy loss when the true labeling is deterministic. The following lemma details this upper bound (for the proof see the Supplement).

Lemma 2.

Assuming that $P_S$ and $P_S^h$ admit densities, we then obtain

(5)

where $\mathrm{KL}$ is the Kullback-Leibler divergence, and $\mathrm{diam}(\cdot)$ is the diameter of the space, i.e. the largest distance obtainable in that space. We have now obtained an upper bound on our original distance between $P_S$ and $\hat{P}_S$ that leads to a classification task that is easily computable, therefore linking theory to practical applications and common usage.

3.3 Adding a Class-Dependent Domain Critic

Concerning the distance between the partially labeled distributions, $W_1(\hat{P}_S, \hat{P}_T)$, we can compute its empirical version, as the data we have is sampled from those distributions. In the practical case where we want to maximize domain invariance irrespective of the qualitative value of the representations for machine learning, we may use this distance directly. However, there is an inherent issue with using the common dual of the 1-Wasserstein distance (Eq. 2), which is in fact a specific case of the 1-Wasserstein dual. This specific case arises when the cost function used is the metric of our space, commonly the Euclidean distance, which here amounts to measuring distances between concatenated feature-label vectors. In practice, the norms of our representation vectors are usually larger than those of the labels (represented as one-hot vectors, as they are categorical variables) by several orders of magnitude. Optimal transport, being oblivious to the objectives of machine learning, will focus on minimizing the distance between samples irrespective of their labels. We therefore explore a refinement of this term that is specific to the classification setting, and we obtain Lemma 3.

Lemma 3.

Assuming that $\hat{P}_S$ and $\hat{P}_T$ admit densities, we then obtain

(6)

where the right-hand side involves, for each class $y$, a class-weighted unbalanced Wasserstein distance between the joint distributions.

This Wasserstein distance can be seen as a distance between joint distributions for a fixed value of $y$. The lemma therefore indicates that we can decompose a Wasserstein distance on a joint distribution into a sum over class labels $y$ of unbalanced Wasserstein distances on fixed-$y$ joint distributions. Practically, our discriminator network has as many outputs as there are classes, and its output for a given sample is weighted by the conditional probability vector (a one-hot vector for known labels, the softmax output of the classifier otherwise). A more straightforward interpretation is that we can minimize the Wasserstein distance between matching classes of each domain, independently. This general idea has been explored in the past in [31], but with a different approach and under stronger assumptions (e.g. a uniform class distribution for each domain).
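A minimal PyTorch sketch of such a class-dependent critic follows; layer sizes and names are our own assumptions. The critic produces one output per class, and its outputs are weighted by the conditional label distribution (one-hot for labeled samples, classifier softmax otherwise) before the domain difference is taken.

import torch.nn as nn
from torch.nn.utils import spectral_norm

class ClassCritic(nn.Module):
    """Domain critic with one output per class (cf. Lemma 3)."""
    def __init__(self, feat_dim, num_classes, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(feat_dim, hidden)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(hidden, hidden)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(hidden, num_classes)),
        )

    def forward(self, z):
        return self.net(z)  # shape: (batch, num_classes)

def class_critic_term(critic, z_a, p_a, z_b, p_b):
    """Sum over classes of the class-weighted critic difference between the
    two domains; p_a and p_b are conditional label distributions."""
    return ((critic(z_a) * p_a).sum(dim=1).mean()
            - (critic(z_b) * p_b).sum(dim=1).mean())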

3.4 Formulation of the Objective Function

By applying all the previous results to the original proposition, that is, by applying the triangle inequality and upper-bounding the expanded terms, we obtain an upper-bound on the Wasserstein distance composed exclusively of terms that we will be able to meaningfully incorporate in a practical machine learning model:

Theorem 1.

Given that the cost function used is the metric on the product space $\mathcal{X} \times \mathcal{Y}$:

(7)

with subscripted distributions denoting the corresponding marginals.

This result informs us on how to build an objective function that solves the classification task and also minimizes the Wasserstein distance between the two true distributions. We now derive it step by step. Let us first consider that our data consists of samples and labels from both domains, drawn from the distributions we referred to as “observed”. Formally, we have $\{(x_i^S, y_i^S)\}_{i=1}^{n_S}$ and $\{(x_j^T, y_j^T)\}_{j=1}^{n_T}$, where $n_S$ and $n_T$ are the numbers of samples obtained from each domain. From the right-hand side of Lemma 2 we can now straightforwardly derive the two classification losses, one per domain, defined in their empirical version, assuming deterministic labels, as follows

(8)

Moreover, to define the domain-critic term we need to introduce additional notation: the discriminator $d$, or domain critic, corresponds to the function $f$ as it appears in Eq. (2). However, as per Lemma 3, $d$ does not have a single output, but one for each class, i.e. $d: \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}$.

Finally, as we use a shared classifier for both domains, we denote it by $h$, and we use $\varphi$ to denote our feature extractor. We can therefore formulate the optimization problem as follows, including the Lipschitzness constraint on the discriminator, which is necessary to estimate the Wasserstein distance.

(9)

The Lipschitzness constraint is enforced in practice by using one of the regularization techniques mentioned at the end of Section 2. Supplementary regularization terms, such as EntMin [19], Virtual Adversarial Training [37], and Virtual Mixup [34], can be added to the objective to take further advantage of the unlabeled examples. A visual representation of our model is given in Figure 3, and a schematic training step is sketched below.

Figure 3: Diagram of the proposed DNN model that induces a domain-invariant representation through a domain critic. The domain critic can either be a single discriminator between the two distributions, or one discriminator per class (cf. Section 3.3).
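A schematic PyTorch training step for such an objective might look as follows. This is a sketch under our own naming, reusing class_critic_term from the sketch in Section 3.3; label_distribution is a hypothetical helper returning one-hot vectors for labeled samples and the classifier's softmax outputs otherwise, and the critic weight lam is set to the value 0.1 reported in the Supplement.

import torch
import torch.nn.functional as F

def training_step(phi, h, critic, batch_s, batch_t, lam=0.1):
    """One evaluation of the joint objective (schematic).
    batch_* = (x_labeled, y_labeled, x_unlabeled) for one domain."""
    zs_l, zs_u = phi(batch_s[0]), phi(batch_s[2])
    zt_l, zt_u = phi(batch_t[0]), phi(batch_t[2])

    # Cross-entropy terms (Section 3.2) on the labeled part of each domain.
    cls_loss = (F.cross_entropy(h(zs_l), batch_s[1])
                + F.cross_entropy(h(zt_l), batch_t[1]))

    # Class-dependent critic term (Section 3.3) on all samples of each domain.
    zs, zt = torch.cat([zs_l, zs_u]), torch.cat([zt_l, zt_u])
    ps = label_distribution(zs, batch_s[1], h)   # hypothetical helper
    pt = label_distribution(zt, batch_t[1], h)
    critic_term = class_critic_term(critic, zs, ps, zt, pt)

    # The critic ascends critic_term; phi and h descend the full objective.
    return cls_loss + lam * critic_term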

3.5 Generalization Bounds

Interestingly, the Wasserstein distance between the true distributions of the two domains (which we have upper-bounded in Theorem 1) can also be related to the risks of the classifier on the two domains. Let $\epsilon_S(h)$ and $\epsilon_T(h)$ denote the risk, or error, of a classifier $h$ on each domain. We here develop a result using the joint Wasserstein distance, similar to a previous result obtained by [46] on the distance between marginals.

Theorem 2.

Let $\mathcal{X}$ and $\mathcal{Y}$ be two compact measurable metric spaces whose product space has dimension $q$, and let $P_S, P_T$ be two joint distributions associated to the two domains. Let the transport cost function associated to the optimal transport problem be the Euclidean distance on $\mathcal{X} \times \mathcal{Y}$, and let the loss function be symmetric and $k$-Lipschitz. Then for any $q' > q$ there exists some constant $N_0$ depending on $q'$ such that, for any $\delta > 0$ and sufficiently many samples from each domain, with probability at least $1 - \delta$, for all $k$-Lipschitz classifiers $h$ the following holds:

(10)

In other words, the Wasserstein distance between the two domains upper-bounds the prediction performance gap between the two domains. In practice, we can therefore expect the optimization of the objective in Eq. (9) to not only reduce the Wasserstein distance between domains (as we have shown in the previous sections), but also to produce a more uniform classification accuracy on the two domains and therefore a higher minimum accuracy.

4 Experiments

In our experiments, we would like to test the ability of our model to achieve invariant representations and stable classification performance. We first consider two popular benchmark datasets: MNIST [27] and SVHN [41], each of them constituting one domain. The first is a common black-and-white digit recognition dataset composed of 60000 training examples; the second is another popular dataset of 73257 examples, in which digits are colored, have more complex appearances, and are harder to predict. We then consider the more recent and more complex PACS image recognition dataset [29], which consists of 10000 examples, with 4 domains (Photo, Art, Cartoon, Sketch) and 7 classes.

4.1 Data and Models

In this section we briefly describe the MNIST vs. SVHN and the PACS scenarios, and the models trained on this data. More details are provided in the Supplement. In the MNIST vs. SVHN scenario, we only provide labels for 3000 randomly selected examples from each domain. MNIST images are brought to the SVHN format by rescaling them and setting each RGB component to the MNIST grayscale value (see the sketch after this paragraph). For the experiments in Table 1, the feature extractor is a ResNet-18 [22]; for those in Table 2, it is the Conv-Large model from [37]. Both models take as input images of size 32×32. We use small random translations of 2 pixels as well as color jittering as data augmentation. In the PACS scenario, we randomly sample 500 labels from each domain. The classes and domains are imbalanced, i.e. they contain different numbers of examples. The images are resized to a fixed resolution, and a data augmentation pipeline based on RandAugment [10] is applied. We use the ResNet-18 architecture. On this dataset, we test domain invariance in a ‘one vs. rest’ setting.
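For illustration, the MNIST-to-SVHN-format conversion and the light augmentation described above could be expressed with standard torchvision transforms roughly as follows; the jitter strengths are placeholder values, not taken from the paper.

from torchvision import transforms

# Resize MNIST digits to the 32x32 SVHN resolution, copy the grayscale value
# into all three RGB channels, and apply small translations and color jitter.
mnist_to_svhn_format = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.Grayscale(num_output_channels=3),
    transforms.RandomAffine(degrees=0, translate=(2 / 32, 2 / 32)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])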

In both cases, the classifier is a simple 2-layer MLP, and the discriminator is a 3-layer MLP with spectral-normalized weights [36]. (On the multi-domain PACS dataset, we use one discriminator per domain, computed in a one-vs-rest manner.) The weight (hyperparameter) of each loss term is set to one, except that of the discriminator, which is set to a smaller value (0.1 in the experiments detailed in the Supplement). Unless mentioned otherwise, the networks are trained for 20 to 50 epochs using the Adam [25] optimizer. A sketch of the resulting architecture is given below.
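Assembled in PyTorch, the described components read roughly as follows; hidden widths and activations are our own assumptions, and the exact hyperparameters are those listed in the Supplement.

import torch.nn as nn
from torch.nn.utils import spectral_norm
from torchvision.models import resnet18

num_classes, feat_dim = 10, 512

# Feature extractor: ResNet-18 backbone with its classification head removed.
phi = resnet18()
phi.fc = nn.Identity()

# Classifier: simple 2-layer MLP on top of the 512-dimensional features.
h = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                  nn.Linear(256, num_classes))

# Domain critic: 3-layer MLP with spectral-normalized weights and one output
# per class (cf. Section 3.3).
d = nn.Sequential(
    spectral_norm(nn.Linear(feat_dim, 256)), nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 256)), nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, num_classes)),
)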

4.2 Results and Analysis

As a first experiment, we study the effect of the domain critic on the accuracy of the model and on the Wasserstein distance between the two domains. Table 1 shows the results for the standard (joint) domain critic derived from the triangle inequality in Section 3.1, our improved class-dependent domain critic (Section 3.3), a more basic critic based on marginals (as proposed in [52]), and the absence of a domain critic. We report in particular the Wasserstein distance between the two domains' joint distributions and the minimum classification accuracy over the two domains: two properties that our domain-invariant network encourages (Theorem 1 and Theorem 2 respectively). For this experiment we do not use any additional losses or regularizers, and simply optimize the classification and discriminator terms.

Domain Critic               | MNIST | SVHN  | Avg.  | Min.  | W dist.
Class-Dependent (Ours)      | 96.52 | 84.59 | 90.56 | 84.59 |  5.50
Joint distributions (Ours)  | 96.54 | 78.60 | 87.57 | 78.60 | 10.98
Marginal distributions [52] | 96.31 | 79.00 | 87.66 | 79.00 | 12.39
None                        | 91.68 | 69.82 | 80.75 | 69.82 | 16.51
Table 1: Effect of the domain critic on the classification accuracy (%; per domain, average, and minimum) and on the Wasserstein distance between the two domains. Best performance is shown in bold.

Results corroborate our theory. In particular, we observe that the Wasserstein distance strongly decreases under the effect of adding a domain critic, and the minimum accuracy over the two domains increases. Here, in particular, the use of a class-dependent domain critic, which puts more focus on the label information, leads to the highest accuracy in our benchmark. Somewhat surprisingly, we also achieve an even lower Wasserstein distance when using the upper bound of Lemma 3. We conjecture that having multiple classifier-weighted discriminators eases the joint optimization of the classifier and discriminator losses. The absence of a domain critic (last row in the table) deactivates the use of unsupervised data and leads to significantly lower performance.

A common alternative to leverage unsupervised data is the typical semi-supervised learning formulation. Because semi-supervised learning has shown powerful results on data with manifold structure (e.g. [45, 28]), we add to our benchmark a semi-supervised baseline which consists of a combination of conditional entropy minimization (EntMin) [19] and virtual adversarial training (VAT) [37], two powerful techniques that have shown strong empirical performance on numerous tasks. Results are shown in Table 2.

Model                             | MNIST | SVHN  | Avg.  | Min.  | W dist.
Semi-Sup (VAT + EntMin) on MNIST  | 99.14 |  --   |  --   |  --   |  --
Semi-Sup (VAT + EntMin) on SVHN   |  --   | 94.79 |  --   |  --   |  --
Supervised on Both                | 98.76 | 87.33 | 93.05 | 87.33 |  3.12
Semi-Sup (VAT + EntMin) on Both   | 99.29 | 91.86 | 95.58 | 91.86 |  3.11
Ours (VAT + EntMin)               | 99.26 | 92.75 | 96.01 | 92.75 |  0.78
Ours (VAT + EntMin) + Fine-Tuning | 99.09 | 94.33 | 96.71 | 94.33 |  1.97
Table 2: Comparison of our method to supervised and semi-supervised learning baselines (3000 labels per domain); accuracy in %. Best results are in bold. For indicative purposes, the first two rows report the classification accuracy of semi-supervised models trained on a single domain.

We observe that semi-supervised learning on both domains, complemented by the VAT and EntMin regularization techniques, leads to a strong baseline; in particular, it achieves the highest performance on MNIST, but lower SVHN performance, leading to lower aggregated accuracy. Our domain-invariant approach, combined with the same regularization techniques, improves over the baselines by achieving higher aggregated accuracy and producing a much more invariant representation. A final supervised fine-tuning step on our learned model further improves the accuracy, but at the expense of some domain invariance.

Finally, Table 3 shows prediction performance on the more complex PACS dataset. We test our model on this data in a one-vs-rest setting, so that the model must learn to be invariant between one domain and the three remaining domains.

         | Art vs. R       | Cartoon vs. R   | Photo vs. R     | Sketch vs. R    | Overall
Model    | Acc.  | W dist. | Acc.  | W dist. | Acc.  | W dist. | Acc.  | W dist. | Avg. Acc. | Min. Acc. | Avg. W dist.
Ours     | 77.15 |  3.87   | 88.61 |  5.06   | 83.41 |  6.98   | 71.52 |  5.03   | 80.18     | 71.52     | 5.24
Marginal | 84.08 |  6.77   | 87.07 |  7.48   | 78.26 |  9.18   | 64.44 |  9.93   | 78.46     | 64.44     | 8.34
None     | 84.03 |  6.93   | 85.62 |  9.64   | 78.74 | 10.97   | 60.45 | 10.14   | 77.21     | 60.45     | 9.42
Table 3: Comparison of our method to a classic marginal discriminator and to semi-supervised learning (500 labels per domain) on the PACS dataset. Accuracy (%) and Wasserstein distance are reported per domain vs. the rest, and overall. Best overall results are in bold.

Again, we find that our model produces the best overall minimum and average accuracy, and it also has the lowest average Wasserstein distance. A trade-off may exist between Art and the other domains: although our method performs worse than its competitors on this domain, it leads not only to a higher average accuracy, but also to domain accuracies that are more concentrated around the mean. We omit the joint discriminator here, as we have observed that, all other things being equal, it performs worse in terms of accuracy and Wasserstein distance than any other listed method.

Lastly, we would like to stress that the problem of domain invariance has received considerably less attention in the context of deep neural networks than the tasks of domain adaptation and domain generalization. Our quantitative results as well as the multiple baseline results aim to provide useful reference values for future work on domain invariance.

4.3 Visual Insights on Learned Representations

While results in the section above have verified quantitatively the performance of the model, we would like to also present some qualitative insights.

As a first experiment, we visualize how the representation becomes more task-specific and less domain-dependent throughout training. For this, we take samples from both domains, join them, and perform a UMAP [35] analysis. Plots before and after training are shown in Figure 4 (left). We observe that the two domains are strongly separated initially, but under the influence of domain-invariant training, they collapse to the same regions in representation space. The learned representation also better resolves the different classes (here roughly given by the cluster structure).

Figure 4: Left: UMAP visualization of the extracted representation before and after training. Right: Response to content (class) and style (color).

As a second experiment, we present SVHN-like synthetic examples to the network and vary the digit and the colors. We then compute for each prediction its response obtained using the LRP explanation method [3] (details in the Supplement). Examples and model responses are shown in Figure 4 (right). Although we would expect that style and color play a marginal role in representation space (our objective has enforced invariance between the colored SVHN and the black&white handwritten MNIST domains), recognizing such style and color variations remains an integral part of the neural network prediction strategy. We indeed observe that the model precisely adapts to the input digit by providing individualized response maps of corresponding colors. This strategy is therefore instrumental in the process of building the domain invariant representation.

5 Conclusion

Real-world data is often heterogeneous, subject to sub-population shifts, or coming from multiple domains. In this work, we have for the first time studied the problem of learning domain-invariant representations as measured by the Wasserstein distance. We have created a theoretical framework for semi-supervised domain invariance and have contributed several upper bounds on the Wasserstein distance between joint distributions that link domain invariance to practical learning objectives. In our benchmark experiments, we find that optimizing the resulting objective leads to high prediction accuracy on both domains while simultaneously achieving high domain invariance, which we also observe qualitatively on feature maps. We have also observed that, contrary to what is sometimes speculated, domain adversarial training can use domain-specific features to build invariant representations.

Our work allows for several future extensions, the main one being the generalization of our theory to more than two domains, particularly through the use of Wasserstein barycenters, which may provide an appropriate framework. Moreover, it would be interesting to obtain a theoretical connection to other representation learning methods, in particular contrastive learning, which may be integrated into our method. Finally, an extension of our theory to domain generalization could enable more applications and increase our understanding of domain generalization itself.

Overall, our work on domain invariance provides new theoretical insights as well as quantitative competitive results for a number of scenarios and baselines. We believe it thereby constitutes a useful first basis for further research on domain-invariant ML models and applications thereof.

Acknowledgements

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea Government (No. 2017-0-00451, Development of BCI based Brain and Cognitive Computing Technology for Recognizing User’s Intentions using Deep Learning and No. 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University), by the German Ministry for Education and Research (BMBF) under Grants 01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18025A and 01IS18037A; by the German Research Foundation (DFG) under Grant Math+, EXC 2046/1, Project ID 390685689; and by Japan Society for the Promotion of Science under KAKENHI Grant Number 17H00764, and 19H04071.

References

Appendix A Proofs of Main Results

In the following, we give the proofs of the main theoretical results presented in the paper. After providing the formal mathematical proof, we detail for each of them in the paragraph below the steps taken to reach the final result.

Lemma 1.

Let $\nu$ be an arbitrary probability distribution, and let $\mu_1, \mu_2$ be probability distributions mixed with weights $\alpha$ and $1 - \alpha$; we then have

$W_1(\alpha \mu_1 + (1 - \alpha)\mu_2, \nu) \le \alpha W_1(\mu_1, \nu) + (1 - \alpha) W_1(\mu_2, \nu)$ (11)
Proof.
$W_1(\alpha\mu_1 + (1-\alpha)\mu_2, \nu) = \sup_{\|f\|_L \le 1} \int f \, d(\alpha\mu_1 + (1-\alpha)\mu_2) - \int f \, d\nu$ (12)
$= \sup_{\|f\|_L \le 1} \alpha \int f \, d\mu_1 + (1-\alpha) \int f \, d\mu_2 - \int f \, d\nu$ (13)
$= \sup_{\|f\|_L \le 1} \alpha \left( \int f \, d\mu_1 - \int f \, d\nu \right) + (1-\alpha) \left( \int f \, d\mu_2 - \int f \, d\nu \right)$ (14)
$\le \sup_{\|f\|_L \le 1} \alpha \left( \int f \, d\mu_1 - \int f \, d\nu \right) + \sup_{\|f\|_L \le 1} (1-\alpha) \left( \int f \, d\mu_2 - \int f \, d\nu \right)$ (15)
$= \alpha \sup_{\|f\|_L \le 1} \left( \int f \, d\mu_1 - \int f \, d\nu \right) + (1-\alpha) \sup_{\|f\|_L \le 1} \left( \int f \, d\mu_2 - \int f \, d\nu \right)$ (16)
$= \alpha W_1(\mu_1, \nu) + (1-\alpha) W_1(\mu_2, \nu)$ (17)

We start the proof in eq. 12 by simply stating the definition of the dual of the 1-Wasserstein distance (with the cost function being the metric of our space). Because the measures involved are finite, we can decompose the integral of the mixture of measures into a mixture of integrals of each of the elementary measures, as we do in eq. 13. In eq. 14 we also decompose the second integral into parts weighted by the same $\alpha$ and $1-\alpha$ (which sum to one), group them with the corresponding integral on the other measure, and factorize by their weights. Using a property of the supremum in eq. 15, we upper bound the supremum of a sum by the sum of the suprema. Pulling the constants out in eq. 16, we obtain a sum of two Wasserstein distances through their dual definition, and we hence obtain eq. 17 and complete the proof. ∎

Lemma 2.

Assuming that $P_S$ and $P_S^h$ admit densities, we then obtain

(18)
Proof.

In order to prove this result we have to rely on an upper bound of the Wasserstein distance by the Kullback-Leibler divergence, through the combination of two standard bounds. We therefore present this result here and a quick proof.

  

Lemma.
From [17]. Let $\mu$ and $\nu$ be two probability distributions on a compact measurable space $\Omega$; we then have
(19)
and
(20)
Proof.
Combine the bound of the Wasserstein distance by the Total Variation distance (Theorem 4 of [17]) with the bound of the Total Variation distance by the Kullback-Leibler divergence given by Pinsker's inequality. ∎
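Concretely, writing $\delta(\mu, \nu)$ for the total variation distance and $\mathrm{diam}(\Omega)$ for the diameter of the compact space, the chain obtained by combining the two cited bounds reads (a sketch in notation we introduce here):

$W_1(\mu, \nu) \;\le\; \mathrm{diam}(\Omega)\, \delta(\mu, \nu) \;\le\; \mathrm{diam}(\Omega)\, \sqrt{\tfrac{1}{2}\, \mathrm{KL}(\mu \,\|\, \nu)}.$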
  

With that result, we show that under our conditions the Kullback-Leibler divergence between the joint distributions is in fact the expected KL divergence (over the marginal distribution on $\mathcal{X}$) between the conditional distributions. Let $p$ and $q$ be the densities of $P_S$ and $P_S^h$ respectively.

$\mathrm{KL}(P_S \,\|\, P_S^h) = \int_{\mathcal{X} \times \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{q(x, y)} \, d(x, y)$ (21)
$= \int_{\mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(x)\, p(y|x)}{q(x)\, q(y|x)} \, dx$ (22)
$= \mathbb{E}_{x \sim p} \left[ \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(x)\, p(y|x)}{q(x)\, q(y|x)} \right]$ (23)
$= \mathbb{E}_{x \sim p} \left[ \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{q(y|x)} \right]$ (24)
$= \mathbb{E}_{x \sim p} \left[ \mathrm{KL}\big(p(\cdot|x) \,\|\, q(\cdot|x)\big) \right]$ (25)

The first line (eq. 21) is the definition of the Kullback-Leibler divergence with densities. Eq. 22 is an application of Fubini's theorem, which allows us to decompose the double integral, together with the decomposition of the joint density into the product of marginal and conditional; as $\mathcal{Y}$ is a discrete space, the inner integral becomes a sum over classes, and the marginal density is pulled out of the sum. Eq. 23 replaces the outer integral by an expectation, by definition. In eq. 24, since by definition $p(x) = q(x)$ (the two distributions share the same marginal on $\mathcal{X}$), those terms cancel in the fraction. Eq. 25 is again an application of the definition of the KL divergence.
By combining the equality obtained in eq. 25 and the cited lemma, we complete the proof. ∎

Lemma 3.

Assuming that $\hat{P}_S$ and $\hat{P}_T$ admit densities, we then obtain

(26)

where the right-hand side involves, for each class $y$, a class-weighted unbalanced Wasserstein distance between the joint distributions.

Proof.

Let us denote the densities of $\hat{P}_S$ and $\hat{P}_T$ by $\hat{p}_S$ and $\hat{p}_T$ respectively; we then have

(27)
(28)
(29)
(30)
(31)
(32)

We start this proof in eq. 27 by stating the dual formulation of the 1-Wasserstein distance, written using densities. We decompose the integral using Fubini's theorem in eq. 28 and apply a property of the supremum in eq. 29. We then separate the integrals in eq. 30, replace the integrals by expectations in eq. 31, and complete the proof with our formulation in eq. 32. ∎

Theorem 1.

Given that the cost function used is the metric on the product space $\mathcal{X} \times \mathcal{Y}$:

(33)

with subscripted distributions denoting the corresponding marginals.

Proof.
(34)
(35)
(36)
(37)
(38)
(39)

We obtain eq. 34 from Proposition 1. Using Lemma 1 twice, on the first and third terms, we obtain eq. 35. As the Wasserstein distance is a metric when its cost function is itself a metric (we use the Euclidean distance), it is zero if and only if the two distributions are identical, which happens twice here, in terms 1 and 4; we can therefore remove them and obtain eq. 36. Finally, in eq. 37 we apply Lemma 3 to the second term and Lemma 2 to the first and third terms, and we complete the proof. ∎

Theorem 2.

Let $\mathcal{X}$ and $\mathcal{Y}$ be two compact measurable metric spaces whose product space has dimension $q$, and let $P_S, P_T$ be two joint distributions associated to the two domains. Let the transport cost function associated to the optimal transport problem be the Euclidean distance on $\mathcal{X} \times \mathcal{Y}$, and let the loss function be symmetric and $k$-Lipschitz. Then for any $q' > q$ there exists some constant $N_0$ depending on $q'$ such that, for any $\delta > 0$ and sufficiently many samples from each domain, with probability at least $1 - \delta$, for all $k$-Lipschitz classifiers $h$ the following holds:

(40)
Proof.

Let $\pi^\ast$ be the optimal coupling between $P_S$ and $P_T$.

(41)
(42)
(43)
(44)
(45)
(46)
(47)
(48)
(49)
(50)
(51)
(52)

We start the demonstration by replacing definitions with explicit formulations, in eq. 41 and again in eq. 42. In eq. 43 we replace a difference of integrals by the integral with respect to the difference of measures, which leads to a form related to the dual of the Wasserstein distance. A consequence of the Kantorovich-Rubinstein duality theorem is that eq. 43 and eq. 44 are equal for the optimal coupling. The next equation, eq. 45, uses a property of the absolute value, namely the triangle inequality. We then add two terms summing to zero in eq. 46 and apply the same property of the absolute value again, to obtain eq. 47. Having the absolute value of the difference of a Lipschitz function at two values allows us to upper bound that difference by the distance between the inputs of the function, up to the Lipschitz constant. We apply that operation on two terms to obtain eq. 48, and again on the first term to obtain eq. 49. Using the Cauchy-Schwarz inequality, the sum of two Euclidean distances can be upper bounded via the Euclidean distance between the concatenated vectors, as in eq. 50, which corresponds to the cost function used in our discriminator throughout the main paper. The next steps consist of pulling the constants out of the integral (eq. 51) and replacing the explicit formulation of the 1-Wasserstein distance by its notation.

We have completed the main part of the proof. The next step is to apply a classical concentration bound which allows us to replace the distributions in the Wasserstein distance by their empirical counterparts. We now reintroduce the concentration bound we use:

  

Theorem.
From [7], Theorem 1.1. Let $\mu$ be a probability measure on $\mathbb{R}^q$ satisfying a transport inequality, and let $\hat{\mu}_n$ be its associated empirical measure. Then, for any $q' > q$, there exists some constant $N_0$, depending on $q'$ and on some square-exponential moment of $\mu$, such that for any $\varepsilon > 0$ and $n \ge N_0 \max(\varepsilon^{-(q'+2)}, 1)$,
(53)
  

Now, by using the triangle inequality, we can introduce Wasserstein distances between the empirical and true distributions,

(54)

and by applying the concentration bound twice, we obtain our final result and complete the proof:

(55)

Appendix B Details of the Experiments of Section 4.2

Hardware & Computation

All experiments but PACS were conducted on a single RTX 2060 Super. The PACS experiments were conducted on a single TITAN RTX. All experiments were conducted on a desktop computer. Most experiments lasted between 1 and 3 hours and none more than 6 hours.

Implementation

Our model is implemented using PyTorch [42] and torchvision as the framework, timm [56] and nfnets-pytorch [4] for access to normalizer-free networks, and PythonOT [15] to compute the Wasserstein distances reported in the tables. Our code is available at https://github.com/leoandeol/ldir and in the supplemental materials. It contains everything necessary to reproduce the experiments, with the exception of the data itself, which can easily be obtained from the official sources.
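For reference, the reported Wasserstein distances can be computed with PythonOT along the following lines; this is a sketch in which the feature arrays passed in are placeholders for the representations (or joint feature-label vectors) of the two domains.

import ot  # PythonOT

def wasserstein_distance(feats_a, feats_b):
    """Exact 1-Wasserstein distance between two empirical distributions
    with uniform sample weights and Euclidean ground cost."""
    a = ot.unif(len(feats_a))
    b = ot.unif(len(feats_b))
    M = ot.dist(feats_a, feats_b, metric='euclidean')
    return ot.emd2(a, b, M)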

Results of Table 1

We use a cross-entropy classification loss (equivalent to the Kullback-Leibler divergence in the case of a deterministic labeling rule) with a weight of 1 and no regularization losses, while the domain critic has a weight of 0.1. We use the standard ResNet-18 [22] and simple data augmentation (small translations and color jittering, as provided with PyTorch) for all experiments.

Results of Table 2

We use VAT. The VAT, conditional entropy, and classification losses all have a weight of 1, while the domain critic has a weight of 0.1. We do not use Virtual Mixup. Fine-tuning (for classification) consists of one more epoch, without the discriminator loss, and with the loss intuitively reweighted by the error of each domain (0.25 for MNIST, to avoid forgetting, and 0.75 for SVHN, to improve). We use the large ConvNet for SVHN from the VAT paper [37] and simple data augmentation (small translations and color jittering, as provided with PyTorch) for all experiments.

Results of Table 3

Unless stated otherwise, we use the same settings as for Table 1. We use VAT with an adaptive radius, as introduced in [23], with the same parameters. We use a different data augmentation pipeline, relying on RandAugment [10].

Appendix C Details of the Analyses of Section 4.3

In this section, we give details of the implementation of UMAP and LRP analyses performed on the learned representations and classifiers. We also provide further UMAP visualizations.

C.1 Application of UMAP

Implementation

To compute the two-dimensional UMAP embeddings of the learned representations, we used the official implementation of UMAP [35], with the Euclidean distance and 75 neighbors. We kept all other parameters at their default values.
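The corresponding call is roughly the following (a minimal sketch of the settings stated above):

import umap

def embed_2d(features):
    """Two-dimensional UMAP embedding of the learned representations
    (Euclidean metric, 75 neighbors, all other parameters left at default)."""
    reducer = umap.UMAP(n_neighbors=75, metric='euclidean')
    return reducer.fit_transform(features)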

Observations

In addition to the embeddings shown in the main paper, we show in Figure 5 further embeddings corresponding to the networks presented in Table 2 of the main paper. Beyond the obvious improvement over untrained features, we observe that the supervised and semi-supervised approaches in (b) and (c) extract class structure (visible as distinct clusters), but tend not to produce strongly domain-invariant representations. Our method (d) incorporates domain alignment in the objective, and we observe a much stronger overlap between the red and blue points representing the MNIST and SVHN domains respectively. However, as shown in the paper, a single epoch of fine-tuning can lead to higher classification accuracy at the cost of a higher Wasserstein distance, which we observe here in (e) as a worsened domain alignment compared to (d).

Figure 5: Visualization of the representations learned by the neural networks of Table 2 in the main paper. Representations are embedded using UMAP and the different examples are color-coded by domain (SVHN in blue, MNIST in red).

C.2 Application of LRP

We apply the LRP method for the task of attributing the predictions of the considered Conv-Large architecture [37] to the input pixels and color channels. The Conv-Large architecture is composed of an alternation of convolutions, batch normalizations, Leaky ReLUs, and max-pooling functions. Before applying LRP, we adopt the strategy described in [38] of fusing the batch-normalization layers into the parameters of the adjacent convolution layers, so that we arrive at a simplified but functionally equivalent neural network which consists only of max-pooling layers and convolution-LeakyReLU layers. For the max-pooling layers, we adopt the commonly used winner-take-all redistribution [3], i.e. we redistribute the relevance to the neuron in the pool that has the maximum activation. For the convolution-LeakyReLU layers, we extend the LRP rule defined in [38] to account for negative input and output activations. Writing such layers as:

where the convolution is written as a generic weighted sum, where