1 Introduction
Learning from data that originates from different sources representing the same physical observations is common, yet highly challenging. These data sources may, e.g., originate from different users, acquisition devices, or geographical locations; they may encompass batch effects in biology; or they may come from measurement devices of the same kind that are each calibrated differently. Because the source of the data is typically not task-relevant, a learned model is required to be invariant across domains. A valid strategy for achieving this is to learn an invariant intermediate representation (illustrated in Figure 1). Furthermore, in certain applications, privacy requirements such as anonymity dictate that the source should not be recoverable from the representation. Hence, building a domain-invariant representation can also be a desideratum in itself.
Domain invariance, in some contexts referred to as subpopulation shift [26] or distributional shift [1, 11, 44, 48, 14, 13, 18], can be contrasted with two related and well-researched areas: domain adaptation [53, 54, 5, 9] and domain generalization [6, 40, 30, 31, 12, 57, 58]. In domain adaptation, we are mainly concerned with model performance on the (unlabeled) target domain, often at the expense of incurring more errors on the (labeled) source domain. Domain generalization aims to build a model that generalizes across all domains, including unseen ones, but assumes that all samples are fully labeled. A few works in domain adaptation [8, 33, 49, 21, 24] and domain generalization [51] have tackled a more realistic and flexible setting where all domains are partially labeled, but none of them, to our knowledge, has a rigorous theoretical background. This is due to the inherent difficulty of studying partially labeled probability distributions. Furthermore, the generality of these settings imposes additional constraints on the solution that can hamper the careful enforcement of invariance on the domains at hand. Hence, domain invariance, our focus in this paper, addresses a singular and important problem that has so far received little attention, especially in the context of deep learning models, and we are not aware of any work that considers partially labeled domains.
In order to address domain invariance, we consider the Wasserstein distance [43, 55] in the present work, as it characterizes the weak convergence of measures and displays several advantages, as discussed in [2]. We contribute several bounds relating to the Wasserstein distance between the joint distributions of (here two) domains and, building upon those, construct an objective function for practical domain-invariant networks based on standard cross-entropy classifiers and a newly proposed, theory-based domain discriminator. (Anecdotally, the use of a domain discriminator relates our method to works on domain adaptation such as DANN [16] and WDGRL [52].) A significant part of the novelty of our work lies in the different formalism and problem setting that we have adopted, which allows us to establish theory for formally studying partially labeled distributions. In our framework, the classifiers can be trained either in a fully supervised or in a semi-supervised manner. With the proposed theoretical grounding, the studied classifiers not only achieve high classification accuracy on the task of interest; their successful domain-invariant training also mechanistically lowers the Wasserstein distance between the two domains, thereby inducing additional desirable side effects such as improving domain privacy or exposing the joint domain subspace, which can subsequently be mined for new insights.
Our proposed model is analyzed on several domain invariance benchmark tasks (MNIST vs. SVHN, and the multi-domain PACS dataset). In all cases, our framework simultaneously achieves high classification accuracy on all domains and high invariance of the representation. Lastly, we inspect the learned invariant representation using UMAP embeddings [35] and ‘explainable AI’ (cf. [47, 50]). This inspection highlights, across the whole dataset, the overall qualities of the learned invariant representation. We also explore whether building such an invariant representation requires looking at the intersection of features of the two domains [32]. Interestingly, we find that recognizing and exploiting domain-specific features remains in fact an integral part of the neural network strategy to arrive at the desired invariant representation.
2 Domain Invariance and Optimal Transport
Domain invariance can be described as the property of a representation to be indistinguishable with regard to its original domain. In particular, the multiple data distributions projected in representation space should look the same (i.e. have low distance). A recently popular way of measuring the distance between two distributions is the Wasserstein distance. It can be interpreted as the cost of transporting the probability mass of one distribution to the other when following the optimal transport plan, and it can be formally defined as follows:
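Definition 1.
Let $\mu$ and $\nu$ be two arbitrary probability distributions defined over two measurable metric spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $c: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_{+}$ be a cost function. Their Wasserstein distance is:
$$W_c(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\pi(x, y) \tag{1}$$
with $\Pi(\mu, \nu) = \{\pi \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) : p_{\mathcal{X}}\#\pi = \mu,\ p_{\mathcal{Y}}\#\pi = \nu\}$, where $p_{\mathcal{X}}\#\pi$ and $p_{\mathcal{Y}}\#\pi$ are the pushforwards of $\pi$ by the projections $p_{\mathcal{X}}(x, y) = x$ and $p_{\mathcal{Y}}(x, y) = y$. This can be loosely interpreted as $\Pi(\mu, \nu)$ being the set of joint distributions that have marginals $\mu$ and $\nu$.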
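As a small illustration of the definition, the following sketch computes the 1-Wasserstein distance in the simplest possible setting: two equally sized, uniformly weighted samples on the real line with cost $|x - y|$. In that special case the optimal transport plan simply matches sorted samples, so no optimization is needed. The function name `wasserstein_1d` and the toy data are assumptions made for this example and are not part of the method described here, which operates on high-dimensional joint distributions.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equally sized, uniformly weighted 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes the same number of samples on both sides")
    # In 1-D with cost |x - y|, an optimal plan matches the i-th smallest x
    # with the i-th smallest y, so the infimum reduces to a sorted matching.
    matched = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in matched) / len(xs)


# Toy data (hypothetical): the target sample is the source sample shifted by 5.
source = [0.0, 1.0, 3.0]
target = [8.0, 5.0, 6.0]

shift_cost = wasserstein_1d(source, target)  # every unit of mass travels a distance of 5
self_cost = wasserstein_1d(source, source)   # a distribution is at distance 0 from itself
```

The sorted-matching shortcut only holds in one dimension; in representation space one has to solve the transport problem itself or resort to an approximation.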
Hence, we measure the invariance of representations by how low the Wasserstein distance is between the distributions associated with the two domains. In comparison to other common alternatives such as the Kullback-Leibler divergence, the Jensen-Shannon divergence, or the total variation distance, the Wasserstein distance has the advantage of taking into account the metric of the representation space (via the cost function) instead of looking at pure distributional overlap, and this typically leads to better ML models [39, 2]. Computing the Wasserstein distance with Eq. (1) is expensive. Luckily, if we use the metric of our space as the cost function, such as the Euclidean distance, we can derive a dual formulation of the 1-Wasserstein distance between two distributions $\mu$ and $\nu$ as follows:
$$W_1(\mu, \nu) = \sup_{\|f\|_{L} \leq 1} \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{x \sim \nu}[f(x)] \tag{2}$$
where the supremum is taken over all 1-Lipschitz functions $f$.
This formulation replaces an explicit computation of a transport plan by a function to estimate, a task particularly appropriate for neural networks. Recently, several methods have used this approach to learn distributions [39], specifically in the context of Generative Adversarial Networks [2, 52]. The main constraint lies in the necessity for the function, which we will call the discriminator, to be 1-Lipschitz. A few approaches have been proposed to tackle this problem, such as weight clipping [2], gradient penalty [20], and, more recently, spectral normalization [36]. It is, however, important to note that in practice the set of possible discriminators will be a subset of the 1-Lipschitz continuous functions.
3 Learning Domain Invariant Representations
This section focuses on the core question of our work, which is how to build a domain invariant representation. We start by some notation: We denote by our representation or feature space, by our label or target space, both assumed to be compact measurable spaces, and by the set of probability distributions defined on their product space. Consider we have two true probability distributions . When necessary, we add a subscript to these distributions to specify their support. We now want to align these distributions, i.e. we would like our classifier to be invariant to them. Similarly to previous works, we consider the Wasserstein distance of samples embedded in feature space, and we will therefore want to learn a mapping
which can take the form of a feature extraction function followed by a classifier.
We note that an invariant representation is by itself very simple to obtain if we do not subject it to the requirement that it should be inductive and solve an actual machine learning task (e.g. classification). Our strategy for extracting the desired representation consists of starting with the Wasserstein distance between the two true distributions and upper-bounding it in multiple stages, in such a way that terms of the desired machine learning objective progressively appear, e.g. a classification function that can be optimized for maximum accuracy. From a theoretical perspective, this allows us to demonstrate how flavors of the data, such as the availability of both supervised and unsupervised data, can be iteratively and flexibly worked into the upper bound.
The individual steps are presented in Sections 3.1–3.3. (See Figure 2 for an overview of our novel theoretical framework.) Our novel approach draws some inspiration from [9] and is based on measure theory, in order to formalize partially labeled distributions and therefore our problem of aligning multiple joint distributions.
3.1 Incorporating Semi-Supervised Data
Computation of the true Wasserstein distance would require knowledge of the true distributions of the two domains. In practice, we only have a finite sample from these distributions, and the quality with which the Wasserstein distance can be approximated largely depends on the amount of labeled data available. (For high-dimensional tasks, the necessary amount of labels would be overwhelming.) However, it is common that unlabeled data is available in much larger quantity than labeled data. We therefore consider a semi-labeled scenario in which only a fraction of the samples are actually drawn from the true joint distribution (i.e. labeled). The remaining samples are unlabeled and obtained from the marginal over features, and by using a function that outputs labels given features, say, a DNN classifier, we obtain an estimate of the true joint distribution. The distribution we observe samples from is therefore a mixture of the true joint distribution and this estimated joint distribution. The same construction applies to the other domain. Note, however, that the two mixtures need not be constructed identically; in particular, the proportion of labeled samples in one domain may differ from that in the other. We start by simply applying the triangle inequality in order to make the distance between the partially labeled distributions appear.
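The construction of the observed, partially labeled sample can be sketched as follows. This is only an illustration under stated assumptions: the names `observed_sample`, `labeled_fraction`, and `threshold_classifier`, as well as the toy one-dimensional features, are hypothetical and chosen for the example. Labeled points keep their true label, while unlabeled points receive the label produced by the classifier.

```python
def observed_sample(features, true_labels, is_labeled, predict):
    """Build the observed feature-label pairs of one partially labeled domain."""
    pairs = []
    for x, y, labeled in zip(features, true_labels, is_labeled):
        # Labeled points keep their true label; unlabeled ones get the classifier's label.
        pairs.append((x, y if labeled else predict(x)))
    return pairs


def labeled_fraction(is_labeled):
    """Empirical proportion of labeled samples in a domain."""
    return sum(is_labeled) / len(is_labeled)


def threshold_classifier(x):
    """Stand-in for a trained classifier (hypothetical)."""
    return 1 if x > 0.5 else 0


features = [0.1, 0.4, 0.6, 0.9]
true_labels = [0, 1, 1, 1]               # only consulted where is_labeled is True
is_labeled = [True, False, False, True]

pairs = observed_sample(features, true_labels, is_labeled, threshold_classifier)
alpha = labeled_fraction(is_labeled)
```

In this toy run the second point is unlabeled and the classifier assigns it a label that differs from the true one, which is exactly the discrepancy between observed and true joint distributions that the subsequent bounds account for.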
Proposition 1.
Let the cost function be the metric on the space , we then have
(3) 
Proof.
The 1-Wasserstein distance being a metric, we can apply the triangle inequality twice to obtain the proposition. ∎
We observe that the upper bound consists of multiple terms: first, the distances between the observed and true joint distributions of each of the two domains; second, the distance between the two observed distributions. We analyze these quantities in the following subsections.
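The three-term structure of this bound can be checked numerically on toy data. The sketch below uses one-dimensional samples as stand-ins for the true and observed distributions of two domains (the variable names and values are assumptions made for this example), and a sorted-matching routine `w1` that is valid only for equally sized one-dimensional samples.

```python
def w1(xs, ys):
    """1-Wasserstein distance between equally sized, uniformly weighted 1-D samples."""
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical 1-D stand-ins for the true and observed distributions of two domains.
true_s = [0.0, 1.0, 2.0, 3.0]
obs_s = [0.0, 1.0, 2.5, 3.5]
obs_t = [1.0, 2.0, 3.5, 4.5]
true_t = [1.0, 2.0, 3.0, 4.0]

# Left-hand side: distance between the two true distributions.
lhs = w1(true_s, true_t)

# Right-hand side: observed-vs-true terms for each domain plus the
# distance between the two observed distributions.
rhs = w1(true_s, obs_s) + w1(obs_s, obs_t) + w1(obs_t, true_t)
```

Here the right-hand side is strictly larger than the left-hand side, which illustrates that the bound is generally not tight; it becomes informative once each of the three terms is driven down during training.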
3.2 Incorporating a Cross-Entropy Classifier
Let us first analyse the distance between the observed and true joint distributions. We consider here the case of one domain (the other is analogous). Intuitively, since the true distribution is a component of the observed mixture, computing the distance between the two partly amounts to computing the distance of the true distribution to itself. Indeed, decomposing the Wasserstein distance with respect to the mixture yields an upper bound on our original distance, obtained using Jensen's inequality on the supremum (the proof can be found in the Supplement). The result is summarized in the following lemma.
Lemma 1.
Let be an arbitrary probability distribution, we then have
(4) 
Therefore, since the Wasserstein distance is symmetric and the distance of a distribution to itself is zero, we can upper-bound our distance by the distance between the true distribution and the estimated one. This new formulation, the distance between the true and estimated distributions, is intuitively a classification task. We therefore upper-bound this distance using tools more common in the machine learning community, namely the Kullback-Leibler divergence, which is equivalent to the cross-entropy loss when the labels are deterministic. The following lemma details this upper bound (for the proof, see the Supplement).
Lemma 2.
Assuming that and admit densities, we then obtain
(5) 
where the bound involves the Kullback-Leibler divergence and the diameter of the space, i.e. the largest distance obtainable in that space. We have now obtained an upper bound on our original distance between the observed and true distributions that leads to an easily computable classification task, thereby linking theory to practical applications and common usage.
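The equivalence between the Kullback-Leibler divergence and the cross-entropy loss for deterministic labels can be verified directly. The sketch below is a minimal illustration with hypothetical probability vectors: for a one-hot target the entropy of the target is zero, so both quantities coincide, whereas for a soft target the cross-entropy exceeds the divergence by exactly that entropy.

```python
import math


def cross_entropy(p, q):
    """Cross-entropy H(p, q) between a target p and a prediction q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy of p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


predicted = [0.2, 0.7, 0.1]   # e.g. a softmax output (hypothetical values)
one_hot = [0.0, 1.0, 0.0]     # deterministic label
soft = [0.1, 0.8, 0.1]        # non-deterministic label, for contrast

kl_hard = kl_divergence(one_hot, predicted)
ce_hard = cross_entropy(one_hot, predicted)

kl_soft = kl_divergence(soft, predicted)
ce_soft = cross_entropy(soft, predicted)
```

With deterministic labels, minimizing the usual cross-entropy loss therefore minimizes the divergence term itself rather than a shifted version of it.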
3.3 Adding a Class-Dependent Domain Critic
Concerning the distance between the partially labeled distributions, we can compute its empirical version, as the data we have is sampled from those distributions. In the practical case where we want to maximize domain invariance irrespective of the qualitative value of the representations for machine learning, we may use this distance. However, there is an inherent issue with using the common dual of the 1-Wasserstein distance (Eq. 2), which is in fact a specific case of the Wasserstein dual. This specific case arises when the cost function used is the metric of our space, commonly the Euclidean distance, which is then computed jointly over representations and labels. In practice, the norm of our representation vectors is usually larger than that of the labels (represented as one-hot vectors, as they are categorical variables) by several orders of magnitude. Optimal transport, being oblivious to the objectives of machine learning, will therefore focus on minimizing the distance between samples irrespective of their labels. We therefore explore a refinement of this term that is specific to the classification setting, and we obtain Lemma 3.
Lemma 3.
Assuming that admit densities, we then obtain
(6) 
where we have a class-weighted, joint-distribution, unbalanced Wasserstein distance defined as
This Wasserstein distance can be seen as a distance between joint distributions for a fixed value of the label $y$. The lemma therefore indicates that we can decompose a Wasserstein distance on a joint distribution into a sum, over class labels $y$, of unbalanced Wasserstein distances on fixed-$y$ joint distributions. Practically, our discriminator network has as many outputs as there are classes, and its output for a given sample is weighted by the conditional probability vector (a one-hot vector for known labels, the softmax output of the classifier otherwise). A more straightforward interpretation is that we can minimize the Wasserstein distance between matching classes of each domain independently. This general idea has been explored in the past in [31], but with a different approach and under stronger assumptions (e.g. a uniform class distribution for each domain).
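The weighting mechanism described above can be sketched in a few lines. This is a hedged illustration rather than the exact estimator of Lemma 3: the function names `weighted_critic_score` and `critic_gap`, the two-class setting, and all numerical values are assumptions made for the example. Each sample carries one critic output per class, and these outputs are combined using the conditional class probabilities.

```python
def weighted_critic_score(critic_outputs, class_probs):
    """Combine per-class critic outputs using the conditional class probabilities."""
    return sum(d * p for d, p in zip(critic_outputs, class_probs))


def critic_gap(batch_a, batch_b):
    """Difference between the mean weighted critic scores of two domains.

    Each batch is a list of (critic_outputs, class_probs) pairs.
    """
    mean_a = sum(weighted_critic_score(d, p) for d, p in batch_a) / len(batch_a)
    mean_b = sum(weighted_critic_score(d, p) for d, p in batch_b) / len(batch_b)
    return mean_a - mean_b


# A labeled sample: the one-hot vector selects the critic output of its class.
hard_score = weighted_critic_score([0.5, -1.0, 2.0], [0.0, 1.0, 0.0])

# An unlabeled sample: the classifier's softmax output mixes the critic outputs.
soft_score = weighted_critic_score([0.5, -1.0, 2.0], [0.2, 0.5, 0.3])

# Hypothetical two-class batches from two domains.
domain_a = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 2.0], [0.0, 1.0])]
domain_b = [([0.0, 1.0], [0.5, 0.5])]
gap = critic_gap(domain_a, domain_b)
```

In a trained model the critic would be optimized to maximize such a gap under a Lipschitz constraint, while the feature extractor would be optimized to reduce it.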
3.4 Formulation of the Objective Function
By applying all the previous results to the original proposition, that is, by applying the triangle inequality and upper-bounding the expanded terms, we obtain an upper bound on the Wasserstein distance composed exclusively of terms that we can meaningfully incorporate in a practical machine learning model:
Theorem 1.
Given the cost function used is the metric on the product space :
(7) 
with the subscripts being the marginal distributions.
This result informs us on how to build an objective function that solves the classification task and also minimizes the Wasserstein distance between the two true distributions. We now derive it step by step. Let us first consider that our data consists of the samples and labels from both domains, sampled from the distributions we referred to as “observed”. Formally, we have one finite set of feature-label pairs per domain, each with its own number of samples. From the right-hand side of Lemma 2, we can then straightforwardly derive the two classification terms, one per domain, defined in their empirical version, assuming deterministic labels, as follows
(8) 
Moreover, to define the domain critic term, we need to introduce additional notation: the discriminator, or domain critic, which corresponds to the function over which the supremum is taken in Eq. (2). However, as per Lemma 3, the discriminator does not have a single output, but one for each class.
Finally, we use a single classifier shared by the two domains, together with a feature extractor. We can therefore formulate the optimization problem as follows, including the Lipschitzness constraint on the discriminator, which is necessary to estimate the Wasserstein distance.
(9) 
The Lipschitzness constraint is practically enforced by using one of the regularization techniques mentioned at the end of Section 2. Supplementary regularization terms, such as EntMin [19], Virtual Adversarial Training [37], and Virtual Mixup [34] can be added to the objective to take further advantage of the unlabeled examples. A visual representation of our model is given in Figure 3.
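To make the Lipschitzness constraint concrete, the following sketch estimates the spectral norm (largest singular value) of a weight matrix by power iteration and rescales the matrix so that the corresponding linear map is 1-Lipschitz with respect to the Euclidean norm. It is a simplified illustration of the idea behind spectral normalization, not the training-time procedure of [36]; the function names, the number of iterations, and the toy matrix are assumptions made for the example.

```python
import math


def spectral_norm(matrix, n_iter=50):
    """Estimate the largest singular value of a matrix by power iteration."""
    rows, cols = len(matrix), len(matrix[0])
    # Fixed start vector; assumed not orthogonal to the top right singular vector.
    v = [1.0 / math.sqrt(cols)] * cols
    u = [0.0] * rows
    for _ in range(n_iter):
        # u <- W v, normalized
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(a * a for a in u))
        u = [a / norm_u for a in u]
        # v <- W^T u, normalized
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_v = math.sqrt(sum(a * a for a in v))
        v = [a / norm_v for a in v]
    # sigma = u^T W v
    wv = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return sum(u[i] * wv[i] for i in range(rows))


def spectrally_normalize(matrix, n_iter=50):
    """Rescale a weight matrix so that its spectral norm is (approximately) one."""
    sigma = spectral_norm(matrix, n_iter)
    return [[w / sigma for w in row] for row in matrix]


weights = [[3.0, 0.0], [0.0, 1.0]]   # toy layer with singular values 3 and 1
sigma = spectral_norm(weights)
normalized = spectrally_normalize(weights)
sigma_after = spectral_norm(normalized)
```

Since a composition of 1-Lipschitz maps is 1-Lipschitz, normalizing every linear layer of an MLP whose activations are themselves 1-Lipschitz (e.g. ReLU) keeps the whole critic within the admissible function class.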
3.5 Generalization Bounds
Interestingly, the Wasserstein distance between the true distributions of the two domains (which we have upper-bounded in Theorem 1) can also be related to the risks of the classifier on the two domains, where the risk of a classifier on a domain is its error on that domain. We here develop a result using the joint Wasserstein distance, similar to a previous result obtained by [46] on the distance between marginals.
Theorem 2.
Let be two compact measurable metric spaces whose product space has dimension and two joint distributions associated to two domains. Let the transport cost function associated to the optimal transport problem be , the Euclidean distance as the metric on and a symmetric
Lipschitz loss function. Then for any
and there exists some constant depending on such that for any and with probability at least for all Lipschitz the following holds:
(10)
In other words, the Wasserstein distance between the two domains upper-bounds the prediction performance gap between the two domains. In practice, we can therefore expect the optimization of the objective in Eq. (9) not only to reduce the Wasserstein distance between domains (as we have shown in the previous sections), but also to produce a more uniform classification accuracy on the two domains and therefore a higher minimum accuracy.
4 Experiments
In our experiments, we test the ability of our model to achieve invariant representations and stable classification performance. We first consider two popular datasets that are often used for benchmarking: MNIST [27] and SVHN [41], each of them constituting one domain. The first is a common black-and-white digit recognition dataset composed of 60000 training examples, and the second is another popular dataset of 73257 examples in which digits are colored, have more complex appearances, and are harder to predict. We then consider the more recent and more complex PACS image recognition dataset [29], which consists of 10000 examples, with 4 domains (Photo, Art, Cartoon, Sketch) and 7 classes.
4.1 Data and Models
In this section, we briefly describe the MNIST vs. SVHN and the PACS scenarios, and the models trained on this data. More details are provided in the Supplement. In the MNIST vs. SVHN scenario, we only provide the labels of 3000 randomly selected examples from each domain. MNIST images are brought to the SVHN format by scaling and setting each RGB component to the MNIST grayscale value. For the experiments in Table 1, the feature extractor is a ResNet18 [22], and in Table 2, it is the ConvLarge model from [37]. Both models take as input images of size . We use small random translations of 2 pixels as well as color jittering as data augmentation. In the PACS dataset scenario, we randomly sample 500 labels from each domain. The classes and domains are imbalanced, i.e. they contain different numbers of examples. The images are resized to , and a data augmentation pipeline based on RandAugment [10] is applied. We use the ResNet18 architecture. On this dataset, we test domain invariance in a ‘one vs. rest’ setting.
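The conversion of MNIST images to the SVHN format can be sketched as follows. The sketch assumes a target resolution of 32 by 32 pixels (the native SVHN image size) and nearest-neighbour scaling; the interpolation method and the function name `to_rgb` are assumptions made for this example, since the text only states that images are scaled and that each RGB component is set to the grayscale value.

```python
def to_rgb(gray, size=32):
    """Upscale a square grayscale image (nearest neighbour) and copy it to three channels."""
    n = len(gray)
    out = []
    for i in range(size):
        src_row = gray[i * n // size]  # nearest source row for target row i
        # Each output pixel is an (R, G, B) tuple holding the same gray value.
        out.append([(src_row[j * n // size],) * 3 for j in range(size)])
    return out


# A synthetic 28 x 28 grayscale image (hypothetical pixel values in 0..255).
gray = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]
rgb = to_rgb(gray)
```

Replicating the gray value across channels leaves MNIST digits achromatic, so any color sensitivity of the network has to stem from the SVHN domain.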
In both cases, the classifier is a simple 2-layer MLP, and the discriminator a 3-layer MLP with spectrally normalized weights [36]. (On the multi-domain PACS, we use a discriminator for each domain, computed in a one-vs-rest manner.) The weight (hyperparameter) of each loss term is set to one, except that of the discriminator, which is set on the interval . Unless mentioned otherwise, the networks are trained for 20 to 50 epochs using the Adam [25] optimizer.
4.2 Results and Analysis
As a first experiment, we study the effect of the domain critic on the accuracy of the model and on the Wasserstein distance between the two domains. Table 1 shows the results for the standard (joint) domain critic derived from the triangle inequality in Section 3.1, our improved class-dependent domain critic (Section 3.3), a more basic critic based on marginals (such as proposed in [52]), and the absence of a domain critic. We report in particular the Wasserstein distance between the two domains’ joint distributions and the minimum classification accuracy over the two domains: two properties that our domain-invariant network encourages (Theorem 1 and Theorem 2, respectively). For this experiment, we do not use any additional losses/regularizers and simply optimize the classification and discriminator terms.
| Domain Critic | MNIST acc. | SVHN acc. | Avg acc. | Min acc. | W dist. |
|---|---|---|---|---|---|
| Class-Dependent (Ours) | 96.52 | 84.59 | 90.56 | 84.59 | 5.50 |
| Joint distributions (Ours) | 96.54 | 78.6 | 87.57 | 78.6 | 10.98 |
| Marginal distributions [52] | 96.31 | 79 | 87.66 | 79 | 12.39 |
| None | 91.68 | 69.82 | 80.75 | 69.82 | 16.51 |
Results corroborate our theory. In particular, we observe that the Wasserstein distance strongly decreases when a domain critic is added, and that the minimal accuracy over the two domains increases. Here, in particular, the use of a class-dependent domain critic to put more focus on the labels leads to the highest accuracy in our benchmark. Surprisingly, we achieve an even lower Wasserstein distance when using the upper bound of Lemma 3. We conjecture that having multiple classifier-weighted discriminators eases the joint optimization of the classifier and discriminator losses. The absence of a domain critic (last row in the table) deactivates the use of unsupervised data and leads to significantly lower performance.
A common alternative for leveraging unsupervised data is the typical semi-supervised learning formulation. Because semi-supervised learning has shown powerful results on data with manifold structure (e.g. [45, 28]), we add to our benchmark a semi-supervised baseline that consists of a combination of conditional entropy minimization (EntMin) [19] and virtual adversarial training (VAT) [37], two powerful techniques that have shown strong empirical performance on numerous tasks. Results are shown in Table 2.
| Model | MNIST acc. | SVHN acc. | Avg acc. | Min acc. | W dist. |
|---|---|---|---|---|---|
| Semi-Sup (VAT + EntMin) on MNIST | 99.14 | - | - | - | - |
| Semi-Sup (VAT + EntMin) on SVHN | - | 94.79 | - | - | - |
| Supervised on Both | 98.76 | 87.33 | 93.05 | 87.33 | 3.12 |
| Semi-Sup (VAT + EntMin) on Both | 99.29 | 91.86 | 95.58 | 91.86 | 3.11 |
| Ours (VAT + EntMin) | 99.26 | 92.75 | 96.01 | 92.75 | 0.78 |
| Ours (VAT + EntMin) + Fine Tuning | 99.09 | 94.33 | 96.71 | 94.33 | 1.97 |
We observe that semi-supervised learning on both domains, complemented by the VAT and EntMin regularization techniques, leads to a strong baseline. In particular, it achieves the highest performance on MNIST, but lower SVHN performance, leading to lower aggregated accuracy. Our domain-invariant approach, combined with the same regularization techniques, improves over the baselines by achieving higher aggregated accuracy and producing a much more invariant representation. A final supervised fine-tuning step on our learned model further improves the accuracy, but at the expense of less domain invariance.
Finally, Table 3 shows prediction performance on the more complex PACS dataset. We test our model on this data in a one-vs-rest setting, so that the model must learn to be invariant between one domain and the three remaining domains.
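| Model | Art vs. R: Acc. | Art vs. R: W dist. | Cartoon vs. R: Acc. | Cartoon vs. R: W dist. | Photo vs. R: Acc. | Photo vs. R: W dist. | Sketch vs. R: Acc. | Sketch vs. R: W dist. | Overall Acc.: Avg. | Overall Acc.: Min. | Overall W dist.: Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours | 77.15 | 3.87 | 88.61 | 5.06 | 83.41 | 6.98 | 71.52 | 5.03 | 80.18 | 71.52 | 5.24 |
| Marginal | 84.08 | 6.77 | 87.07 | 7.48 | 78.26 | 9.18 | 64.44 | 9.93 | 78.46 | 64.44 | 8.34 |
| None | 84.03 | 6.93 | 85.62 | 9.64 | 78.74 | 10.97 | 60.45 | 10.14 | 77.21 | 60.45 | 9.42 |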
Again, we find that our model produces the best overall minimum and average accuracy, and that it also has the lowest Wasserstein distance in each scenario. We found that a trade-off may exist between Art and the other domains. Although our method performs worse than its competitors on this domain, we observe that it leads not only to a higher average accuracy, but also to domain accuracies more concentrated around the mean. We omit the joint discriminator here, as we have observed that, all other things being equal, it performs worse in terms of accuracy and Wasserstein distance than any other listed method.
Lastly, we would like to stress that the problem of domain invariance has received considerably less attention in the context of deep neural networks than the tasks of domain adaptation and domain generalization. Our quantitative results as well as the multiple baseline results aim to provide useful reference values for future work on domain invariance.
4.3 Visual Insights on Learned Representations
While results in the section above have verified quantitatively the performance of the model, we would like to also present some qualitative insights.
As a first experiment, we visualize how the representation becomes more task-specific and less domain-dependent throughout training. For this, we take samples from both domains, join them, and perform a UMAP [35] analysis. Plots before and after training are shown in Figure 4 (left). We observe that the two domains are strongly separated initially, but under the influence of domain-invariant training, they collapse to the same regions in representation space. The learned representation also better resolves the different classes (here roughly given by the cluster structure).
As a second experiment, we present SVHN-like synthetic examples to the network and vary the digit and the colors. We then compute for each prediction its response obtained using the LRP explanation method [3] (details in the Supplement). Examples and model responses are shown in Figure 4 (right). Although we would expect style and color to play a marginal role in representation space (our objective has enforced invariance between the colored SVHN and the black-and-white handwritten MNIST domains), recognizing such style and color variations remains an integral part of the neural network prediction strategy. We indeed observe that the model precisely adapts to the input digit by providing individualized response maps of corresponding colors. This strategy is therefore instrumental in the process of building the domain-invariant representation.
5 Conclusion
Real-world data is often heterogeneous, subject to subpopulation shifts, or coming from multiple domains. In this work, we have for the first time studied the problem of learning domain-invariant representations as measured by the Wasserstein distance. We have created a theoretical framework for semi-supervised domain invariance and have contributed several upper bounds on the Wasserstein distance of joint distributions that link domain invariance to practical learning objectives. In our benchmark experiments, we find that optimizing the resulting objective leads to high prediction accuracy on both domains while simultaneously achieving high domain invariance, which we also observe qualitatively on feature maps. We have observed that, contrary to what has been speculated, domain adversarial training can use domain-specific features to build representations.
Our work allows for several future extensions, the main one being the generalization of our theory to more than two domains, particularly through the use of Wasserstein barycenters, which may provide an appropriate framework. Moreover, it would be interesting to obtain a theoretical connection to other representation learning methods, in particular contrastive learning, that may be integrated into our method. Finally, an extension of our theory to domain generalization could enable more applications and increase our understanding of domain generalization itself.
Overall, our work on domain invariance provides new theoretical insights as well as competitive quantitative results for a number of scenarios and baselines. We believe it thereby constitutes a useful first basis for further research on domain-invariant ML models and applications thereof.
Acknowledgements
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea Government (No. 2017000451, Development of BCI based Brain and Cognitive Computing Technology for Recognizing User’s Intentions using Deep Learning and No. 2019000079, Artificial Intelligence Graduate School Program, Korea University), by the German Ministry for Education and Research (BMBF) under Grants 01IS14013AE, 01GQ1115, 01GQ0850, 01IS18025A and 01IS18037A; by the German Research Foundation (DFG) under Grant Math+, EXC 2046/1, Project ID 390685689; and by Japan Society for the Promotion of Science under KAKENHI Grant Number 17H00764, and 19H04071.
References
 [1] D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.
 [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
 [3] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 07 2015.

 [4] V. Balloli. A pytorch implementation of nfnets and adaptive gradient clipping, 2021.
 [5] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
 [6] G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. Advances in neural information processing systems, 24:2178–2186, 2011.
 [7] F. Bolley, A. Guillin, and C. Villani. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137(3-4):541–593, 2007.
 [8] L. Cheng and S. J. Pan. Semi-supervised domain adaptation on manifolds. IEEE transactions on neural networks and learning systems, 25(12):2240–2249, 2014.
 [9] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In NIPS 2017, 2017.

 [10] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
 [11] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations research, 58(3):595–612, 2010.
 [12] Q. Dou, D. C. Castro, K. Kamnitsas, and B. Glocker. Domain generalization via model-agnostic learning of semantic features. arXiv preprint arXiv:1910.13580, 2019.
 [13] J. Duchi, T. Hashimoto, and H. Namkoong. Distributionally robust losses for latent covariate mixtures. arXiv preprint arXiv:2007.13982, 2020.
 [14] J. Duchi and H. Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
 [15] R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier, L. Gautheron, N. T. Gayraud, H. Janati, A. Rakotomamonjy, I. Redko, A. Rolet, A. Schutz, V. Seguy, D. J. Sutherland, R. Tavenard, A. Tong, and T. Vayer. POT: Python Optimal Transport. Journal of Machine Learning Research, 22(78):1–8, 2021.
 [16] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
 [17] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International statistical review, 70(3):419–435, 2002.
 [18] K. Goel, A. Gu, Y. Li, and C. Ré. Model patching: Closing the subgroup performance gap with data augmentation. arXiv preprint arXiv:2008.06775, 2020.
 [19] Y. Grandvalet, Y. Bengio, et al. Semi-supervised learning by entropy minimization. In CAP, pages 281–296, 2005.
 [20] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
 [21] G. He, X. Liu, F. Fan, and J. You. Classification-aware semi-supervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 964–965, 2020.
 [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [23] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama. Learning discrete representations via information maximizing self-augmented training. In International Conference on Machine Learning, pages 1558–1567. PMLR, 2017.
 [24] T. Kim and C. Kim. Attract, perturb, and explore: Learning a feature alignment network for semi-supervised domain adaptation. In European Conference on Computer Vision, pages 591–607. Springer, 2020.
 [25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
 [26] P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. arXiv preprint arXiv:2012.07421, 2020.
 [27] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 [28] C. Li, K. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291, 2017.
 [29] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pages 5542–5550, 2017.
 [30] H. Li, S. J. Pan, S. Wang, and A. C. Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2018.
 [31] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 624–639, 2018.
 [32] H. Liu, M. Long, J. Wang, and M. Jordan. Transferable adversarial training: A general approach to adapting deep classifiers. In International Conference on Machine Learning, pages 4013–4022. PMLR, 2019.
 [33] D. Lopez-Paz, J. M. Hernández-Lobato, and B. Schölkopf. Semi-supervised domain adaptation with nonparametric copulas. arXiv preprint arXiv:1301.0142, 2013.
 [34] X. Mao, Y. Ma, Z. Yang, Y. Chen, and Q. Li. Virtual mixup training for unsupervised domain adaptation. arXiv preprint arXiv:1905.04215, 2019.

 [35] L. McInnes, J. Healy, N. Saul, and L. Großberger. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29):861, 2018.
 [36] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
 [37] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
 [38] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller. Layer-wise relevance propagation: An overview. In Explainable AI, volume 11700 of Lecture Notes in Computer Science, pages 193–209. Springer, 2019.

 [39] G. Montavon, K.-R. Müller, and M. Cuturi. Wasserstein training of restricted Boltzmann machines. In NIPS, pages 3711–3719, 2016.
 [40] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18. PMLR, 2013.
 [41] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
 [42] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

 [43] G. Peyré, M. Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
 [44] H. Rahimian and S. Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659, 2019.
 [45] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In NIPS, pages 3546–3554, 2015.
 [46] I. Redko, A. Habrard, and M. Sebban. Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 737–753. Springer, 2017.
 [47] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In KDD, pages 1135–1144. ACM, 2016.
 [48] S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
 [49] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko. Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8050–8058, 2019.
 [50] W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, and K.-R. Müller. Explaining deep neural networks and beyond: A review of methods and applications. Proceedings of the IEEE, 109(3):247–278, 2021.
 [51] H. Sharifi-Noghabi, H. Asghari, N. Mehrasa, and M. Ester. Domain generalization via semi-supervised meta learning. arXiv preprint arXiv:2009.12658, 2020.
 [52] J. Shen, Y. Qu, W. Zhang, and Y. Yu. Wasserstein distance guided representation learning for domain adaptation. arXiv preprint arXiv:1707.01217, 2017.
 [53] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
 [54] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(5), 2007.
 [55] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 [56] R. Wightman. PyTorch image models, 2019.
 [57] F. Zhou, Z. Jiang, C. Shui, B. Wang, and B. Chaib-draa. Domain generalization with optimal transport and metric learning. arXiv preprint arXiv:2007.10573, 2020.
 [58] K. Zhou, Y. Yang, Y. Qiao, and T. Xiang. Domain generalization with MixStyle. arXiv preprint arXiv:2104.02008, 2021.
Appendix A Proofs of Main Results
In the following, we give the proofs of the main theoretical results presented in the paper. Each formal proof is followed by a paragraph detailing the steps taken to reach the final result.
Lemma 1.
Let be an arbitrary probability distribution, we then have
(11) 
Proof.
(12)  
(13)  
(14)  
(15)  
(16)  
(17) 
We start the proof in eq. 12 by stating the dual formulation of the 1-Wasserstein distance (with the cost function being the metric of our space). Because and and are finite, we can decompose the integral with respect to a mixture of measures into a mixture of the integrals with respect to each elementary measure, as we do in eq. 13. In eq. 14 we also decompose the second integral into parts weighted by the same and (which sum to one), group each part with the corresponding integral on the other measure, and factorize by their weights. In eq. 15, we upper bound the supremum of a sum by the sum of the suprema. Pulling the constant weights out of the suprema in eq. 16 yields a sum of two Wasserstein distances in their dual form, which gives eq. 17 and completes the proof. ∎
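The chain of steps described above can be sketched as a worked derivation. The symbols are assumptions introduced only for illustration, since the original notation is not recoverable here: $\mu_1, \mu_2$ denote the two components of the mixture, $\alpha \in [0, 1]$ its weight, $\nu$ the arbitrary distribution, and the supremum runs over 1-Lipschitz functions $f$.

```latex
\begin{align}
W_1\big(\alpha\mu_1 + (1-\alpha)\mu_2,\, \nu\big)
 &= \sup_{\|f\|_L \le 1} \int f \, d\big(\alpha\mu_1 + (1-\alpha)\mu_2\big) - \int f \, d\nu \\
 &= \sup_{\|f\|_L \le 1} \alpha \int f \, d\mu_1 + (1-\alpha) \int f \, d\mu_2 - \int f \, d\nu \\
 &= \sup_{\|f\|_L \le 1} \alpha \Big( \int f \, d\mu_1 - \int f \, d\nu \Big)
    + (1-\alpha) \Big( \int f \, d\mu_2 - \int f \, d\nu \Big) \\
 &\le \sup_{\|f\|_L \le 1} \alpha \Big( \int f \, d\mu_1 - \int f \, d\nu \Big)
    + \sup_{\|f\|_L \le 1} (1-\alpha) \Big( \int f \, d\mu_2 - \int f \, d\nu \Big) \\
 &= \alpha \sup_{\|f\|_L \le 1} \Big( \int f \, d\mu_1 - \int f \, d\nu \Big)
    + (1-\alpha) \sup_{\|f\|_L \le 1} \Big( \int f \, d\mu_2 - \int f \, d\nu \Big) \\
 &= \alpha \, W_1(\mu_1, \nu) + (1-\alpha) \, W_1(\mu_2, \nu)
\end{align}
```

The six lines are meant to mirror the six steps of the explanation; the exact form of the original equations may differ.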
Lemma 2.
Assuming that and admit densities, we then obtain
(18) 
Proof.
In order to prove this result, we rely on an upper bound of the Wasserstein distance by the Kullback-Leibler divergence, obtained by combining two standard bounds. We therefore state this result here together with a short proof.
Lemma.
From [17]. Let be two probability distributions on a compact measurable space , we then have
Proof.
Combine the bound of the Wasserstein distance by the total variation distance (Theorem 4 of [17]) with the bound of the total variation distance by the Kullback-Leibler divergence given by Pinsker's inequality. ∎
With that result, we show that under our conditions, the Kullback-Leibler divergence on marginals is in fact the expected KL divergence (on the marginal distribution ) on the conditional distribution. Let be the densities of respectively .
(21)  
(22)  
(23)  
(24)  
(25) 
The first line (eq. 21) is the definition of the Kullback-Leibler divergence with densities. Eq. 22 applies Fubini's theorem, which allows us to decompose the double integral, together with the decomposition of a joint probability into the product of a marginal and a conditional. Finally, as is a discrete space, the integral becomes a sum of probabilities, and is pulled out of the sum. Eq. 23 replaces the integral by the expectation, by definition. In eq. 24, since by definition , those terms are removed from the fraction. Eq. 25 is again an application of the definition of the KL divergence.
By combining the equality obtained in eq. 25 and the cited lemma, we complete the proof.
∎
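The two standard bounds combined in the cited lemma are W ≤ diam · TV (Theorem 4 of [17]) and Pinsker's inequality TV ≤ sqrt(KL / 2). The following is a minimal numerical sanity check of that chain on a small discrete example, not part of the proof. The support points, the two distributions, and all function names are illustrative assumptions, and the one-dimensional closed form of the 1-Wasserstein distance (the integral of the absolute difference of the cumulative distribution functions) is used for simplicity.

```python
import math

def total_variation(p, q):
    # TV distance between two discrete distributions on the same support.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    # KL(p || q) in nats; terms with p_i = 0 contribute nothing.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def wasserstein_1d(points, p, q):
    # On the real line, W1 equals the integral of |CDF_p - CDF_q|.
    # `points` must be sorted in increasing order.
    total, cdf_gap = 0.0, 0.0
    for i in range(len(points) - 1):
        cdf_gap += p[i] - q[i]
        total += abs(cdf_gap) * (points[i + 1] - points[i])
    return total

# Hypothetical example: two distributions on four points of [0, 1].
points = [0.0, 0.25, 0.5, 1.0]
p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]

diameter = points[-1] - points[0]
w1 = wasserstein_1d(points, p, q)
bound_tv = diameter * total_variation(p, q)
bound_kl = diameter * math.sqrt(kl_divergence(p, q) / 2)
print(w1, bound_tv, bound_kl)
```

On this example the three quantities are ordered as the chain of inequalities predicts, with the KL-based bound being the loosest.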
Lemma 3.
Assuming that admit densities, we then obtain
(26) 
where we have a class-weighted joint distribution unbalanced Wasserstein distance defined as
Proof.
Let us denote the densities of respectively as , we then have
(27)  
(28)  
(29)  
(30)  
(31)  
(32) 
We start this proof in eq. 27 by stating the dual formulation of the 1-Wasserstein distance, written using densities. We decompose the integral using Fubini's theorem in eq. 28 and apply a property of the supremum () in eq. 29. We then separate the integrals in eq. 30 and replace them by expectations in eq. 31, which completes the proof with our formulation in eq. 32. ∎
Theorem 1.
Given the cost function used is the metric on the product space :
(33) 
with the subscripts being the marginal distributions.
Proof.
(34)  
(35)  
(36)  
(37)  
(38)  
(39) 
We obtain eq. 34 from Proposition 1. Applying Lemma 1 twice, to the first and third terms, we obtain eq. 35. As the Wasserstein distance is a metric whenever its cost function is itself a metric (we use the Euclidean distance), it takes the value 0 if and only if the two distributions are identical. This happens twice here, in terms 1 and 4, which we can therefore remove to obtain eq. 36. Finally, in eq. 37, we apply Lemma 3 to the second term and Lemma 2 to the first and third terms, which completes the proof. ∎
Theorem 2.
Let be two compact measurable metric spaces whose product space has dimension and two joint distributions associated to two domains. Let the transport cost function associated to the optimal transport problem be , the Euclidean distance as the metric on and a symmetric Lipschitz loss function. Then for any and there exists some constant depending on such that for any and with probability at least for all Lipschitz the following holds:
(40) 
Proof.
Let be the optimal coupling.
(41)  
(42)  
(43)  
(44)  
(45)  
(46)  
(47)  
(48)  
(49)  
(50)  
(51)  
(52) 
We start the proof by replacing definitions by their explicit formulations, in eq. 41 and again in eq. 42. In eq. 43 we replace a difference of integrals by the integral with respect to the difference of measures, which leads to a form related to the dual of the Wasserstein distance. A consequence of the Kantorovich-Rubinstein duality theorem is that eq. 43 and eq. 44 are equal for the optimal coupling. The next equation, eq. 45, is a property of the absolute value, namely . We then add two terms summing to zero in eq. 46 and apply the same property of the absolute value again to obtain eq. 47. The absolute value of the difference of a Lipschitz function at two points can be upper bounded by the distance between those two points, up to the Lipschitz constant. We apply that operation to two terms to obtain eq. 48 and again to the first term to obtain eq. 49. Using the Cauchy-Schwarz inequality, the sum of two Euclidean distances can be upper bounded by the Euclidean distance between the concatenated vectors as in eq. 50, which corresponds to the cost function used in our discriminator throughout the main paper. The next steps correspond to pulling the constants out of the integral (eq. 51), and replacing the explicit formulation of the 1-Wasserstein distance by its notation.
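The Cauchy-Schwarz step can be checked numerically. The sketch below is only an illustration with hypothetical variable names (`z` for representation vectors, `y` for label vectors). It makes explicit that, for nonnegative a and b, Cauchy-Schwarz gives a + b ≤ sqrt(2) · sqrt(a² + b²), so the sum of the two distances is bounded by the distance between the concatenated vectors up to a constant factor of at most sqrt(2).

```python
import math
import random

def euclidean(u, v):
    # Euclidean distance between two vectors given as lists of floats.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def split_and_joint_costs(z1, z2, y1, y2):
    # Sum of the two separate distances versus the distance between
    # the concatenated vectors (z, y).
    split_cost = euclidean(z1, z2) + euclidean(y1, y2)
    joint_cost = euclidean(z1 + y1, z2 + y2)  # list concatenation
    return split_cost, joint_cost

rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    z1 = [rng.gauss(0, 1) for _ in range(4)]
    z2 = [rng.gauss(0, 1) for _ in range(4)]
    y1 = [rng.gauss(0, 1) for _ in range(3)]
    y2 = [rng.gauss(0, 1) for _ in range(3)]
    split_cost, joint_cost = split_and_joint_costs(z1, z2, y1, y2)
    worst_ratio = max(worst_ratio, split_cost / joint_cost)

# The ratio never exceeds sqrt(2); it is attained when both distances are equal.
tight_split, tight_joint = split_and_joint_costs([0.0], [1.0], [0.0], [1.0])
print(worst_ratio, tight_split / tight_joint)
```

Such a constant factor does not change the structure of the bound, since it can be absorbed together with the Lipschitz constants.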
We have completed the main part of the proof. The next step is to apply a classical concentration bound which allows us to replace the distributions in the Wasserstein distance by their empirical counterparts. We now reintroduce the concentration bound we use:
Theorem.
From [7], Theorem 1.1. Let be a probability measure on satisfying a inequality and let be its associated empirical measure. Then, for any and , there exists some constant , depending on , and some square-exponential moment of , such that for any and ,
Now, by using the triangle inequality, we can introduce the Wasserstein distances between the empirical and true distributions,
(54) 
and by applying the concentration bound twice, we obtain our final result and complete the proof:
(55) 
∎
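The triangle-inequality step used at the end of this proof can be written out as follows. The notation is an assumption made for illustration: $\mu_1, \mu_2$ stand for the two true joint distributions and $\hat{\mu}_1, \hat{\mu}_2$ for their empirical counterparts.

```latex
\begin{equation}
W_1(\mu_1, \mu_2) \;\le\; W_1(\mu_1, \hat{\mu}_1) + W_1(\hat{\mu}_1, \hat{\mu}_2) + W_1(\hat{\mu}_2, \mu_2)
\end{equation}
```

The first and last terms compare a distribution with its own empirical measure, which is the situation covered by the concentration bound of [7]; applying that bound once per domain is what yields the high-probability statement.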
Appendix B Details of the Experiments of Section 4.2
Hardware & Computation
All experiments but PACS were conducted on a single RTX 2060 Super. The PACS experiments were conducted on a single TITAN RTX. All experiments were conducted on a desktop computer. Most experiments lasted between 1 and 3 hours and none more than 6 hours.
Implementation
Our model is implemented using PyTorch [42] and torchvision as the framework, timm [56] and nfnets-pytorch [4] for access to normalizer-free networks, and POT (Python Optimal Transport) [15] to compute the Wasserstein distances reported in the tables. Our code is available at https://github.com/leoandeol/ldir and in the supplemental materials. It contains everything necessary to reproduce the experiments, with the exception of the data itself, which can be easily obtained from the official sources.
Results of Table 1
We use a cross-entropy classification loss (equivalent to the Kullback-Leibler divergence in the case of a deterministic labeling rule) with a weight of 1 and no regularization losses, while the domain critic has a weight of 0.1. We use the standard ResNet-18 [22] and simple data augmentation (small translations and color jittering, as provided with PyTorch) for all experiments.
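The equivalence between cross-entropy and Kullback-Leibler divergence under a deterministic labeling rule follows from CE(p, q) = KL(p || q) + H(p), where the entropy H(p) vanishes for one-hot targets. A minimal sketch in plain Python, with made-up probability vectors chosen only for illustration:

```python
import math

def cross_entropy(target, pred):
    # CE(target, pred) = -sum_i target_i * log(pred_i)
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def kl_divergence(target, pred):
    # KL(target || pred) = sum_i target_i * log(target_i / pred_i)
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)

def entropy(target):
    # Shannon entropy H(target) in nats.
    return -sum(t * math.log(t) for t in target if t > 0)

pred = [0.2, 0.7, 0.1]      # hypothetical softmax output
one_hot = [0.0, 1.0, 0.0]   # deterministic label
soft = [0.1, 0.8, 0.1]      # non-deterministic label, for contrast

gap_one_hot = cross_entropy(one_hot, pred) - kl_divergence(one_hot, pred)
gap_soft = cross_entropy(soft, pred) - kl_divergence(soft, pred)
print(gap_one_hot, gap_soft, entropy(soft))
```

For the one-hot target the two losses coincide, while for the soft target they differ exactly by the entropy of the target, which does not depend on the model.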
Results of Table 2
We use VAT with and . The VAT, conditional entropy, and classification losses all have a weight of 1, while the domain critic has a weight of 0.1. We do not use Virtual MixUp. Fine-tuning (for classification) consists of one additional epoch, without the discriminator loss, and with the loss heuristically reweighted according to the error of each domain (0.25 for MNIST so as not to forget it, 0.75 for SVHN to improve it). We use the large ConvNet for SVHN from the VAT paper [37] and simple data augmentation (small translations and color jittering, as provided with PyTorch) for all experiments.
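The weighting described above can be summarized in a few lines. This is a hedged sketch rather than the released implementation: the dictionary keys and function names are hypothetical, the individual loss values are assumed to be computed elsewhere, and the sign convention and alternating updates of the adversarial critic are deliberately left out.

```python
# Weights stated in the text for the Table 2 experiments.
LOSS_WEIGHTS = {
    "classification": 1.0,
    "vat": 1.0,
    "conditional_entropy": 1.0,
    "domain_critic": 0.1,
}

# Per-domain weights of the classification loss during the fine-tuning epoch.
FINE_TUNE_DOMAIN_WEIGHTS = {"mnist": 0.25, "svhn": 0.75}

def total_loss(terms, weights=LOSS_WEIGHTS):
    # Weighted sum of the loss terms; `terms` maps a term name to its value.
    return sum(weights[name] * value for name, value in terms.items())

def fine_tune_loss(classification_loss_per_domain, weights=FINE_TUNE_DOMAIN_WEIGHTS):
    # Fine-tuning uses only the classification loss (no critic term),
    # reweighted per domain.
    return sum(weights[d] * v for d, v in classification_loss_per_domain.items())

example = total_loss(
    {"classification": 1.0, "vat": 2.0, "conditional_entropy": 3.0, "domain_critic": 10.0}
)
example_ft = fine_tune_loss({"mnist": 0.4, "svhn": 1.2})
print(example, example_ft)
```

The same functions accept tensors instead of floats, since they only rely on multiplication and addition.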
Results of Table 3
Appendix C Details of the Analyses of Section 4.3
In this section, we give details of the implementation of UMAP and LRP analyses performed on the learned representations and classifiers. We also provide further UMAP visualizations.
C.1 Application of UMAP
Implementation
To compute the two-dimensional UMAP embeddings of the learned representations, we used the official implementation of UMAP [35], with the Euclidean distance and a number of neighbors of 75. We kept all other parameters as default.
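A minimal sketch of this configuration, assuming the official implementation refers to the `umap-learn` package, whose `umap.UMAP` class exposes `n_components`, `n_neighbors`, and `metric`. The function name and the shape of `features` (one row per sample) are illustrative assumptions.

```python
# Settings stated in the text; every other UMAP parameter keeps its default.
UMAP_PARAMS = {"n_components": 2, "n_neighbors": 75, "metric": "euclidean"}

def embed_representations(features):
    # `features`: array-like of shape (n_samples, n_features) holding the
    # learned representations. The import is local so that this module can
    # be loaded even when umap-learn is not installed.
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(**UMAP_PARAMS)
    return reducer.fit_transform(features)  # shape (n_samples, 2)
```

With 75 neighbors, the input needs to contain comfortably more than 75 samples for the neighborhood graph to be meaningful.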
Observations
In addition to the embeddings shown in the main paper, we show in Figure 5 further embeddings corresponding to the networks presented in Table 2 of the main paper. Beyond the obvious improvement over untrained features, we observe that the supervised and semi-supervised approaches in (b) and (c) extract class structure (visible as distinct clusters) but tend not to produce strongly domain-invariant representations. Our method (d) incorporates domain alignment in the objective, and we observe a much stronger overlap between the red and blue points representing the MNIST and SVHN domains respectively. However, we have shown in the paper that a single epoch of fine-tuning can lead to higher classification accuracy at the cost of a higher Wasserstein distance, which we observe here in (e) as a worsened domain alignment compared to (d).
C.2 Application of LRP
We apply the LRP method to attribute the predictions of the considered ConvLarge architecture [37] to the input pixels and color channels. The ConvLarge architecture is composed of an alternation of convolutions, batch-normalizations, leaky ReLUs, and max-pooling functions. Before applying LRP, we adopt the strategy described in [38] of fusing batch-normalization layers into the parameters of the adjacent convolution layers, so that we arrive at a simplified but functionally equivalent neural network which consists only of max-pooling layers and convolution-leaky-ReLU layers. For the max-pooling layers () we adopt the commonly used winner-take-all redistribution [3], i.e. we redistribute the relevance to the neuron in the pool that has the maximum activation. For the convolution-leaky-ReLU layers, we extend the LRP rule defined in [38] to account for negative input and output activations. Writing such layers as:
where the convolution is written as a generic weighted sum, where indicates that we sum over all input nodes plus a bias ( with ), and where is the leaky ReLU parameter, we define the rule: