Robust Domain Adaptation: Representations, Weights and Inductive Bias

06/24/2020 ∙ by Victor Bouvier, et al.

Unsupervised Domain Adaptation (UDA) has attracted a lot of attention in the last ten years. The emergence of Domain Invariant Representations (IR) has drastically improved the transferability of representations from a labelled source domain to a new and unlabelled target domain. However, a potential pitfall of this approach, namely the presence of label shift, has been brought to light. Some works address this issue with a relaxed version of domain invariance obtained by weighting samples, a strategy often referred to as Importance Sampling. From our point of view, the theoretical aspects of how Importance Sampling and Invariant Representations interact in UDA have not been studied in depth. In the present work, we present a bound of the target risk which incorporates both weights and invariant representations. Our theoretical analysis highlights the role of inductive bias in aligning distributions across domains. We illustrate it on standard benchmarks by proposing a new learning procedure for UDA. We observe empirically that weak inductive bias makes adaptation more robust. The elaboration of stronger inductive bias is a promising direction for new UDA algorithms.




1 Introduction

Deploying machine learning models in the real world often requires the ability to generalize to unseen samples, i.e. samples significantly different from those seen during learning. Despite impressive performances on a variety of tasks, deep learning models do not always meet these requirements [3, 14]. For this reason, out-of-distribution generalization is recognized as a major challenge for the reliability of machine learning systems [1, 2]. Domain Adaptation (DA) [30, 28] is a well-studied approach to bridge the gap between train and test distributions. In DA, we refer to train and test distributions as source and target, noted $p_S(x, y)$ and $p_T(x, y)$ respectively, where $x$ are inputs and $y$ are labels. The objective of DA can be defined as learning a good classifier on a poorly sampled target domain by leveraging samples from a source domain. Unsupervised Domain Adaptation (UDA) assumes that only unlabelled data from the target domain is available during training. In this context, a natural assumption, named Covariate shift [33, 19], consists in assuming that the mapping from the inputs to the labels is conserved across domains, i.e. $p_T(y|x) = p_S(y|x)$. Under this assumption, Importance Sampling (IS) performs adaptation by weighting the contribution of sample $x$ in the loss by $w(x) = p_T(x) / p_S(x)$ [30]. Although IS seems natural when unlabelled data from the target domain is available, the covariate shift assumption is not sufficient to guarantee successful adaptation [5]. Moreover, for high dimensional data [12] such as texts or images, the shift between $p_S(x)$ and $p_T(x)$ results from non-overlapping supports, leading to unbounded weights [20].
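As a minimal sketch of Importance Sampling under covariate shift, the snippet below reweights per-sample source losses by the standard density ratio w(x) = p_T(x) / p_S(x). The function name and the assumption that both densities are known at each sample are illustrative only; in practice the ratio has to be estimated.

```python
from typing import Sequence


def importance_weighted_risk(
    losses: Sequence[float],
    source_density: Sequence[float],
    target_density: Sequence[float],
) -> float:
    """Average source losses, each reweighted by p_T(x) / p_S(x).

    Assumes (for illustration) that both densities are known at every
    source sample; real pipelines estimate the ratio instead.
    """
    if not (len(losses) == len(source_density) == len(target_density)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for loss, p_s, p_t in zip(losses, source_density, target_density):
        if p_s <= 0.0:
            # Non-overlapping supports: the weight is unbounded here.
            raise ValueError("source density must be positive at every sample")
        total += (p_t / p_s) * loss
    return total / len(losses)


# Two source samples; the second one is more likely under the target.
risk = importance_weighted_risk([1.0, 2.0], [0.5, 0.5], [0.25, 0.75])
```

The explicit error on a vanishing source density mirrors the failure mode described above: when supports do not overlap, the weights are unbounded and the estimate is undefined.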

In this particular context, representations can help to reconcile non-overlapping supports [5]. This seminal idea, and the corresponding theoretical bound of the target risk from [5], has led to a wide variety of deep learning approaches [13, 23, 24] which aim to learn a so-called domain invariant representation:

$$p_S(z) = p_T(z) \qquad (1)$$

where $z := \varphi(x)$ for a given non-linear representation $\varphi$. These approaches assume that the transferability of representations, defined as the combined error of an ideal classifier, remains low during learning. Unfortunately, this quantity involves target labels and is thus intractable. More importantly, looking for strictly invariant representations, $p_S(z) = p_T(z)$, hurts the transferability of representations [20, 22, 36, 40]. In particular, there is a fundamental trade-off between learning invariant representations and preserving transferability in the presence of label shift ($p_S(y) \neq p_T(y)$) [40]. To mitigate this trade-off, some recent works suggest relaxing domain invariance by weighting samples [8, 36, 37, 9]. This strategy differs from (1) by aligning a weighted source distribution with the target distribution:

$$w(z)\, p_S(z) = p_T(z) \qquad (2)$$

for some weights $w(z)$. We now have two tools, $w$ and $\varphi$, which need to be calibrated to obtain distribution alignment. Which one should be promoted? How do weights preserve good transferability of representations?

While most prior works focus on the invariance error for achieving adaptation [13, 23, 24], this paper focuses on the transferability of representations. We show that weights make it possible to design an interpretable generalization bound where transferability and invariance errors are uncoupled. In addition, we discuss the role of inductive design for both the classifier and the weights in addressing the lack of labelled data in the target domain. Our contributions are the following:

  1. We introduce a new bound of the target risk which incorporates both weights and domain invariant representations. Two new terms are introduced. The first is an invariance error which promotes alignment between a weighted source distribution of representations and the target distribution of representations. The second, named transferability error, involves labelling functions from both source and target domains.

  2. We highlight the role of inductive bias for approximating the transferability error. First, we establish connections between our bound and popular approaches for UDA which use target predicted labels during adaptation, in particular Conditional Domain Adaptation [24] and Minimal Entropy [15]. Second, we show that the inductive design of weights has an impact on representation invariance.

  3. We derive a new learning procedure for UDA. The particularity of this procedure is to only minimize the transferability error while controlling representation invariance with weights. Since the transferability error involves target labels, we use the predicted labels during learning.

  4. We provide an empirical illustration of our framework on two DA benchmarks (Digits and Office31 datasets). We stress-test our learning scheme by strongly modifying the label distribution in the source domain. While methods based on invariant representations deteriorate considerably in this context, our procedure remains robust.

2 Preliminaries

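We introduce the source distribution, i.e. data where the model is trained with supervision, and the target distribution, i.e. data where the model is tested or applied. Formally, for two random variables $(X, Y)$ on a given space $\mathcal{X} \times \mathcal{Y}$, we introduce two distributions: the source distribution $p_S(x, y)$ and the target distribution $p_T(x, y)$. Here, labels are one-hot encoded, i.e. $y \in [0, 1]^C$ such that $\sum_c y_c = 1$, where $C$ is the number of classes. The distributional shift situation is then characterized by $p_S(x, y) \neq p_T(x, y)$ [30]. In the rest of the paper, we use the index notation $S$ and $T$ to differentiate source and target terms. We define the hypothesis class $\mathcal{H}$ as a subset of functions from $\mathcal{X}$ to $\mathcal{Y}$ which is the composition of a representation class $\Phi$ and a classifier class $\mathcal{G}$, i.e. $\mathcal{H} = \mathcal{G} \circ \Phi$. For ease of reading, given a classifier $g \in \mathcal{G}$ and a representation $\varphi \in \Phi$, we note $g\varphi := g \circ \varphi$. Furthermore, in the definition $z := \varphi(x)$, we refer indifferently to $z$, $\varphi$ and $Z := \varphi(X)$ as the representation. For two given $h$ and $h' \in \mathcal{H}$ and the $L^2$ loss $\ell(y, y') = \|y - y'\|^2$, the risk in domain $D \in \{S, T\}$ is noted:

$$\varepsilon_D(h, h') := \mathbb{E}_D\left[\ell(h(x), h'(x))\right]$$

and $\varepsilon_D(h) := \mathbb{E}_D\left[\ell(h(x), y)\right]$. In the seminal works [5, 27], an upper bound of the target risk when using a representation has been derived: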

Bound 1 (Ben David et al.)

Let and , :


This generalization bound ensures that the target risk is bounded by the sum of the source risk, the disagreement risk between two classifiers from representations, and a third term which quantifies the ability to perform well in both domains from representations. The latter is referred to as the adaptability error of representations. It is intractable in practice since it involves labels from the target distribution. Promoting distribution invariance of representations, i.e. bringing the source distribution of representations close to the target one, results in a low disagreement risk. More precisely:


where is the so-called set of discriminators or critics which verifies where is the function [13]. Since the domain invariance term is expressed as a supremal value on classifiers, it is suitable for domain adversarial learning with critic functions. Conversely, the adaptability error is expressed as an infimal value. This "sup / inf" duality induces an unexpected trade-off when learning domain invariant representations:

Proposition 1 (Invariance hurts adaptability [20, 40])

Let be a representation which is a richer feature extractor than : . Then,


As a result of proposition 1, the benefit of representation invariance must be higher than the loss of adaptability, which is impossible to guarantee in practice.

3 Theory

To overcome the limitation raised in proposition 1, we expose a new bound of the target risk which embeds a new trade-off between invariance and transferability (3.1). We show this new bound remains inconsistent with the presence of label shift (3.2) and we expose the role of weights to address this problem (3.3).

3.1 A new trade-off between Invariance and Transferability

3.1.1 Core assumptions.

Our strategy is to express both the transferability and invariance as a supremum using an Integral Probability Metric (IPM) computed on a critic class. We thus introduce a class of critics suitable for our analysis. Let from and from with the following properties:

  • (A1) and are symmetric (i.e. ) and convex.

  • (A2) and .

  • (A3) , . (Footnote 1: see Appendix 0.A.1 for more details on this assumption.)

  • (A4) For two distributions and on , if and only if:


The assumption (A1) ensures that, rather than comparing two given and , it is enough to study the error of some from . This brings a supremum on back to a supremum on . The assumption (A2), combined with (A1), ensures that an error can be expressed as a critic function such that . The assumption (A3) ensures that is rich enough to contain labelling functions from representations. Here, is a vector of probabilities on classes: . The last assumption (A4) ensures that the introduced IPM is a distance. Classical tools verify these assumptions, e.g. continuous functions; here is the Maximum Mean Discrepancy [16] and one can reasonably believe that and are continuous.

3.1.2 Invariance and transferability as IPMs.

We introduce here two important tools that will guide our analysis:

  • , named invariance error, that aims at capturing the difference between source and target distribution of representations, corresponding to:

  • , named transferability error, that captures whether the coupling between representations and labels shifts across domains. For that, we use our class of functions and we compute the IPM of , where and is the scalar product (footnote 2: the scalar product emerges from the choice of the loss), between the source and the target domains:
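The two error terms can be illustrated with a small sketch. It assumes the generic IPM form, i.e. the largest gap in expectation between the two domains over a critic class, here approximated by a finite, hand-picked list of critics. All function names, the toy critics, and the toy data are hypothetical and only meant to show how invariance and transferability can disagree.

```python
from typing import Callable, Sequence, Tuple

Labels = Tuple[float, ...]


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def invariance_gap(
    source_z: Sequence[float],
    target_z: Sequence[float],
    critics: Sequence[Callable[[float], float]],
) -> float:
    """Largest absolute gap between target and source critic means."""
    return max(
        abs(_mean([d(z) for z in target_z]) - _mean([d(z) for z in source_z]))
        for d in critics
    )


def transferability_gap(
    source_pairs: Sequence[Tuple[float, Labels]],
    target_pairs: Sequence[Tuple[float, Labels]],
    label_critics: Sequence[Callable[[float], Labels]],
) -> float:
    """Same gap, computed on the scalar product between labels y and f(z)."""

    def coupled_mean(pairs, f):
        return _mean([sum(a * b for a, b in zip(y, f(z))) for z, y in pairs])

    return max(
        abs(coupled_mean(target_pairs, f) - coupled_mean(source_pairs, f))
        for f in label_critics
    )


# Same representations in both domains, but labels are swapped in the target.
source_pairs = [(0.0, (1.0, 0.0)), (1.0, (0.0, 1.0))]
target_pairs = [(0.0, (0.0, 1.0)), (1.0, (1.0, 0.0))]
critics = [lambda z: z, lambda z: z * z]
label_critics = [lambda z: (1.0 - z, z)]

inv = invariance_gap(
    [z for z, _ in source_pairs], [z for z, _ in target_pairs], critics
)
tsf = transferability_gap(source_pairs, target_pairs, label_critics)
```

In this toy setting `inv` is 0.0 while `tsf` is 1.0: the representations are perfectly aligned across domains, yet the coupling between representations and labels has shifted, which is exactly what the second term is meant to detect.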


3.1.3 A new bound of the target risk.

Using and , we can provide a new bound of the target risk:

Bound 2

and :


The proof is in Appendix 0.A.1. In contrast with bound 1 (Eq. 6), here two IPMs are involved to compare representations ( and ). A new term, , reflects the level of noise when fitting labels from representations. All the trade-off between invariance and transferability is embodied in this term:

Proposition 2

Let a representation which is a richer feature extractor than : and . is more domain invariant than :


where and . Proof in 0.A.2.

Bounding the target risk using IPMs has two advantages. First, it allows better control of the invariance / transferability trade-off since . This is paid at the cost of (see Proposition 6 in Appendix 0.A.1). Second, is source free and indicates whether there is enough information in representations for learning the task in the target domain in the first place. This means that is only dedicated to controlling whether aligned representations have the same labels across domains. To illustrate the interest of our new transferability error, we provide a visualisation of representations (Fig. 1) when trained to minimize the adaptability error from bound 1 and the transferability error from bound 2.

(a) adaptability in bound 1 from [5]. Inside class clusters, source and target representations are separated.
(b) transferability from bound 2 (contribution). Inside class clusters, source and target representations are not distinguishable.
Figure 1: t-SNE [26] visualisation of representations when trained to minimize (a) the adaptability error from [5], (b) the transferability error introduced in the present work. The task used is A→W of the Office31 dataset. Labels in the target domain are used during learning in this specific experiment. For both visualisations of representations, we observe well-separated clusters associated with the label classification task. Inside those clusters, we observe a separation between source and target representations for the adaptability error (a). That means that representations embed domain information and thus are not invariant. On the contrary, source and target representations overlap much more inside each cluster with the transferability error (b), illustrating that this new term does not conflict with invariance.

3.2 A detailed view on the property of tightness

An interesting property of the bound, named tightness, is the case when and simultaneously. The condition of tightness of the bound provides rich information on the properties of representations.

Proposition 3

if and only if .

The proof is given in Appendix 0.A.3. Two important points should be noted:

  1. ensures that , using (A4). Similarly, leads to . Since implies , does not bring more substantial information about representations distribution than . More precisely, one can show that noting that when for .

  2. Second, the equality also implies that $p_S(y) = p_T(y)$. Therefore, in the context of label shift (when $p_S(y) \neq p_T(y)$), the transferability error cannot be null. This is a big hurdle since it is clearly established that most real world UDA tasks exhibit some label shift. This bound highlights the fact that representation invariance alone cannot address UDA in complex settings such as label shift.

3.3 Reconciling Weights and Invariant Representations.

Based on the interesting observations from [20, 40] and following the line of study that proposed to relax invariance using weights [9, 38, 37, 36], we propose to adapt the bound by incorporating weights. More precisely, we study the effect of modifying the source distribution to a weighted source distribution where is a positive function which verifies . By replacing by (distribution referred as ) in bound 2, we obtain a new bound of the target risk incorporating both weights and representations:

Bound 3

such that :

where and .

As for the previous bound 2, the property of tightness, i.e. when invariance and transferability are null simultaneously, leads to interesting observations:

Proposition 4

if and only if and . The proof is given in Appendix 0.A.4.

This proposition means that the nullity of invariance error, i.e. , implies distribution alignment, i.e. . This is of strong interest since both representations and weights are involved for achieving domain invariance. The nullity of the transferability error, i.e. , implies that labelling functions, , are conserved across domains. Furthermore, the equality interestingly resonates with a recent line of work called Invariant Risk Minimization (IRM) [2]. Incorporating weights in the bound thus brings two benefits:

  1. First, it resolves the inconsistency issue of invariant representations in the presence of label shift, as mentioned in section 3. Indeed, tightness no longer conflicts with label shift.

  2. Second, the invariance and transferability errors have two distinct roles: the former promotes domain invariance of representations while the latter controls whether aligned representations share the same labels across domains.
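The effect of weights under label shift can be checked on a toy discrete case. The sketch below assumes representations that cluster perfectly by class, so the distribution of representations coincides with the label distribution, and it uses the density ratio as weights. The numbers and function names are illustrative assumptions.

```python
from typing import List, Sequence


def density_ratio_weights(
    p_source: Sequence[float], p_target: Sequence[float]
) -> List[float]:
    """Weights w(z) = p_T(z) / p_S(z) on a finite representation space."""
    if any(p <= 0.0 for p in p_source):
        raise ValueError("every cell needs positive source mass")
    return [p_t / p_s for p_s, p_t in zip(p_source, p_target)]


def weighted_source(
    p_source: Sequence[float], weights: Sequence[float]
) -> List[float]:
    """Weighted source distribution w(z) * p_S(z)."""
    return [w * p_s for w, p_s in zip(weights, p_source)]


# Two class clusters; the source over-represents the first class.
p_source = [0.8, 0.2]
p_target = [0.5, 0.5]

weights = density_ratio_weights(p_source, p_target)
aligned = weighted_source(p_source, weights)
normalisation = sum(aligned)  # expectation of w under the source
```

Without weights the two marginals differ, so strict invariance would force the representation to mix classes. With weights, the weighted source matches the target cell by cell and the weights average to one under the source, so alignment no longer conflicts with label shift.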

4 The role of Inductive Bias

Inductive Bias refers to the set of assumptions which improves generalization of a model trained on an empirical distribution. For instance, a specific neural network architecture or a well-suited regularization are prototypes of inductive biases. First, we provide a theoretical analysis of the role of inductive bias for addressing the lack of labelled data in the target domain (4.1), which is the most challenging part of Unsupervised Domain Adaptation. Second, we describe the effect of weights to induce invariance property on representations (4.2).

4.1 Inductive design of a classifier

4.1.1 General Formulation.

Our strategy consists in approximating target labels error through a classifier . We refer to the latter as the inductive design of the classifier. Our proposition follows the intuitive idea which states that the best source classifier, , is not necessarily the best target classifier i.e. . For instance, a well-suited regularization in the target domain, noted may improve performance, i.e. setting may lead to . We formalize this idea through the following definition:

Definition 1 (Inductive design of a classifier)

We say that there is an inductive design of a classifier at level if for any representations , noting , we can determine such that:


We say the inductive design is strong when and weak when .

In this definition, does not depend on , which is a strong assumption, and embodies the strength of the inductive design. The closer is to 1, the less improvement we can expect using the inductive classifier . We now study the impact of the inductive design of a classifier in our previous bound 3. Thus, we introduce the approximated transferability error:


leading to a bound of the target risk where transferability is target labels free:

Bound 4 (Inductive Bias and Guarantee)

Let and such that and a strong inductive classifier and then:


The proof is given in Appendix 0.A.5. Here, the target labels are only involved in which reflects the level of noise when fitting labels from representations. Therefore, transferability is now free of target labels. This is an important result since the difficulty of UDA lies in the lack of labelled data in the target domain. It is also interesting to note that the weaker the inductive bias (), the higher the bound and vice versa.

4.1.2 The role of predicted labels.

Predicted labels play an important role in UDA. In light of the inductive classifier, this means that is simply set as . This is a weak inductive design (), thus the theoretical guarantee from bound 4 is not applicable. However, there is empirical evidence that predicted labels help in UDA [15, 24]. It suggests that this inductive design may find some strength in the finite sample regime. A better understanding of this phenomenon is left for future work (See Appendix 0.B). In the rest of the paper, we study this weak inductive bias by establishing connections between and popular approaches of the literature.

Connections with Conditional Domain Adaptation Network.

CDAN [24] aims to align the joint distribution of representations and estimated labels across domains. It does so by exposing the tensor product between estimated labels and representations to a discriminator. It leads to substantial empirical improvements compared to Domain Adversarial Neural Networks (DANN) [13]. We can observe that it is a similar objective to in the particular case where .
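The tensor product conditioning described for CDAN can be sketched as a flattened outer product between the predicted class probabilities and the representation vector. This is a minimal illustration with a hypothetical function name, not the reference implementation from [24].

```python
from typing import List, Sequence


def outer_product_features(
    predictions: Sequence[float], representation: Sequence[float]
) -> List[float]:
    """Flattened outer product, one block of the representation per class.

    The result has len(predictions) * len(representation) entries and is
    what a conditional domain discriminator would receive as input.
    """
    return [p * z for p in predictions for z in representation]


# Two classes, two-dimensional representation.
features = outer_product_features([0.25, 0.75], [2.0, 4.0])
```

The input size of the discriminator grows with the number of classes times the representation dimension, which is why large label spaces are costly for this conditioning.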

Connections with Minimal Entropy.

MinEnt [15] states that an adapted classifier is confident in its predictions on target samples. It suggests the regularization: where is the entropy. If labels are smooth enough (i.e. there exists such that ), MinEnt is a lower bound of transferability: for some and is the cross-entropy between and on (see Appendix 0.A.6).
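The entropy regularizer can be sketched as the mean Shannon entropy of the predicted class probabilities on target samples. The regularization formula is not reproduced in this text, so this standard form, the natural logarithm, and the function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def prediction_entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def min_entropy_penalty(batch: Sequence[Sequence[float]]) -> float:
    """Mean prediction entropy over a batch of target predictions."""
    return sum(prediction_entropy(p) for p in batch) / len(batch)


confident = [1.0, 0.0]   # entropy is zero
uncertain = [0.5, 0.5]   # entropy is log(2)
penalty = min_entropy_penalty([confident, uncertain])
```

Minimizing this penalty pushes target predictions towards confident, low-entropy outputs.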

4.2 Inductive design of weights

While the bounds introduced in the present work involve weights in the representation space, there is an abundant literature that builds weights in order to relax the domain invariance of representations [8, 36, 37, 9]. We study the effect of inductive design of weights on representations. To conduct the analysis, we consider there is a non-linear transformation from to and we assume that weights are computed in , i.e. is a function of . We refer to this as inductive design of weights. For instance, in the particular case where , weights are designed as [9] where . In [24], entropy conditioning is introduced by designing weights where is the predictions entropy. The inductive design of weights imposes invariance property on representations:

Proposition 5 (Inductive design of and invariance)

Let such that and . Let such that and we note . Then, if and only if:


while both and . The proof is given in Appendix 0.A.7.

This proposition shows that the design of has a significant impact on the property of domain invariance of representations. Furthermore, both labelling functions are conserved. In the rest of the paper we focus on weighting in the representation space which consists in:


Since it does not leverage any transformations of representations , we refer to this approach as a weak inductive design of weights. It is worth noting this inductive design controls naturally the invariance error i.e. .

5 Towards Robust Domain Adaptation

In this section, we expose a new learning procedure which relies on weak inductive design of both weights and the classifier. This procedure focuses on the transferability error since the inductive design of weights naturally controls the invariance error. Our learning procedure is then a bi-level optimization problem, named RUDA (Robust UDA):


where is a trade-off parameter. Two discriminators are involved here. The former is a domain discriminator trained to map 1 for source representations and 0 for target representations by minimizing a domain adversarial loss:


where and are respectively the parameters of and , and and are respectively the number of samples in the source and target domains. Setting weights ensures that is minimal (See Appendix 0.C.2). The latter, noted , maps representations to the label space in order to obtain a proxy of the transferability error expressed as a domain adversarial objective (See Appendix 0.C.1):


where and are respectively parameters of and . Furthermore, we use the cross-entropy loss in the source weighted domain for learning :


Finally, the optimization is then expressed as follows:


Losses are minimized by stochastic gradient descent (SGD) where in practice and are gradient reversal layers [13]. The trade-off parameter is pushed from 0 to 1 during training. We provide an implementation in PyTorch [29] based on [24]. The algorithm procedure is described in Appendix 0.C.5.
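The link between the domain discriminator and the weights can be sketched as follows. For a discriminator trained to output 1 on source and 0 on target representations, the optimal output is d(z) = p_S(z) / (p_S(z) + p_T(z)), so (1 - d(z)) / d(z) recovers the density ratio p_T(z) / p_S(z). The exact weight expression used by the authors is given in their appendix and is not reproduced in this text, so this form, the clipping constant and the function names are assumptions.

```python
from typing import List, Sequence


def optimal_discriminator(p_source: float, p_target: float) -> float:
    """Bayes-optimal output for the '1 = source, 0 = target' convention."""
    return p_source / (p_source + p_target)


def weights_from_discriminator(
    d_outputs: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Density-ratio weights (1 - d) / d from sigmoid discriminator outputs.

    Outputs are clipped to [eps, 1 - eps] to keep the weights finite; the
    clipping value is an illustrative choice.
    """
    weights = []
    for d in d_outputs:
        d = min(max(d, eps), 1.0 - eps)
        weights.append((1.0 - d) / d)
    return weights


# A region with source mass 0.8 and target mass 0.5 gets weight 0.5 / 0.8.
d_star = optimal_discriminator(0.8, 0.5)
w = weights_from_discriminator([d_star, 0.5])
```

A discriminator that cannot tell the domains apart (output 0.5) yields a weight of one, i.e. no reweighting.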

6 Experiments

6.1 Setup


We investigate two digits datasets, MNIST and USPS, with the transfer tasks MNIST to USPS (M→U) and USPS to MNIST (U→M). We used the standard train / test split for training and evaluation. Office-31 is a dataset of images containing objects spread among 31 classes captured from different domains: Amazon, DSLR camera and a Webcam camera. DSLR and Webcam are very similar domains but images differ by their exposure and their quality.

Label shifted datasets.

We stress-test our approach by investigating more challenging settings where the label distribution shifts strongly across domains. For the Digits dataset, we explore a wide variety of shifts by keeping only 5%, 10%, 15% and 20% of digits between 0 and 5 of the original dataset. We have investigated the tasks U→M and M→U. For the Office-31 dataset, we explore the shift where the objects spread in classes 16 to 31 are duplicated 5 times. Shifting the distribution in the source domain rather than the target domain makes it easier to appreciate the drop in performance in the target domain compared to the case where the source domain is not shifted.
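A label-shifted source set of the kind described for Digits can be built by subsampling a subset of classes. The sketch below is one possible construction; the function name, the per-class shuffling, the seed and the choice of which classes to shift are all assumptions, since the exact sampling procedure is not specified here.

```python
import random
from typing import Dict, Iterable, List, Sequence


def subsample_classes(
    labels: Sequence[int],
    shifted_classes: Iterable[int],
    keep_fraction: float,
    seed: int = 0,
) -> List[int]:
    """Return sorted indices of a label-shifted subset.

    Samples whose label is in `shifted_classes` are kept with the given
    fraction (at least one per class); all other samples are kept.
    """
    shifted = set(shifted_classes)
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    kept: List[int] = []
    for label, indices in by_class.items():
        if label in shifted:
            rng.shuffle(indices)
            n_keep = max(1, round(keep_fraction * len(indices)))
            indices = indices[:n_keep]
        kept.extend(indices)
    return sorted(kept)


# Toy label vector: 100 samples of digit 0 and 100 samples of digit 9.
toy_labels = [0] * 100 + [9] * 100
kept_indices = subsample_classes(toy_labels, shifted_classes={0}, keep_fraction=0.05)
kept_labels = [toy_labels[i] for i in kept_indices]
```

Applying the shift to the source set only leaves the target evaluation untouched, so any accuracy drop can be attributed to the altered source label distribution.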

Comparison with the state-of-the-art.

For all tasks, we report results from DANN [13] and CDAN [24]. To study the effect of weights, we name our method RUDA when weights are set to 1, and RUDA$_w$ when weights are used. For the non-shifted datasets, we report a weighted version of CDAN (entropy conditioning CDAN+E [24]). For the label shifted datasets, we report IWAN [38], a weighted DANN where weights are learned from a second discriminator, and CDAN$_w$, a weighted CDAN where weights are added in the same setting as RUDA$_w$.

Training details.

Models are trained for 20,000 iterations of SGD. We report end-of-training accuracy in the target domain averaged over five random seeds. The model for the Office-31 dataset uses a pretrained ResNet-50 [18]. We used the same hyper-parameters as [24], which were selected by importance weighted cross-validation [35]. The trade-off parameter is smoothly pushed from 0 to 1 as detailed in [24]. To prevent noisy weighting in early learning, we used weight relaxation: based on the sigmoid output of discriminator , we used and weights . is decreased to 1 during training: where , is the training progress. In all experiments, is set to (except for where , see Appendix 0.C.3 for more details).
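The smooth ramp of the trade-off parameter can be sketched with the sigmoid-shaped schedule commonly paired with gradient reversal layers. The text only states that the parameter goes from 0 to 1 as in [24]; the exact formula and the steepness value of 10 used below are assumptions.

```python
import math


def trade_off_schedule(progress: float, gamma: float = 10.0) -> float:
    """Ramp from 0 to (almost) 1 as training progress goes from 0 to 1.

    `gamma` controls the steepness; 10 is a commonly used value and is an
    assumption here.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


# Value of the trade-off parameter at a few points of a 20,000-step run.
total_steps = 20000
schedule = [trade_off_schedule(step / total_steps) for step in (0, 5000, 20000)]
```

Starting at zero keeps the adversarial terms from disturbing the classifier while the discriminators are still poorly trained.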

6.2 Results

Unshifted datasets.

On both Office-31 (Table 1) and Digits (Table 2), RUDA performs similarly to CDAN. Simply performing the scalar product makes it possible to match the results obtained by multi-linear conditioning [24]. This presents a second advantage: when domains exhibit a large number of classes, e.g. in Office-Home (See Appendix), our approach does not need to leverage a random layer. It is interesting to observe that we achieve performances close to CDAN+E on Office-31 while we do not use entropy conditioning. However, we observe a substantial drop in performance when adding weights, although results remain comparable with CDAN on Office-31. This is a disappointing result: since those datasets naturally exhibit label shift, one could expect weights to improve on the baselines. We did not observe this phenomenon on standard benchmarks.

Method       A→W          W→A          A→D          D→A          D→W          W→D           Avg
Standard
ResNet-50    68.4 ± 0.2   60.7 ± 0.3   68.9 ± 0.2   62.5 ± 0.3   96.7 ± 0.1   99.3 ± 0.1    76.1
DANN         82.0 ± 0.4   67.4 ± 0.5   79.7 ± 0.4   68.2 ± 0.4   96.9         99.1          82.2
CDAN         93.1 ± 0.2   68.0 ± 0.4   89.8 ± 0.3   70.1 ± 0.4   98.2 ± 0.2   100.0 ± 0.0   86.6
CDAN+E       94.1 ± 0.1   69.3 ± 0.4   92.9 ± 0.2   71.0 ± 0.3   98.6 ± 0.1   100.0 ± 0.0   87.7
RUDA         94.3 ± 0.3   70.7 ± 0.3   92.1 ± 0.3   70.7 ± 0.1   98.5 ± 0.1   100.0 ± 0.0   87.6
RUDA_w       92.0 ± 0.3   67.9 ± 0.3   91.1 ± 0.3   70.2 ± 0.2   98.6 ± 0.1   100.0 ± 0.0   86.6
Label shifted
ResNet-50    72.4 ± 0.7   59.5 ± 0.1   79.0 ± 0.1   61.6 ± 0.3   97.8 ± 0.1   99.3 ± 0.1    78.3
DANN         67.5 ± 0.1   52.1 ± 0.8   69.7 ± 0.0   51.5 ± 0.1   89.9         75.9          67.8
CDAN         82.5 ± 0.4   62.9 ± 0.6   81.4 ± 0.5   65.5 ± 0.5   98.5 ± 0.3   99.8 ± 0.0    81.6
RUDA         85.4 ± 0.8   66.7 ± 0.5   81.3 ± 0.3   64.0 ± 0.5   98.4 ± 0.2   99.5 ± 0.1    82.1
IWAN         72.4 ± 0.4   54.8 ± 0.8   75.0 ± 0.3   54.8 ± 1.3   97.0         95.8          75.0
CDAN_w       81.5         64.5 ± 0.4   80.7 ± 1.0   65 ± 0.8     98.7 ± 0.2   99.9 ± 0.1    81.8
RUDA_w       87.4 ± 0.2   68.3 ± 0.3   82.9 ± 0.4   68.8 ± 0.2   98.7 ± 0.1   100.0 ± 0.0   83.8
Table 1: Accuracy (%) on the Office-31 dataset. The subscript w marks the weighted variants.
Method    U→M                                    M→U                                    Avg
Shift     5%    10%   15%   20%   100%  Avg      5%    10%   15%   20%   100%  Avg
DANN      41.7  51.0  59.6  69.0  94.5  63.2     34.5  51.0  59.6  63.6  90.7  59.9     63.2
CDAN      50.7  62.2  82.9  82.8  96.9  75.1     32.0  69.7  78.9  81.3  93.9  71.2     73.2
RUDA      44.4  58.4  80.0  84.0  95.5  72.5     34.9  59.0  76.1  78.8  93.3  68.4     70.5
IWAN      73.7  74.4  78.4  77.5  95.7  79.9     72.2  82.0  84.3  86.0  92.0  83.3     81.6
CDAN_w    68.3  78.8  84.9  88.4  96.6  83.4     69.4  80.0  83.5  87.8  93.7  82.9     83.2
RUDA_w    78.7  82.8  86.0  86.9  93.9  85.7     78.7  87.9  88.2  89.3  92.5  87.3     86.5
Table 2: Accuracy (%) on the Digits dataset. The subscript w marks the weighted variants.
Label shifted datasets.

We stress-tested our approach by applying strong label shifts to the datasets. First, we observe a drop in performance for all methods based on invariant representations compared with the situation without label shift. This is consistent with works that warn of the pitfall of domain invariant representations in the presence of label shift [20, 40]. RUDA and CDAN perform similarly even in this setting. It is interesting to note that the weights significantly improve RUDA results (+1.7% on Office-31 and +16.0% on Digits, both on average) while CDAN seems less impacted by them (+0.2% on Office-31 and +10.0% on Digits, both on average).

Should we use weights?

To observe a significant benefit of weights, we had to explore situations with strong label shift on the Digits dataset. Apart from these cases, weights bring a small gain (e.g. +1.7% on Office-31 for RUDA) or even marginally degrade adaptation. Understanding why RUDA and CDAN are able to address small label shift, without weights, is of great interest for the development of more robust UDA.

7 Related work

This paper makes several contributions, both in terms of theory and algorithm. Concerning theory, our bound provides a risk suitable for domain adversarial learning with weighting strategies. Existing theories for non-overlapping supports [4, 27] and importance sampling [11, 30] explore neither the role of representations nor the aspect of adversarial learning. In [5], an analysis of representations is conducted, and connections with our work are discussed in the paper. The work [20] is close to ours and introduces a distance which measures support overlap between source and target distributions under covariate shift. Our analysis does not rely on such an assumption, so its range of application is broader.

Concerning algorithms, covariate shift adaptation has been well-studied in the literature [19, 17, 35]. Importance sampling to address label shift has also been investigated [34], notably with kernel mean matching [39] and Optimal Transport [31]. Recently, a scheme for estimating the label distribution ratio with a consistency guarantee has been proposed [21]. Learning domain invariant representations has also been investigated following [13, 23]; approaches mainly differ by the metric chosen for comparing distributions of representations. For instance, metrics are domain adversarial (Jensen-Shannon divergence) [13, 24], or IPM based such as MMD [23, 25] or Wasserstein [6, 32]. Our work provides a new theoretical support for these methods since our analysis is valid for any IPM.

Using both weights and representations is also an active topic, namely for Partial Domain Adaptation (PADA) [9], when target classes are a strict subset of the source classes, or Universal Domain Adaptation [37], when new classes may appear in the target domain. [9] uses a heuristic based on predicted labels for re-weighting representations. However, it assumes that a good classifier is available at first in order to obtain cycle consistent weights. [38] uses a second discriminator for learning weights, which is similar to [8]. Applying our framework to Partial DA and Universal DA is an interesting future direction. Our work shares strong connections with [10] (authors were not aware of this work during the elaboration of this paper), which uses consistent estimation of the true label distribution from [21]. We suggest a very similar empirical evaluation and we also investigate the effect of weights on the CDAN loss [24], with a different weighting scheme since our approach computes weights in the representation space. All these works rely on an assumption at some level, e.g. Generalized Label Shift in [10], when designing weighting strategies. Our discussion on the role of inductive design of weights may provide a new theoretical support for these approaches.

8 Conclusion

The present work introduces a new bound of the target risk which unifies weights and representations in UDA. We conduct a theoretical analysis of the role of inductive bias when designing both weights and the classifier. In light of this analysis, we propose a new learning procedure which leverages two weak inductive biases, respectively on weights and the classifier. To the best of our knowledge, this procedure is original while being close to straightforward hybridization of existing methods. We illustrate its effectiveness on two benchmarks. The empirical analysis shows that weak inductive bias can make adaptation more robust even when stressed by strong label shift between source and target domains. This work leaves room for in-depth study of stronger inductive bias by providing both theoretical and empirical foundations.


Victor Bouvier is funded by Sidetrade and ANRT (France) through a CIFRE collaboration with CentraleSupélec. Authors thank the anonymous reviewers for their insightful comments for improving the quality of the paper. This work was performed using HPC resources from the “Mésocentre” computing center of CentraleSupélec and École Normale Supérieure Paris-Saclay supported by CNRS and Région Île-de-France (


  • [1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in ai safety. arXiv preprint arXiv:1606.06565. Cited by: §1.
  • [2] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: §1, §3.3.
  • [3] S. Beery, G. Van Horn, and P. Perona (2018) Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 456–473. Cited by: §1.
  • [4] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Machine learning 79 (1-2), pp. 151–175. Cited by: §7.
  • [5] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira (2007) Analysis of representations for domain adaptation. In Advances in neural information processing systems, pp. 137–144. Cited by: §1, §1, §2, Figure 1, 1(a), §7.
  • [6] B. Bhushan Damodaran, B. Kellenberger, R. Flamary, D. Tuia, and N. Courty (2018) Deepjdot: deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 447–463. Cited by: §7.
  • [7] L. Bottou, M. Arjovsky, D. Lopez-Paz, and M. Oquab (2018) Geometrical insights for implicit generative modeling. In Braverman Readings in Machine Learning. Key Ideas from Inception to Current State, pp. 229–268. Cited by: §0.C.1.
  • [8] Y. Cao, M. Long, and J. Wang (2018) Unsupervised domain adaptation with distribution matching machines. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §4.2, §7.
  • [9] Z. Cao, L. Ma, M. Long, and J. Wang (2018) Partial adversarial domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–150. Cited by: §1, §3.3, §4.2, §7.
  • [10] R. T. d. Combes, H. Zhao, Y. Wang, and G. Gordon (2020) Domain adaptation with conditional distribution matching and generalized label shift. arXiv preprint arXiv:2003.04475. Cited by: §7.
  • [11] C. Cortes, Y. Mansour, and M. Mohri (2010) Learning bounds for importance weighting. In Advances in neural information processing systems, pp. 442–450. Cited by: §7.
  • [12] A. D’Amour, P. Ding, A. Feller, L. Lei, and J. Sekhon (2017) Overlap in observational studies with high-dimensional covariates. arXiv preprint arXiv:1711.02582. Cited by: §1.
  • [13] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pp. 1180–1189. Cited by: §0.C.1, §1, §1, §2, §4.1.2, §5, §6.1, §7.
  • [14] M. Geva, Y. Goldberg, and J. Berant (2019) Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1161–1166. Cited by: §1.
  • [15] Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536. Cited by: §0.A.6, item 2, §4.1.2, §4.1.2.
  • [16] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773. Cited by: §3.1.1.
  • [17] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf (2009) Covariate shift by kernel mean matching. Dataset shift in machine learning 3 (4), pp. 5. Cited by: §7.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §6.1.
  • [19] J. Huang, A. Gretton, K. Borgwardt, B. Schölkopf, and A. J. Smola (2007) Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pp. 601–608. Cited by: §1, §7.
  • [20] F. Johansson, D. Sontag, and R. Ranganath (2019) Support and invertibility in domain-invariant representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 527–536. Cited by: §1, §1, §3.3, §6.2, §7, Proposition 1.
  • [21] Z. Lipton, Y. Wang, and A. Smola (2018) Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pp. 3122–3130. Cited by: §7, §7.
  • [22] H. Liu, M. Long, J. Wang, and M. Jordan (2019) Transferable adversarial training: a general approach to adapting deep classifiers. In International Conference on Machine Learning, pp. 4013–4022. Cited by: §1.
  • [23] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pp. 97–105. Cited by: §1, §1, §7.
  • [24] M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640–1650. Cited by: Appendix 0.B, item 2, §1, §1, §4.1.2, §4.1.2, §4.2, §5, §6.1, §6.1, §6.2, §7, §7.
  • [25] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2017) Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2208–2217. Cited by: §7.
  • [26] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: Figure 1.
  • [27] Y. Mansour, M. Mohri, and A. Rostamizadeh (2009) Domain adaptation: learning bounds and algorithms. In 22nd Conference on Learning Theory, COLT 2009, Cited by: §2, §7.
  • [28] S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §1.
  • [29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §5.
  • [30] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence (2009) Dataset shift in machine learning. The MIT Press. Cited by: §1, §2, §7.
  • [31] I. Redko, N. Courty, R. Flamary, and D. Tuia (2018) Optimal transport for multi-source domain adaptation under target shift. arXiv preprint arXiv:1803.04899. Cited by: §7.
  • [32] J. Shen, Y. Qu, W. Zhang, and Y. Yu (2018) Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §7.
  • [33] H. Shimodaira (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference 90 (2), pp. 227–244. Cited by: §1.
  • [34] A. Storkey (2009) When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, pp. 3–28. Cited by: §7.
  • [35] M. Sugiyama, M. Krauledat, and K. Müller (2007) Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8 (May), pp. 985–1005. Cited by: §6.1, §7.
  • [36] Y. Wu, E. Winston, D. Kaushik, and Z. Lipton (2019) Domain adaptation with asymmetrically-relaxed distribution alignment. In International Conference on Machine Learning, pp. 6872–6881. Cited by: §1, §3.3, §4.2.
  • [37] K. You, M. Long, Z. Cao, J. Wang, and M. I. Jordan (2019) Universal domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2720–2729. Cited by: §1, §3.3, §4.2, §7.
  • [38] J. Zhang, Z. Ding, W. Li, and P. Ogunbona (2018) Importance weighted adversarial nets for partial domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8156–8164. Cited by: §3.3, §6.1, §7.
  • [39] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang (2013) Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pp. 819–827. Cited by: §7.
  • [40] H. Zhao, R. T. Des Combes, K. Zhang, and G. Gordon (2019) On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pp. 7523–7532. Cited by: §1, §3.3, §6.2, Proposition 1.

Appendix 0.A Proofs

We provide full proofs of the bounds and propositions presented in the paper.

0.A.1 Proof of Bound 2

We give a proof of bound 2 which states:


First, we prove the following lemma:

Bound 5 (Revisit of theorem 1)



This is simply obtained using triangle inequalities:

Now using (A3) () :


which shows that: and we use the property of conditional expectation .

Second, we bound .

Proposition 6



We recall that . Since (A1) ensures , , then and finally . Furthermore, (A2) ensures that which finally leads to the announced result.

Third, we bound .

Proposition 7



We note and we omit for ease of reading

Since does not intervene in , we show this term behaves similarly to . First,

(Using (A1))
(Using (A3))
(Using (A2))