1 Introduction
The recent successes of supervised deep learning methods have been partially attributed to rich datasets and increasing computational power. However, in many critical applications, e.g., self-driving cars or personal healthcare, it is often prohibitively expensive and time-consuming to collect large-scale supervised training data. Unsupervised domain adaptation (DA) focuses on such limitations by trying to transfer knowledge from a labeled source domain to an unlabeled target domain, and a large body of work tries to achieve this by exploring domain-invariant structures and representations to bridge the gap. Theoretical results (Ben-David et al., 2010; Mansour et al., 2009a; Mansour and Schain, 2012) and algorithms (Glorot et al., 2011; Becker et al., 2013; Ajakan et al., 2014; Adel et al., 2017; Pei et al., 2018) under this setting are abundant.
Due to the ability of deep neural nets to learn rich feature representations, recent advances in domain adaptation have focused on using these networks to learn invariant representations, i.e., intermediate features whose distribution is the same in source and target domains, while at the same time achieving small error on the source domain. The hope is that the learnt intermediate representation, together with the hypothesis learnt using labeled data from the source domain, can generalize to the target domain. Nevertheless, from a theoretical standpoint, it is not at all clear whether aligned representations and small source error are sufficient to guarantee good generalization on the target domain. In fact, despite being successfully applied in various applications (Zhang et al., 2017; Hoffman et al., 2017), it has also been reported that such methods fail to generalize in certain closely related source/target pairs, e.g., digit classification from MNIST to SVHN (Ganin et al., 2016).
Given the wide application of domain adaptation methods based on learning invariant representations, we attempt in this paper to answer the following important and intriguing question: Is finding invariant representations while at the same time achieving a small source error sufficient to guarantee a small target error? If not, under what conditions is it? Contrary to common belief, we give a negative answer to the above question by constructing a simple example showing that these two conditions are not sufficient to guarantee target generalization, even in the case of perfectly aligned representations between the source and target domains. In fact, our example shows that the objective of learning invariant representations while minimizing the source error can actually be hurtful, in the sense that the better the objective, the larger the target error. At a colloquial level, this happens because learning invariant representations can break the originally favorable underlying problem structure, i.e., close labeling functions and conditional distributions. To understand when such methods work, we propose a generalization upper bound as a sufficient condition that explicitly takes into account the conditional shift between source and target domains. The proposed upper bound admits a natural interpretation and decomposition in domain adaptation; we show that it is tighter than existing results in certain cases.
Simultaneously, to understand what the necessary conditions for representation-based approaches to work are, we prove an information-theoretic lower bound on the joint error of both domains for any algorithm based on learning invariant representations. Our result complements the above upper bound and also extends the constructed example to more general settings. The lower bound sheds new light on this problem by characterizing a fundamental tradeoff between learning invariant representations and achieving small joint error on both domains when the marginal label distributions differ from source to target. Our lower bound directly implies that minimizing source error while achieving invariant representation will only increase the target error. We conduct experiments on real-world datasets that corroborate this theoretical implication. Together with the generalization upper bound, our results suggest that adaptation should be designed to align the label distribution as well when learning an invariant representation (cf. Sec. 4.3). We believe these insights will be helpful to guide the future design of domain adaptation and representation learning algorithms.
2 Preliminary
We first introduce the notation used throughout this paper and review a theoretical model for domain adaptation (DA) (Kifer et al., 2004; Ben-David et al., 2007; Blitzer et al., 2008; Ben-David et al., 2010).
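Notations We use $\mathcal{X}$ and $\mathcal{Y}$ to denote the input and output space, respectively. Similarly, $\mathcal{Z}$ stands for the representation space induced from $\mathcal{X}$ by a feature transformation $g: \mathcal{X} \to \mathcal{Z}$. Accordingly, we use $X, Y, Z$ to denote the random variables which take values in $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$, respectively. In this work, a domain corresponds to a distribution $\mathcal{D}$ on the input space $\mathcal{X}$ and a labeling function $f: \mathcal{X} \to [0, 1]$. In the domain adaptation setting, we use $\langle \mathcal{D}_S, f_S \rangle$ and $\langle \mathcal{D}_T, f_T \rangle$ to denote the source and target domains, respectively. A hypothesis is a function $h: \mathcal{X} \to \{0, 1\}$. The error of a hypothesis $h$ w.r.t. the labeling function $f$ under distribution $\mathcal{D}_S$ is defined as: $\varepsilon_S(h, f) := \mathbb{E}_{x \sim \mathcal{D}_S}[|h(x) - f(x)|]$. When $f$ and $h$ are binary classification functions, this definition reduces to the probability that $h$ disagrees with $f$ under $\mathcal{D}_S$: $\mathbb{E}_{x \sim \mathcal{D}_S}[|h(x) - f(x)|] = \Pr_{x \sim \mathcal{D}_S}(h(x) \neq f(x))$. In this work, we focus on the deterministic setting where the output is given by a deterministic labeling function defined on the corresponding domain. For two functions $g$ and $h$ with compatible domains and ranges, we use $h \circ g$ to denote the function composition $h(g(\cdot))$. Other notations will be introduced in the context when necessary.
2.1 Problem Setup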
We consider the unsupervised domain adaptation problem where the learning algorithm has access to a set of labeled points sampled i.i.d. from the source domain and a set of unlabeled points sampled i.i.d. from the target domain. At a colloquial level, the goal of an unsupervised domain adaptation algorithm is to generalize well on the target domain by learning from labeled samples from the source domain as well as unlabeled samples from the target domain. Formally, let the risk of hypothesis $h$ be the error of $h$ w.r.t. the true labeling function under domain $\mathcal{D}_S$, i.e., $\varepsilon_S(h) := \varepsilon_S(h, f_S)$. As commonly used in computational learning theory, we denote by $\widehat{\varepsilon}_S(h)$ the empirical risk of $h$ on the source domain. Similarly, we use $\varepsilon_T(h)$ and $\widehat{\varepsilon}_T(h)$ to mean the true risk and the empirical risk on the target domain. The problem of domain adaptation considered in this work can be stated as: under what conditions and by what algorithms can we guarantee that a small training error $\widehat{\varepsilon}_S(h)$ implies a small test error $\varepsilon_T(h)$? Clearly, this goal is not always achievable if the source and target domains are far away from each other.
2.2 A Theoretical Model for Domain Adaptation
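To measure the similarity between two domains, it is crucial to define a discrepancy measure between them. To this end, Ben-David et al. (2010) proposed the $\mathcal{H}$-divergence to measure the distance between two distributions:
Definition 2.1 ($\mathcal{H}$-divergence).
Let $\mathcal{H}$ be a hypothesis class on input space $\mathcal{X}$, and $\mathcal{A}_{\mathcal{H}}$ be the collection of subsets of $\mathcal{X}$ that are the support of some hypothesis in $\mathcal{H}$, i.e., $\mathcal{A}_{\mathcal{H}} := \{h^{-1}(1) \mid h \in \mathcal{H}\}$. The distance between two distributions $\mathcal{D}$ and $\mathcal{D}'$ based on $\mathcal{H}$ is: $d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}') := \sup_{A \in \mathcal{A}_{\mathcal{H}}} |\Pr_{\mathcal{D}}(A) - \Pr_{\mathcal{D}'}(A)|$.
Footnote 1. To be precise, Ben-David et al. (2007)'s original definition of the $\mathcal{H}$-divergence has a factor of 2; we choose the current definition as the constant factor is inessential.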
The $\mathcal{H}$-divergence is particularly favorable in the analysis of domain adaptation with binary classification problems, and it has also been generalized to the discrepancy distance (Cortes et al., 2008; Mansour et al., 2009a, b; Cortes and Mohri, 2014) for general loss functions, including the one for regression problems. Both the $\mathcal{H}$-divergence and the discrepancy distance can be estimated using finite unlabeled samples from both domains when $\mathcal{H}$ has a finite VC-dimension.
One flexibility of the $\mathcal{H}$-divergence is that its power on measuring the distance between two distributions can be controlled by the richness of the hypothesis class $\mathcal{H}$. To see this, first consider the situation where $\mathcal{H}$ is very restrictive so that it only contains the constant functions $h \equiv 0$ and $h \equiv 1$. In this case, it can be readily verified by the definition that $d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}') = 0$ for all $\mathcal{D}, \mathcal{D}'$. On the other extreme, if $\mathcal{H}$ contains all the measurable binary functions, then $d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}') = 0$ iff $\mathcal{D}(\cdot) = \mathcal{D}'(\cdot)$ almost surely. In this case the $\mathcal{H}$-divergence reduces to the $L_1$ distance, or equivalently to the total variation, between the two distributions.
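To make the two extremes concrete, the following minimal sketch estimates the divergence on finite samples for two hypothetical hypothesis classes: the two constant functions, and a small family of one-dimensional threshold functions. The sample distributions, the threshold grid, and the helper name `empirical_h_divergence` are illustrative assumptions, not part of the original analysis.

```python
import random

def empirical_h_divergence(hypotheses, sample_p, sample_q):
    """Largest gap, over the class, between the fraction of each sample labeled 1."""
    best = 0.0
    for h in hypotheses:
        mass_p = sum(h(x) for x in sample_p) / len(sample_p)
        mass_q = sum(h(x) for x in sample_q) / len(sample_q)
        best = max(best, abs(mass_p - mass_q))
    return best

rng = random.Random(0)
# Two assumed distributions with disjoint supports.
sample_p = [rng.uniform(-1.0, 0.0) for _ in range(2000)]
sample_q = [rng.uniform(1.0, 2.0) for _ in range(2000)]

# Restrictive class: only the constant functions 0 and 1.
constants = [lambda x: 0, lambda x: 1]
# Richer class: thresholds x >= t on a grid from -1.0 to 2.0.
thresholds = [lambda x, t=k / 10: int(x >= t) for k in range(-10, 21)]

d_const = empirical_h_divergence(constants, sample_p, sample_q)
d_thresh = empirical_h_divergence(thresholds, sample_p, sample_q)
print(d_const, d_thresh)
```

The constant class cannot tell the two samples apart, while a single threshold placed between the supports separates them completely, which illustrates how the richness of the class controls the sensitivity of the divergence.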
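Given a hypothesis class $\mathcal{H}$, we define its symmetric difference w.r.t. itself as: $\mathcal{H}\Delta\mathcal{H} := \{h(x) \oplus h'(x) \mid h, h' \in \mathcal{H}\}$, where $\oplus$ is the xor operation. Let $h^*$ be the optimal hypothesis that achieves the minimum joint risk on both the source and target domains: $h^* := \arg\min_{h \in \mathcal{H}} \varepsilon_S(h) + \varepsilon_T(h)$, and let $\lambda^*$ denote the joint risk of the optimal hypothesis $h^*$: $\lambda^* := \varepsilon_S(h^*) + \varepsilon_T(h^*)$. Ben-David et al. (2010) proved the following generalization bound on the target risk in terms of the source risk and the discrepancy between the source and target domains:
Theorem 2.1.
(Ben-David et al., 2010) Let $\mathcal{H}$ be a hypothesis space of VC-dimension $d$ and $\widehat{\mathcal{D}}_S$ (resp. $\widehat{\mathcal{D}}_T$) be the empirical distribution induced by a sample of size $n$ drawn from $\mathcal{D}_S$ (resp. $\mathcal{D}_T$). Then with probability at least $1 - \delta$, $\forall h \in \mathcal{H}$,
$$\varepsilon_T(h) \le \widehat{\varepsilon}_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\widehat{\mathcal{D}}_S, \widehat{\mathcal{D}}_T) + \lambda^* + O\left(\sqrt{\frac{d \log n + \log(1/\delta)}{n}}\right). \tag{1}$$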
The bound depends on $\lambda^*$, the optimal joint risk that can be achieved by the hypotheses in $\mathcal{H}$. The intuition is the following: if $\lambda^*$ is large, we cannot hope for a successful domain adaptation. Later in Sec. 4.3, we shall get back to this term to show an information-theoretic lower bound on it for any approach based on learning invariant representations.
Theorem 2.1 is the foundation of many recent works on unsupervised domain adaptation via learning invariant representations (Ajakan et al., 2014; Ganin et al., 2016; Zhao et al., 2018b; Pei et al., 2018; Zhao et al., 2018a). It has also inspired various applications of domain adaptation with adversarial learning, e.g., video analysis (Hoffman et al., 2016; Shrivastava et al., 2016; Hoffman et al., 2017; Tzeng et al., 2017), natural language understanding (Zhang et al., 2017; Fu et al., 2017), and speech recognition (Zhao et al., 2017; Hosseini-Asl et al., 2018), to name a few.
At a high level, the key idea is to learn a rich and parametrized feature transformation $g: \mathcal{X} \to \mathcal{Z}$ such that the induced source and target distributions (on $\mathcal{Z}$) are close, as measured by the $\mathcal{H}$-divergence. We call $g$ an invariant representation w.r.t. $\mathcal{H}$ if $d_{\mathcal{H}}(\mathcal{D}_S^g, \mathcal{D}_T^g) = 0$, where $\mathcal{D}_S^g$ (resp. $\mathcal{D}_T^g$) is the induced source (resp. target) distribution. At the same time, these algorithms also try to find a new hypothesis (on the representation space $\mathcal{Z}$) that achieves a small empirical error on the source domain. Taken as a whole, these two procedures correspond to simultaneously finding an invariant representation and a hypothesis that minimize the first two terms in the generalization upper bound of Theorem 2.1.
3 Related Work
A number of adaptation approaches based on learning invariant representations have been proposed in recent years. Although in this paper we mainly focus on using the $\mathcal{H}$-divergence to characterize the discrepancy between two distributions, other distance measures can be used as well, e.g., the maximum mean discrepancy (MMD) (Long et al., 2014, 2015, 2016), the Wasserstein distance (Courty et al., 2017b, a; Shen et al., 2018; Lee and Raginsky, 2018), etc.
Under the theoretical framework of the $\mathcal{H}$-divergence, Ganin et al. (2016) propose a domain adversarial neural network (DANN) to learn domain-invariant features. Adversarial training techniques that aim to build feature representations that are indistinguishable between source and target domains have been proposed in the last few years (Ajakan et al., 2014; Ganin et al., 2016). Specifically, one of the central ideas is to use neural networks, which are powerful function approximators, to approximate the $\mathcal{H}$-divergence between two domains (Kifer et al., 2004; Ben-David et al., 2007, 2010). The overall algorithm can be viewed as a zero-sum two-player game: one network tries to learn feature representations that can fool the other network, whose goal is to distinguish the representations generated on the source domain from those generated on the target domain. In this work, by showing an information-theoretic lower bound on the joint error of these methods, we show that although invariant representations can be achieved, this does not necessarily translate to good generalization on the target domain, in particular when the label distributions of the two domains differ significantly.
(Anti) Causal approaches based on conditional and label shifts for domain adaptation also exist (Zhang et al., 2013; Gong et al., 2016; Lipton et al., 2018; Azizzadenesheli et al., 2018). One typical assumption made to simplify the analysis in this line of work is that the source and target domains share the same generative distribution $\Pr(X \mid Y)$ and only differ in the marginal label distribution $\Pr(Y)$. Note that this contrasts with the classic covariate shift assumption, which instead assumes the source and target domains share the same conditional distribution $\Pr(Y \mid X)$ and differ in the marginal data distribution $\Pr(X)$. It is worth noting that Zhang et al. (2013) showed that conditional shift can be successfully corrected when the changes in $\Pr(X \mid Y)$ follow some parametric families. In this work we focus on representation learning and do not make such explicit assumptions.
4 Theoretical Analysis
Is finding invariant representations alone a sufficient condition for the success of domain adaptation? Clearly it is not. Consider the following simple counterexample: let $g_c: \mathcal{X} \to \mathcal{Z}$ be a constant function, where $\forall x \in \mathcal{X}$, $g_c(x) = c \in \mathcal{Z}$. Then for any discrepancy distance $d(\cdot, \cdot)$ over two distributions, including the $\mathcal{H}$-divergence, MMD, and the Wasserstein distance, and for any distributions $\mathcal{D}_S, \mathcal{D}_T$ over the input space $\mathcal{X}$, we have $d(\mathcal{D}_S^{g_c}, \mathcal{D}_T^{g_c}) = 0$, where we use $\mathcal{D}_S^{g_c}$ (resp. $\mathcal{D}_T^{g_c}$) to mean the induced source (resp. target) distribution by the transformation $g_c$ over the representation space $\mathcal{Z}$. Furthermore, it is fairly easy to construct source and target domains $\langle \mathcal{D}_S, f_S \rangle$, $\langle \mathcal{D}_T, f_T \rangle$, such that for any hypothesis $h: \mathcal{Z} \to \mathcal{Y}$, $\varepsilon_T(h \circ g_c) \ge 1/2$, while there exists a classification function $f: \mathcal{X} \to \mathcal{Y}$ that achieves small error, e.g., the labeling function.
One may argue, with good reason, that in the counterexample above, the empirical source error $\widehat{\varepsilon}_S(h \circ g_c)$ is also large with high probability. Intuitively, this is because the simple constant transformation function fails to retain the discriminative information about the classification task at hand, despite the fact that it can construct invariant representations.
Is finding invariant representations and achieving a small source error sufficient to guarantee small target error? In this section we first give a negative answer to this question by constructing a counterexample where there exists a nontrivial transformation function $g: \mathcal{X} \to \mathcal{Z}$ and hypothesis $h: \mathcal{Z} \to \mathcal{Y}$ such that both the source error $\varepsilon_S(h \circ g)$ and the divergence between the induced distributions are small, while at the same time the target error $\varepsilon_T(h \circ g)$ is large. Motivated by this negative result, we proceed to prove a generalization upper bound that explicitly characterizes a sufficient condition for the success of domain adaptation. We then complement the upper bound by showing an information-theoretic lower bound on the joint error of any domain adaptation approach based on learning invariant representations.
4.1 Invariant Representation and Small Source Risk are Not Sufficient
In this section, we shall construct a simple 1-dimensional example where there exists a function that achieves zero error on both source and target domains. Simultaneously, we show that there exists a continuous transformation function under which the induced source and target distributions are perfectly aligned, but every hypothesis incurs a large joint error on the induced source and target domains. The latter further implies that if we find a hypothesis that achieves small error on the source domain, then it has to incur a large error on the target domain. We illustrate this example in Fig. 1.
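Let $\mathcal{X} = \mathcal{Z} = \mathbb{R}$ and $\mathcal{Y} = \{0, 1\}$. For $a \le b$, we use $U(a, b)$ to denote the uniform distribution over $[a, b]$. Consider the following source and target domains:
$$\mathcal{D}_S = U(-1, 0), \quad f_S(x) = \begin{cases} 0, & x \le -1/2 \\ 1, & \text{otherwise} \end{cases}; \qquad \mathcal{D}_T = U(1, 2), \quad f_T(x) = \begin{cases} 0, & x \ge 3/2 \\ 1, & \text{otherwise} \end{cases}.$$
In the above example, it is easy to verify that the interval hypothesis $h^*(x) = 1$ iff $x \in (-1/2, 3/2)$ achieves perfect classification on both domains.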
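Now consider the following continuous transformation:
$$g(x) = \mathbb{I}_{x \le 0}(x) \cdot (x + 1) + \mathbb{I}_{x > 0}(x) \cdot (x - 1).$$
Since $g(\cdot)$ is a piecewise linear function, it follows that $\mathcal{D}_S^Z = \mathcal{D}_T^Z = U(0, 1)$, and for any distance metric $d(\cdot, \cdot)$ over distributions, we have $d(\mathcal{D}_S^Z, \mathcal{D}_T^Z) = 0$. But now for any hypothesis $h: \mathbb{R} \to \{0, 1\}$ and $\forall x \in [0, 1]$, $h(x)$ will make an error in exactly one of the domains, hence
$$\forall h: \mathbb{R} \to \{0, 1\}, \quad \varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) = 1.$$
In other words, under the above invariant transformation $g$, the smaller the source error, the larger the target error.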
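The tradeoff can be checked numerically. The sketch below is a hypothetical instance of the construction: it assumes the source is uniform on [-1, 0] with label 1 iff x > -1/2, the target is uniform on [1, 2] with label 1 iff x < 3/2, and the transformation shifts each support onto [0, 1]. These concrete choices, the deterministic grids standing in for the uniform distributions, and all function names are assumptions made for illustration.

```python
def f_source(x):
    """Assumed source labeling function on [-1, 0]."""
    return 0 if x <= -0.5 else 1

def f_target(x):
    """Assumed target labeling function on [1, 2]."""
    return 0 if x >= 1.5 else 1

def g(x):
    """Assumed piecewise-linear map sending both supports onto [0, 1]."""
    return x + 1.0 if x <= 0 else x - 1.0

def error(h, labeler, xs):
    """Fraction of points where h(g(x)) disagrees with the label."""
    return sum(h(g(x)) != labeler(x) for x in xs) / len(xs)

# Deterministic dyadic grids standing in for U(-1, 0) and U(1, 2).
n = 1024
xs_source = [-1.0 + (2 * i + 1) / (2 * n) for i in range(n)]
xs_target = [x + 2.0 for x in xs_source]

# Threshold hypotheses on the representation space, in both orientations.
hypotheses = []
for k in range(9):
    t = k / 8
    hypotheses.append(lambda z, t=t: int(z > t))
    hypotheses.append(lambda z, t=t: int(z <= t))

joint_errors = []
for h in hypotheses:
    e_s = error(h, f_source, xs_source)
    e_t = error(h, f_target, xs_target)
    joint_errors.append(e_s + e_t)
    print(f"source error {e_s:.3f}  target error {e_t:.3f}  sum {e_s + e_t:.3f}")
```

Every hypothesis in this family trades source error for target error one for one, since the two induced labelings disagree at every grid point of the shared representation.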
One may argue that this example seems to contradict the generalization upper bound from Theorem 2.1, where the first two terms correspond exactly to a small source error and an invariant representation. The key to explaining this apparent contradiction lies in the third term of the upper bound, $\lambda^*$, i.e., the optimal joint error achievable on both domains. In our example, when there is no transformation applied to the input space, we show that $h^*$ achieves 0 error on both domains, hence $\lambda^* = 0$. However, when the transformation $g$ is applied to the original input space, we prove that every hypothesis has joint error 1 on the representation space, hence $\lambda^*_g = 1$. Since we usually do not have access to the optimal hypothesis on both domains, although the generalization bound still holds on the representation space, it becomes vacuous in our example.
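An alternative way to interpret the failure of the constructed example is that the labeling functions (or conditional distributions in the stochastic setting) of source and target domains are far away from each other in the representation space. Specifically, in the induced representation space, the optimal labeling functions on the source and target domains are:
$$f'_S(x) = \begin{cases} 0, & x \le 1/2 \\ 1, & \text{otherwise} \end{cases}, \qquad f'_T(x) = \begin{cases} 0, & x \ge 1/2 \\ 1, & \text{otherwise} \end{cases},$$
and we have $\|f'_S - f'_T\|_1 = \mathbb{E}_{x \sim U(0, 1)}[|f'_S(x) - f'_T(x)|] = 1$.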
4.2 A Generalization Upper Bound
For most practical hypothesis spaces $\mathcal{H}$, e.g., half spaces, it is usually intractable to compute the optimal joint error $\lambda^*$ from Theorem 2.1. Furthermore, the fact that $\lambda^*$ contains errors from both domains makes the bound very conservative and loose in many cases. In this section, inspired by the constructed example from Sec. 4.1, we aim to provide a general, intuitive, and interpretable generalization upper bound for domain adaptation that is free of the pessimistic $\lambda^*$ term. Ideally, the bound should also explicitly characterize how the shift between labeling functions of both domains affects domain adaptation. Due to space constraints, we refer the interested reader to the Appendix for the proofs of our technical lemmas, and mainly focus in the following on interpretations and results.
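Because of its flexibility in choosing the witness function class and its natural interpretation as adversarial binary classification, we still adopt the $\mathcal{H}$-divergence to measure the discrepancy between two distributions. For any hypothesis space $\mathcal{H}$, it can be readily verified that $d_{\mathcal{H}}(\cdot, \cdot)$ satisfies the triangle inequality:
$$d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}') \le d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}'') + d_{\mathcal{H}}(\mathcal{D}'', \mathcal{D}'),$$
where $\mathcal{D}, \mathcal{D}', \mathcal{D}''$ are any distributions over the same space. We now introduce a technical lemma that will be helpful in proving results related to the $\mathcal{H}$-divergence: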
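Lemma 4.1.
Let $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and $\mathcal{D}, \mathcal{D}'$ be two distributions over $\mathcal{X}$. Then $\forall h, h' \in \mathcal{H}$, $|\varepsilon_{\mathcal{D}}(h, h') - \varepsilon_{\mathcal{D}'}(h, h')| \le d_{\tilde{\mathcal{H}}}(\mathcal{D}, \mathcal{D}')$, where $\tilde{\mathcal{H}} := \{\mathrm{sgn}(|h(x) - h'(x)| - t) \mid h, h' \in \mathcal{H}, 0 \le t \le 1\}$.
As a matter of fact, the above lemma also holds for any function class $\mathcal{H}$ (not necessarily a hypothesis space) where there exists a constant $M > 0$, such that $\|h\|_\infty \le M$ for all $h \in \mathcal{H}$. Another useful lemma is the following triangle inequality:
Lemma 4.2.
Let $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and $\mathcal{D}$ be any distribution over $\mathcal{X}$. For any $h, h', h'' \in \mathcal{H}$, we have $\varepsilon_{\mathcal{D}}(h, h') \le \varepsilon_{\mathcal{D}}(h, h'') + \varepsilon_{\mathcal{D}}(h'', h')$.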
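Let $f_S$ and $f_T$ be the optimal labeling functions on the source and target domains, respectively. In the stochastic setting, $f_S(x) = \Pr_S(y = 1 \mid x)$ corresponds to the optimal Bayes classifier. With these notations, the following theorem holds:
Theorem 4.1.
Let $\langle \mathcal{D}_S, f_S \rangle$ and $\langle \mathcal{D}_T, f_T \rangle$ be the source and target domains, respectively. For any function class $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$, and $\forall h \in \mathcal{H}$, the following inequality holds:
$$\varepsilon_T(h) \le \varepsilon_S(h) + d_{\tilde{\mathcal{H}}}(\mathcal{D}_S, \mathcal{D}_T) + \min\{\mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|], \mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|]\}.$$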
Remark The three terms in the upper bound have natural interpretations: the first term is the source error, the second one corresponds to the discrepancy between the marginal distributions, and the third measures the distance between the labeling functions from the source and target domains. Altogether, they form a sufficient condition for the success of domain adaptation: besides a small source error, not only do the marginal distributions need to be close, but so do the labeling functions.
Comparison with Theorem 2.1. It is instructive to compare the bound in Theorem 4.1 with the one in Theorem 2.1. The main difference lies in the in Theorem 2.1 and the in Theorem 4.1. depends on the choice of the hypothesis class , while our term does not. In fact, our quantity reflects the underlying structure of the problem, i.e., the conditional shift. Finally, consider the example given in the left panel of Fig. 1. It is easy to verify that we have in this case, while for a natural class of hypotheses, i.e., , we have . In that case, our bound is tighter than the one in Theorem 2.1.
In the covariate shift setting, where we assume the conditional distributions between the source and target domains are the same, the third term in the upper bound vanishes. In that case the above theorem says that to guarantee successful domain adaptation, it suffices to match the marginal distributions while achieving small error on the source domain. In general settings where the optimal labeling functions of the source and target domains differ, the above bound says that it is not sufficient to simply match the marginal distributions and achieve small error on the source domain. At the same time, we should also guarantee that the optimal labeling functions (or the conditional distributions of both domains) are not too far away from each other. As a side note, it is easy to see that $\mathbb{E}_{\mathcal{D}_S}[|f_S - f_T|] = \varepsilon_S(f_T)$ and $\mathbb{E}_{\mathcal{D}_T}[|f_S - f_T|] = \varepsilon_T(f_S)$. In other words, they are essentially the cross-domain errors. When the cross-domain error is small, it implies that the optimal source (resp. target) labeling function generalizes well on the target (resp. source) domain.
Both the error term and the divergence in Theorem 4.1 are with respect to the true underlying distributions $\mathcal{D}_S$ and $\mathcal{D}_T$, which are not available to us during training. In the following, we shall use the Rademacher complexity to provide for both terms a data-dependent bound from empirical samples from $\mathcal{D}_S$ and $\mathcal{D}_T$.
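Definition 4.1 (Empirical Rademacher Complexity).
Let $\mathcal{H}$ be a family of functions mapping from $\mathcal{X}$ to $[a, b]$ and $S = \{x_i\}_{i=1}^n$ a fixed sample of size $n$ with elements in $\mathcal{X}$. Then, the empirical Rademacher complexity of $\mathcal{H}$ with respect to the sample $S$ is defined as:
$$\mathrm{Rad}_S(\mathcal{H}) := \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)\right],$$
where $\sigma = \{\sigma_i\}_{i=1}^n$ and $\sigma_i$ are i.i.d. uniform random variables taking values in $\{+1, -1\}$.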
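For a finite function class, the expectation over the random signs can be approximated by Monte Carlo. The following minimal sketch assumes a sample of 50 points in [0, 1) and two hypothetical classes (a single constant function, and a grid of threshold functions that contains it); the helper name `empirical_rademacher` and all parameter values are illustrative assumptions.

```python
import random

def empirical_rademacher(functions, sample, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_h (1/n) sum_i sigma_i h(x_i) ]."""
    rng = random.Random(seed)
    n = len(sample)
    # Precompute h(x_i) for every function in the (finite) class.
    values = [[h(x) for x in sample] for h in functions]
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # The supremum over a finite class is a maximum.
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in values) / n
    return total / num_draws

data_rng = random.Random(1)
sample = [data_rng.random() for _ in range(50)]

single = [lambda x: 1]
thresholds = [lambda x, t=k / 10: int(x >= t) for k in range(11)]

rad_single = empirical_rademacher(single, sample)
rad_thresholds = empirical_rademacher(thresholds, sample)
print(rad_single, rad_thresholds)
```

A class with a single function cannot correlate with random signs on average, so its estimate is close to zero, whereas the richer threshold class yields a larger value; this is the sense in which the quantity measures the capacity of the class on the given sample.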
With the empirical Rademacher complexity, we can show that w.h.p., the empirical source error cannot be too far away from the population error for all :
Lemma 4.3.
Let , then for all , w.p.b. at least , the following inequality holds for all : .
Similarly, for any distribution over , let be its empirical distribution from sample of size . Then for any two distributions and , we can also use the empirical Rademacher complexity to provide a datadependent bound for the perturbation between and :
Lemma 4.4.
Let , and be defined above, then for all , w.p.b. at least , the following inequality holds for all : .
Since is a hypothesis class, by definition we have:
Hence combining the above identity with Lemma 4.4, we immediately have w.p.b. at least :
(2) 
Now use a union bound and the fact that satisfies the triangle inequality, we have:
Lemma 4.5.
Let , and be defined above, then for , w.p.b. at least , for :
Combine Lemma 4.3, Lemma 4.5 and Theorem 4.1 with a union bound argument, we get the following main theorem that characterizes an upper bound for domain adaptation:
Theorem 4.2.
Let and be the source and target domains, and let be the empirical source and target distributions constructed from sample , each of size . Then for any function class and :
where .
Essentially, the generalization upper bound can be decomposed into three parts: the first part comes from the domain adaptation setting, including the empirical source error, the empirical divergence, and the shift between labeling functions. The second part corresponds to complexity measures of our hypothesis space and , and the last part describes the error caused by finite samples.
4.3 An Information-Theoretic Lower Bound
In Sec. 4.1, we constructed an example to demonstrate that learning invariant representations could lead to a feature space where the joint error on both domains is large. In this section, we extend the example by showing that a similar result holds in more general settings. Specifically, we shall prove that for any approach based on learning invariant representations, there is an intrinsic lower bound on the joint error of source and target domains, due to the discrepancy between their marginal label distributions. Our result hence highlights the need to take into account task related information when designing domain adaptation algorithms based on learning invariant representations.
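Before we proceed to the lower bound, we first define several information-theoretic concepts that will be used in the analysis. For two distributions $\mathcal{D}$ and $\mathcal{D}'$, the Jensen-Shannon (JS) divergence $D_{JS}(\mathcal{D} \,\|\, \mathcal{D}')$ is defined as:
$$D_{JS}(\mathcal{D} \,\|\, \mathcal{D}') := \frac{1}{2} D_{KL}(\mathcal{D} \,\|\, \mathcal{D}_M) + \frac{1}{2} D_{KL}(\mathcal{D}' \,\|\, \mathcal{D}_M),$$
where $D_{KL}(\cdot \,\|\, \cdot)$ is the Kullback–Leibler (KL) divergence and $\mathcal{D}_M := (\mathcal{D} + \mathcal{D}')/2$. The JS divergence can be viewed as a symmetrized and smoothed version of the KL divergence, and it is closely related to the $L_1$ distance (total variation) between two distributions through Lemma B.3 in the Appendix.
Unlike the KL divergence, the JS divergence is bounded: $0 \le D_{JS}(\mathcal{D} \,\|\, \mathcal{D}') \le 1$. Additionally, from the JS divergence, we can define a distance metric between two distributions as well, known as the JS distance (Endres and Schindelin, 2003):
$$d_{JS}(\mathcal{D}, \mathcal{D}') := \sqrt{D_{JS}(\mathcal{D} \,\|\, \mathcal{D}')}.$$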
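With respect to the JS distance and for any deterministic mapping $g: \mathcal{X} \to \mathcal{Z}$, we can prove the following lemma via the data processing inequality:
Lemma 4.6.
Let $\mathcal{D}_S$ and $\mathcal{D}_T$ be two distributions over $\mathcal{X}$ and let $\mathcal{D}_S^Z$ and $\mathcal{D}_T^Z$ be the induced distributions over $\mathcal{Z}$ by function $g: \mathcal{X} \to \mathcal{Z}$, then
$$d_{JS}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) \le d_{JS}(\mathcal{D}_S, \mathcal{D}_T). \tag{3}$$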
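The lemma can be illustrated on small discrete distributions. The sketch below assumes base-2 logarithms (so that the JS divergence lies in [0, 1]), two made-up distributions over four input values, and a deterministic map that merges inputs pairwise; the helper names `js_distance` and `push_forward` are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence against the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_distance(p, q):
    """Square root of the JS divergence (clamped against rounding error)."""
    return math.sqrt(max(js_divergence(p, q), 0.0))

def push_forward(p, mapping, num_bins):
    """Distribution induced by a deterministic map from input index to bin."""
    out = [0.0] * num_bins
    for i, pi in enumerate(p):
        out[mapping[i]] += pi
    return out

source = [0.4, 0.1, 0.3, 0.2]
target = [0.1, 0.4, 0.2, 0.3]
merge = [0, 0, 1, 1]  # deterministic map: inputs {0, 1} -> 0 and {2, 3} -> 1

d_before = js_distance(source, target)
d_after = js_distance(push_forward(source, merge, 2), push_forward(target, merge, 2))
print(d_before, d_after)
```

Here the map discards exactly the coordinates on which the two distributions differ, so the induced distributions coincide even though the original ones do not. A deterministic transformation can only shrink the JS distance, never enlarge it.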
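For methods that aim to learn invariant representations for domain adaptation, an intermediate representation space $\mathcal{Z}$ is found through feature transformation $g$, based on which a common hypothesis $h: \mathcal{Z} \to \mathcal{Y}$ is shared between both domains (Ganin et al., 2016; Tzeng et al., 2017; Zhao et al., 2018b). Through this process, the following Markov chain holds:
$$X \xrightarrow{\;g\;} Z \xrightarrow{\;h\;} \hat{Y}, \tag{4}$$
where $\hat{Y} = h(g(X))$ is the predicted random variable of interest. Hence for any distribution $\mathcal{D}$ over $\mathcal{X}$, this Markov chain also induces a distribution $\mathcal{D}^Z$ over $\mathcal{Z}$ and $\mathcal{D}^{\hat{Y}}$ over $\mathcal{Y}$. By Lemma 4.6, we know that $d_{JS}(\mathcal{D}_S^{\hat{Y}}, \mathcal{D}_T^{\hat{Y}}) \le d_{JS}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)$. With these notations, noting that the JS distance is a metric, the following inequality holds:
$$d_{JS}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \le d_{JS}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}}) + d_{JS}(\mathcal{D}_S^{\hat{Y}}, \mathcal{D}_T^{\hat{Y}}) + d_{JS}(\mathcal{D}_T^{\hat{Y}}, \mathcal{D}_T^Y).$$
Combining the above inequality with Lemma 4.6, we immediately have:
$$d_{JS}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \le d_{JS}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}}) + d_{JS}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) + d_{JS}(\mathcal{D}_T^{\hat{Y}}, \mathcal{D}_T^Y). \tag{5}$$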
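Intuitively, $d_{JS}(\mathcal{D}_S^Y, \mathcal{D}_S^{\hat{Y}})$ and $d_{JS}(\mathcal{D}_T^Y, \mathcal{D}_T^{\hat{Y}})$ measure the distance between the predicted label distribution and the ground truth label distribution on the source and target domain, respectively. With the help of Lemma B.3, the following result establishes a relationship between $d_{JS}(\mathcal{D}^Y, \mathcal{D}^{\hat{Y}})$ and the accuracy of the prediction function $h \circ g$:
Lemma 4.7.
Let $Y = f(X) \in \{0, 1\}$ where $f(\cdot)$ is the labeling function and $\hat{Y} = h(g(X)) \in \{0, 1\}$ be the prediction function, then $d_{JS}(\mathcal{D}^Y, \mathcal{D}^{\hat{Y}}) \le \sqrt{\varepsilon(h \circ g)}$.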
We are now ready to present the main result of the section:
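Theorem 4.3.
Suppose the Markov chain $X \xrightarrow{\;g\;} Z \xrightarrow{\;h\;} \hat{Y}$ holds, then
$$\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) \ge \frac{1}{2}\left(d_{JS}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) - d_{JS}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)\right)^2.$$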
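Remark This theorem shows that if the marginal label distributions are significantly different between the source and target domains, then in order to achieve a small joint error, the induced distributions over $\mathcal{Z}$ from source and target domains have to be significantly different as well. Put another way, if we are able to find an invariant representation such that $d_{JS}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) = 0$, then the joint error of the composition function $h \circ g$ has to be large:
Corollary 4.1.
Suppose the condition in Theorem 4.3 holds and $d_{JS}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) = 0$, then:
$$\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) \ge \frac{1}{2} d_{JS}(\mathcal{D}_S^Y, \mathcal{D}_T^Y)^2.$$
Remark The lower bound gives us a necessary condition on the success of any domain adaptation approach based on learning invariant representations: if the marginal label distributions are significantly different between source and target domains, then minimizing $d_{JS}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)$ and the source error $\varepsilon_S(h \circ g)$ will only increase the target error. In particular, if the transformation $g$ is rich enough such that $d_{JS}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) = 0$ and $\varepsilon_S(h \circ g) = 0$, then $\varepsilon_T(h \circ g) \ge \frac{1}{2} d_{JS}(\mathcal{D}_S^Y, \mathcal{D}_T^Y)^2$.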
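As a rough numerical illustration of how label shift alone limits the joint error, the sketch below evaluates a lower bound of the form one half of the squared JS distance between the source and target label marginals. That functional form is an assumption read off the surrounding derivation, and the binary label marginals and the function name `joint_error_lower_bound` are made up for the example.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms, so it lies in [0, 1]."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def joint_error_lower_bound(source_labels, target_labels):
    """Assumed bound 0.5 * d_JS^2; the squared JS distance is the JS divergence."""
    return 0.5 * max(js_divergence(source_labels, target_labels), 0.0)

balanced = [0.5, 0.5]  # hypothetical source label marginal
skewed = [0.1, 0.9]    # hypothetical target label marginal

bound_same = joint_error_lower_bound(balanced, balanced)
bound_shift = joint_error_lower_bound(balanced, skewed)
print(bound_same, bound_shift)
```

When the label marginals coincide the bound is vacuous, and it grows as the target label distribution moves away from the source one, regardless of how well the representations are aligned.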
We conclude this section by noting that our bound on the joint error of both domains is not necessarily the tightest one. This can be seen from the example in Sec. 4.1, where $d_{JS}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) = 0$, and we have $\varepsilon_S(h \circ g) + \varepsilon_T(h \circ g) = 1$, but in this case our result gives a trivial lower bound of 0. Nevertheless, our result still sheds new light on the importance of matching marginal label distributions in learning invariant representation for domain adaptation, which we believe to be a promising direction for the design of better adaptation algorithms.
5 Experiments
Our theoretical results on the lower bound of the joint error imply that overtraining the feature transformation function and the discriminator may hurt generalization on the target domain. In this section, we conduct experiments on real-world datasets to verify our theoretical findings. The task is digit classification on three datasets of 10 classes: MNIST (LeCun et al., 1998), USPS (Dheeru and Karra Taniskidou, 2017) and SVHN (Netzer et al., 2011). MNIST contains 60,000/10,000 train/test instances; USPS contains 7,291/2,007 train/test instances, and SVHN contains 73,257/26,032 train/test instances. We show the label distribution of these three datasets in Fig. 2. Among them, only MNIST has a close to uniform label distribution.
Before training, we preprocess all the samples into gray scale singlechannel images of size , so they can be used by the same network. In our experiments, to ensure a fair comparison, we use the same network structure for all the experiments: 2 convolutional layers, one fully connected hidden layer, followed by a softmax output layer with 10 units. The convolution kernels in both layers are of size , with 10 and 20 channels, respectively. The hidden layer has 1280 units connected to 100 units before classification. For domain adaptation, we use the original DANN (Ganin et al., 2016) with gradient reversal implementation. The discriminator in DANN takes the output of convolutional layers as its feature input, followed by a fully connected layer, and a oneunit binary classification output. For all the experiments, we use the Adadelta method for optimization.
We plot four adaptation trajectories in Fig. 3. Among the four adaptation tasks, we can observe two phases in the adaptation accuracy. In the first phase, the test set accuracy rapidly grows, in less than 10 iterations. In the second phase, it gradually decreases after reaching its peak, despite the fact that the source training accuracy keeps increasing smoothly. Those phase transitions can be verified from the negative slopes of the least squares fit of the adaptation curves (dashed lines in Fig. 3). We observe similar phenomena in additional experiments using artificially unbalanced datasets trained on more powerful networks in Appendix C. The above experimental results imply that overtraining the feature transformation and discriminator does not help generalization on the target domain, but can instead hurt it when the label distributions differ (as shown in Fig. 2). These experimental results are consistent with our theoretical findings.
6 Conclusion and Future Work
In this paper we theoretically and empirically study the important problem of learning invariant representations for domain adaptation. We show that learning an invariant representation and achieving a small source error is not enough to guarantee target generalization. We then prove both upper and lower bounds for the target and joint errors, which directly translate to sufficient and necessary conditions for the success of domain adaptation. We believe our results take an important step towards understanding deep domain adaptation, and also stimulate future work on the design of stronger deep domain adaptation algorithms that align conditional distributions. Another interesting direction for future work is to characterize what properties the feature transformation function should have in order to decrease the conditional shift between source and target domains. It is also worth investigating under which conditions the label distributions can be aligned without explicit labeled data from the target domain.
References
 Adel et al. (2017) Tameem Adel, Han Zhao, and Alexander Wong. Unsupervised domain adaptation with a relaxed covariate shift assumption. In AAAI, pages 1691–1697, 2017.
 Ajakan et al. (2014) Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, and Mario Marchand. Domainadversarial neural networks. arXiv preprint arXiv:1412.4446, 2014.
 Azizzadenesheli et al. (2018) Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. 2018.

 Bartlett and Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
 Becker et al. (2013) Carlos J Becker, Christos M Christoudias, and Pascal Fua. Nonlinear domain adaptation with boosting. In Advances in Neural Information Processing Systems, pages 485–493, 2013.
 Ben-David et al. (2007) Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137, 2007.
 Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1–2):151–175, 2010.
 Blitzer et al. (2008) John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129–136, 2008.
 Cortes and Mohri (2014) Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
 Cortes et al. (2008) Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International Conference on Algorithmic Learning Theory, pages 38–53. Springer, 2008.
 Courty et al. (2017a) Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pages 3730–3739, 2017a.
 Courty et al. (2017b) Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–1865, 2017b.
 Dheeru and Karra Taniskidou (2017) Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

 Endres and Schindelin (2003) Dominik Maria Endres and Johannes E Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 2003.
 Fu et al. (2017) Lisheng Fu, Thien Huu Nguyen, Bonan Min, and Ralph Grishman. Domain adaptation for relation extraction with domain adversarial neural network. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 425–429, 2017.
 Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
 Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513–520, 2011.
 Gong et al. (2016) Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In International conference on machine learning, pages 2839–2848, 2016.
 Hoffman et al. (2016) Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
 Hoffman et al. (2017) Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
 Hosseini-Asl et al. (2018) Ehsan Hosseini-Asl, Yingbo Zhou, Caiming Xiong, and Richard Socher. Augmented cyclic adversarial learning for domain adaptation. arXiv preprint arXiv:1807.00374, 2018.
 Kifer et al. (2004) Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30, pages 180–191. VLDB Endowment, 2004.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lee and Raginsky (2018) Jaeho Lee and Maxim Raginsky. Minimax statistical learning with wasserstein distances. In Advances in Neural Information Processing Systems, pages 2692–2701, 2018.
 Lin (1991) Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
 Lipton et al. (2018) Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018.

 Long et al. (2014) Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer joint matching for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1417, 2014.
 Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015.
 Long et al. (2016) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.
 Mansour and Schain (2012) Yishay Mansour and Mariano Schain. Robust domain adaptation. In ISAIM, 2012.
 Mansour et al. (2009a) Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009a.

 Mansour et al. (2009b) Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 367–374. AUAI Press, 2009b.
 Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
 Pei et al. (2018) Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. 2018.
 Shen et al. (2018) Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Shrivastava et al. (2016) Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. Learning from simulated and unsupervised images through adversarial training. arXiv preprint arXiv:1612.07828, 2016.
 Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. arXiv preprint arXiv:1702.05464, 2017.
 Zhang et al. (2013) Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827, 2013.
 Zhang et al. (2017) Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. Aspect-augmented adversarial networks for domain adaptation. arXiv preprint arXiv:1701.00188, 2017.
 Zhao et al. (2017) Han Zhao, Zhenyao Zhu, Junjie Hu, Adam Coates, and Geoff Gordon. Principled hybrids of generative and discriminative domain adaptation. arXiv preprint arXiv:1705.09011, 2017.
 Zhao et al. (2018a) Han Zhao, Shanghang Zhang, Guanhang Wu, Geoffrey J Gordon, et al. Multiple source domain adaptation with adversarial learning. In International Conference on Learning Representations, 2018a.
 Zhao et al. (2018b) Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, pages 8568–8579, 2018b.
Appendix A Missing Proofs
See 4.1
Proof.
By definition, for , we have:
(6) 
Since , then , . We now use Fubini’s theorem to bound :
Now in view of (6) and the definition of , we have:
Combining all the inequalities above finishes the proof. ∎
See 4.2
Proof.
∎
See 4.1
Proof.
See 4.3
Proof.
Consider the source domain . For , define the loss function as . First, we know that where we slightly abuse the notation to mean the family of functions :