Support and Invertibility in Domain-Invariant Representations

03/08/2019 · Fredrik D. Johansson, et al.

Learning domain-invariant representations has become a popular approach to unsupervised domain adaptation and is often justified by invoking a particular suite of theoretical results. We argue that there are two significant flaws in such arguments. First, the results in question hold only for a fixed representation and do not account for information lost in non-invertible transformations. Second, domain invariance is often a far too strict requirement and does not always lead to consistent estimation, even under strong and favorable assumptions. In this work, we give generalization bounds for unsupervised domain adaptation that hold for any representation function by acknowledging the cost of non-invertibility. In addition, we show that penalizing distance between densities is often wasteful and propose a bound based on measuring the extent to which the support of the source domain covers the target domain. We perform experiments on well-known benchmarks that illustrate the shortcomings of current standard practice.


1 Introduction

Domain transfer is a critical component of many machine learning problems: self-driving cars must be robust to changes in weather conditions and landscape; estimates of the efficacy of drugs that pass clinical trials should be valid for the population to which the drugs are prescribed; policies for robotic control learned in simulated environments should be useful in the real world. In so-called unsupervised domain adaptation, labeled data are available only in a limited setting (e.g., driving only in San Francisco, patients restricted to a clinical trial cohort, simulated environments) called the source domain. The context in which models are ultimately applied is called the target domain.

When the label function is assumed stationary and the source and target domains share statistical support, the classical solution to domain adaptation problems is importance sampling (IS) (Shimodaira, 2000). In IS methods, the influence of an observation on the learning algorithm is determined by its likelihood ratio between target and source domains. While asymptotically unbiased, IS estimators suffer from large variance (Cortes et al., 2010) and are inapplicable when the target domain is not covered by the source. The latter is typical of many of the high-dimensional problems addressed in modern machine learning.

Domain-invariant representations have emerged as new, widely-used tools for domain transfer (Ben-David et al., 2007; Ganin et al., 2016; Long et al., 2015) in problems where the label function is assumed fixed, but the covariate distribution changes between domains—so-called covariate shift. These methods work by uncovering predictive components of data that are distributed similarly across domains—an idea that has been justified by a string of theoretical work (Ben-David et al., 2007; Mansour et al., 2009; Ben-David et al., 2010a; Cortes & Mohri, 2011). Crucially, these bounds do not rely on common support. Related ideas have been applied also under target (label) shift (Gong et al., 2016).

Given the prevalence of algorithms learning domain-invariant representations, we ask: Under what conditions do these algorithms recover an optimal hypothesis? What are potential failure modes? We argue that there is discord between existing theoretical guarantees, how they are used to justify learning algorithms, and how these algorithms perform empirically. In particular, we give small examples in which a) the objective of many algorithms is minimal but the target error is arbitrarily bad, and b) empirical performance is good but generalization bounds are surprisingly large.

First, we argue that regularizing representations to be domain invariant is too strict, in particular when domains (partially) overlap. We support this claim by giving examples where empirical risk minimization on source data alone outperforms domain-invariant representation learning algorithms. As an alternative, we give a generalization bound that measures the lack of overlapping support between domains. Our bound applies directly to learned representations and is tight when source and target domains are equal.

Second, for domain-invariant representation learning to succeed, the label must be predictable from the learned representation. When representations are regularized to reduce domain discrepancy, the class of admissible hypotheses shrinks and predictions worsen. This phenomenon may be asymmetric: a representation may be more suitable for the source domain than the target domain. We use this insight to characterize the unobservable adaptation error from losing information in non-invertible representations.

Finally, we study the performance of domain-invariant representation learning on a well-known benchmark task through the lens of our theoretical findings.

2 Background

We study unsupervised domain adaptation, defined as follows. Samples of features $x \in \mathcal{X}$ and labels $y \in \mathcal{Y}$ are observed from a source domain, distributed according to a density $p_S(x, y)$. In addition, we observe unlabeled samples from a target domain, distributed according to a density $p_T(x)$. Unobserved labels in the target domain are distributed according to $p_T(y \mid x)$. Based on the labeled source sample and the unlabeled target sample, the unsupervised domain adaptation problem is to obtain an hypothesis $h$ that minimizes the target risk, as measured by a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$,

$$R_T^\ell(h) = \mathbb{E}_{(x, y) \sim p_T}\big[\ell(h(x), y)\big]~. \qquad (1)$$

Analogously to (1), we define the source risk as $R_S^\ell(h) = \mathbb{E}_{(x, y) \sim p_S}[\ell(h(x), y)]$. When clear from context, we leave out the superscript indicating the loss function. In the sequel, unless otherwise stated, we let $\ell$ be the zero-one loss, $\ell(y, y') = 1[y \neq y']$. We call $R_T(h) - R_S(h)$ the adaptation error.

In this work, we make the covariate shift assumption which states that the conditional density of labels given features is stationary across domains. This is justifiable in some problems but not in all (Gong et al., 2016).

Assumption 1 (Covariate shift).

Domains $S$ and $T$ satisfy the covariate shift assumption if $p_S(y \mid x) = p_T(y \mid x)$ for all $x \in \mathcal{X}$.

We say that the conditional label distribution is realizable in a hypothesis class $\mathcal{H}$ if there is an $h \in \mathcal{H}$ consistent with it, and that it is identifiable over a set of assumptions $\mathcal{A}$ if, under $\mathcal{A}$, a function $h$ may be obtained based on knowledge of $p_S(x, y)$ and $p_T(x)$ alone such that $h$ is consistent with $p_T(y \mid x)$. In the case of deterministic hypotheses and labels, or when only label expectations are of interest, we may substitute conditional densities with appropriate mappings.

As no labels are observed from the target domain, models that minimize risk (only) on the source domain are often biased. There are two common alternative strategies to minimize target risk: importance-weighting and minimization of upper bounds on the target risk.

2.1 Importance weighting

Under Assumption 1 (covariate shift), the target risk may be approximated using importance-weighted samples from the source density (Shimodaira, 2000),

$$\hat{R}_T^w(h) = \frac{1}{n} \sum_{i=1}^{n} w(x_i)\, \ell\big(h(x_i), y_i\big), \qquad (x_i, y_i) \sim p_S~. \qquad (2)$$

If the weighting function is chosen to be $w(x) = p_T(x)/p_S(x)$, $\hat{R}_T^w(h)$ is a consistent estimator of $R_T(h)$ under the following assumption.

Assumption 2 (Sufficient support).

We say that $p_S$ has $\epsilon$-sufficient support for $p_T$ if $p_S(x) \ge \epsilon$ at every point $x$ where $p_T(x) > p_S(x)$, with $\epsilon > 0$. This is also called $\epsilon$-overlap.

Cortes et al. (2010) give generalization bounds for importance-weighted estimates such as (2). These estimates have high variance when the largest $\epsilon$ satisfying Assumption 2 is small; if there is no such $\epsilon$, importance weighting is inapplicable without modification.
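To make the estimator in (2) concrete, the sketch below computes a clipped importance-weighted estimate of the target risk under the zero-one loss. It is a minimal illustration rather than code from the paper; the Gaussian densities, the threshold labeling rule and the clipping constant are hypothetical choices.

```python
import numpy as np
from scipy.stats import norm

def importance_weighted_risk(h, X_src, y_src, w, clip=None):
    """Estimate the target risk of h from labeled source samples,
    reweighting each point by w(x), ideally the ratio p_T(x) / p_S(x)."""
    weights = np.array([w(x) for x in X_src], dtype=float)
    if clip is not None:
        weights = np.minimum(weights, clip)  # trade bias for variance when ratios explode
    losses = np.array([float(h(x) != y) for x, y in zip(X_src, y_src)])  # zero-one loss
    return float(np.mean(weights * losses))

# Toy setting: source N(0, 1), target N(1, 1), shared labeling rule y = 1[x > 0.3].
rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=5000)
y_src = (X_src > 0.3).astype(int)
w = lambda x: norm.pdf(x, loc=1.0, scale=1.0) / norm.pdf(x, loc=0.0, scale=1.0)
h = lambda x: int(x > 0.5)
print(importance_weighted_risk(h, X_src, y_src, w, clip=20.0))
```

Clipping the weights is a common variance-control heuristic; without it, a few source points lying where the target density dominates can dominate the estimate, which is the variance issue analyzed by Cortes et al. (2010).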

2.2 Upper bounds on target risk

When the source domain does not provide sufficient support, the target risk of a learned hypothesis may not be consistently estimated without further assumptions. However, we may bound the target risk from above and minimize this bound.

Ben-David et al. (2007) introduced the $\mathcal{H}$-distance to measure the worst-case loss from extrapolating between domains using binary hypotheses in a class $\mathcal{H}$. Let $R_D(h, h') = \mathbb{E}_{x \sim p_D}\big[1[h(x) \neq h'(x)]\big]$ denote the expected disagreement between two hypotheses $h, h' \in \mathcal{H}$ under a domain density $p_D$. Then, the $\mathcal{H}$-distance between $p_S$ and $p_T$ is¹

$$d_{\mathcal{H}}(p_S, p_T) = \sup_{h, h' \in \mathcal{H}} \big|\, R_S(h, h') - R_T(h, h') \,\big|~. \qquad (3)$$

¹The definition is sometimes given with a factor 2.

By reducing the model class $\mathcal{H}$, the potential disagreement between member functions may be reduced, as may the capacity of $\mathcal{H}$ to predict the label. The best-in-class joint hypothesis risk is

$$\lambda_{\mathcal{H}} = \min_{h \in \mathcal{H}} \big( R_S(h) + R_T(h) \big)~. \qquad (4)$$

These quantities lead to the following bound by applying the triangle inequality of classification error.

Theorem 1 (Adaptation bound by Ben-David et al. (2010a)).

Under Assumption 1, for all $h \in \mathcal{H}$,

$$R_T(h) \le R_S(h) + d_{\mathcal{H}}(p_S, p_T) + \lambda_{\mathcal{H}}~. \qquad (5)$$

The result (5) may be bounded further based on the risk on a sample from the source domain and an empirical estimate of $d_{\mathcal{H}}$ (Ben-David et al., 2010a). Similar results have also been obtained for continuous labels (Mansour et al., 2009; Cortes & Mohri, 2011).
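In practice, the quantities in Theorem 1 are estimated from samples; the $\mathcal{H}$-distance in (3), in particular, is commonly gauged by training a domain classifier to separate source from target points (sometimes called a proxy $\mathcal{A}$-distance). The sketch below uses a linear classifier as a stand-in for the class $\mathcal{H}$; the function name and the specific rescaling are our choices, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_h_distance(X_src, X_tgt, seed=0):
    """Proxy for the H-distance: train a classifier to tell source from target.
    If the domains are hard to separate within the chosen class, the proxy is near 0."""
    X = np.vstack([X_src, X_tgt])
    d = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])  # domain labels
    X_tr, X_te, d_tr, d_te = train_test_split(X, d, test_size=0.5, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, d_tr)
    domain_error = 1.0 - clf.score(X_te, d_te)
    return 2.0 * (1.0 - 2.0 * domain_error)  # proxy A-distance style rescaling
```

A value near 2 means the domains are easily separated within the chosen class, while a value near 0 means they are approximately indistinguishable, which is what the representation learning methods of Section 2.3 aim for in the representation space.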

2.3 Domain-invariant representation learning

Theorem 1, and a suite of follow-up work, have been used to justify algorithms based on learning domain-invariant representations—transformations of features such that the source and target domains are approximately indistinguishable in the transformed space (Ben-David et al., 2007). We describe these below.

Let a random variable $Z = \Phi(X)$ be a representation of the input features $X$, parameterized by a deterministic function $\Phi : \mathcal{X} \to \mathcal{Z}$. Hypotheses for the label are formed by compositions $h = g \circ \Phi$ with prediction functions $g : \mathcal{Z} \to \mathcal{Y}$ operating in the representation space $\mathcal{Z}$. The probability of a set $A \subseteq \mathcal{Z}$ induced by $\Phi$ is then $p^\Phi(A) = p(\{x : \Phi(x) \in A\})$. If $\Phi$ does not induce atoms, $p^\Phi$ admits a density; we consider only this case in the sequel.

We say that a representation $\Phi$ of $X$ is domain-invariant if $p_S^\Phi = p_T^\Phi$. A common approach to learning approximately domain-invariant representations is to solve the following problem²

$$\min_{\Phi, g}\;\; \hat{R}_S(g \circ \Phi) + \lambda\, d\big(\hat{p}_S^\Phi, \hat{p}_T^\Phi\big)~. \qquad (6)$$

Here, $\hat{p}_S^\Phi$ and $\hat{p}_T^\Phi$ denote empirical distributions of the representation under the source and target domains, $d$ is a distance function on densities, and $\lambda \ge 0$ is a hyperparameter. In the next section, we describe several instantiations of (6).

²We leave out additional regularization of $\Phi$ and $g$.
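As one concrete instance of (6), the sketch below takes the distance $d$ to be a biased RBF-kernel MMD estimate between mini-batches of source and target representations and performs a single gradient step on the penalized source risk. This is a minimal PyTorch sketch under our own naming (phi, g, lam, sigma); adversarial instantiations discussed in the next section replace the MMD term with a domain-classifier loss.

```python
import torch
import torch.nn.functional as F

def rbf_mmd2(z_s, z_t, sigma=1.0):
    """Biased estimate of the squared MMD between two batches of representations
    under a Gaussian RBF kernel (Gretton et al., 2012)."""
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return gram(z_s, z_s).mean() + gram(z_t, z_t).mean() - 2 * gram(z_s, z_t).mean()

def training_step(phi, g, opt, x_s, y_s, x_t, lam=1.0, sigma=1.0):
    """One step on an objective of the form (6): empirical source risk plus
    lam times a distance between the induced source and target representations."""
    z_s, z_t = phi(x_s), phi(x_t)                       # shared representation
    loss = F.cross_entropy(g(z_s), y_s) + lam * rbf_mmd2(z_s, z_t, sigma)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss.item())
```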

3 Related work

Domain adaptation has been studied primarily under Assumption 1 (covariate shift) (Pan et al., 2010), which is also the setting of this work. However, prediction under shift in the target and conditional distributions has also been considered (Zhang et al., 2013; Gong et al., 2016; Lipton et al., 2018). A common approach in both settings is to learn representations or projections of observed data that are invariant to the shift in question by minimizing adversarial losses (Ganin & Lempitsky, 2015; Bousmalis et al., 2016; Tzeng et al., 2017), integral probability metrics such as the maximum mean discrepancy (MMD) (Pan et al., 2011; Long et al., 2015, 2016; Baktashmotlagh et al., 2013) and the Wasserstein distance (Shalit et al., 2016; Courty et al., 2017), or other divergences (Berisha et al., 2016; Si et al., 2010; Muandet et al., 2013).

Many recent methods attempt to solve domain adaptation under covariate shift by optimizing objectives similar to (6) (Ganin & Lempitsky, 2015; Long et al., 2015; Bousmalis et al., 2016), with the distance $d$ chosen to be a proper metric, for which $d(p, q) = 0$ implies $p = q$. However, Gong et al. (2016) point out that it is not clear under what conditions invariance of the representation marginals would imply invariance of the conditional label distribution given the representation. Ben-David et al. (2010b) showed that Assumption 1 (covariate shift) and a small $d_{\mathcal{H}}$ are not sufficient on their own to identify the labeling function. On the other hand, Ben-David & Urner (2012) showed that Assumption 2 (sufficient support) is sufficient, and established its necessity by counterexample through a reduction of the Left-Right problem (Kelly et al., 2010). Ben-David & Urner (2014) subsequently gave both upper and lower learning bounds for nearest-neighbor learners under Assumption 2 and so-called probabilistic Lipschitzness.

Next, we argue that searching for a representation $\Phi$ such that $p_S^\Phi = p_T^\Phi$ is often undesirable, and that learning objectives in the style of (6) are insensitive to information lost in domain-invariant representations.

4 Limitations of domain-invariant representation learning

In this section, we give concrete examples of the failure modes of domain-invariant representation learning, and propose a shift in focus for future research.

4.1 Representation-induced adaptation error

When features are high dimensional, they often contain information that is redundant or irrelevant for predicting the label but distinguishes the source and target domains; the higher the dimensionality, the less likely overlap is to hold (D’Amour et al., 2017). The adaptation bounds reviewed in the previous section suggest that removing such information may reduce the difference between source and target risk by making the domains closer in density. However, doing so may also introduce an unobservable error, as we see in the following example.

Figure 1: Illustration of Example 1 in which there are two optimal solutions to (6) with objective value 0 but with radically different target risk.
Example 1 (Variable selection).

Let $X \in \mathbb{R}^2$ with $Y = 1$ if $X$ is in the lower left or upper right quadrants, $x_1 x_2 \ge 0$, and $Y = 0$ if $X$ is in the upper left or lower right quadrants, $x_1 x_2 < 0$. Further, let the source and target densities be supported on different parts of these quadrants, as illustrated in Figure 1. Now, let the representations be the variable selections from $\mathbb{R}^2$ to $\mathbb{R}$, $\Phi_1(x) = x_1$ and $\Phi_2(x) = x_2$, and let the prediction functions be the threshold functions on $\mathbb{R}$. Then, for either selection of a single variable, there is a threshold function achieving zero source risk and zero distance between the induced source and target marginals; however, one of the resulting hypotheses has zero target risk while the other has maximal target risk. Hence, objective (6) is uninformative of the target risk.

Example 1 illustrates the impossibility of domain adaptation without overlap or other additional assumptions. Based on the observed data, there is nothing that distinguishes a failure case with maximum target risk from a successful case with minimal target risk. This is true even though the problem satisfies the following strong condition.

Assumption 3 (Optimal domain-invariant representation).

There exist a representation $\Phi$ and a prediction function $g$ such that $p_S^\Phi = p_T^\Phi$ and $R_S(g \circ \Phi) = R_T(g \circ \Phi) = 0$.

Assumption 3 is by no means guaranteed to hold in practice. Often, variables that are distributed differently across domains are critical for prediction. Regardless, Assumption 3 is necessary for domain-invariant representation learning to be consistent. However, as strong as this assumption is, it is not sufficient for consistent domain adaptation—not with domain-invariant representations nor with any other method.

Even in problems that are possible to solve consistently, a learned representation may be more predictive on the source domain than on the target domain. To reason about this case, we must apply Theorem 1 to the hypothesis space $\mathcal{G}$ induced by the representation $\Phi$. Then, for all $g \in \mathcal{G}$,

$$R_T(g \circ \Phi) \le R_S(g \circ \Phi) + d_{\mathcal{G}}\big(p_S^\Phi, p_T^\Phi\big) + \lambda_{\mathcal{G}}^\Phi~, \qquad (7)$$

where $\lambda_{\mathcal{G}}^\Phi = \min_{g' \in \mathcal{G}} \big( R_S(g' \circ \Phi) + R_T(g' \circ \Phi) \big)$. Here, $R_S(g \circ \Phi)$ and $d_{\mathcal{G}}(p_S^\Phi, p_T^\Phi)$ may be bounded and minimized but, in contrast, $\lambda_{\mathcal{G}}^\Phi$ is unobserved and may increase when solving (6).

Proposition 1.

For all as defined above, we have with and

(8)
(9)
Proof.

The results follow immediately from the definitions of and , that , and that . ∎

As a result of Proposition 1, solving (6) implies neither minimization of the RHS of (5) nor of (7). One interpretation of this result, and of Example 1, is that covariate shift (Assumption 1) need not hold with respect to the representation $Z = \Phi(X)$, even if it does with respect to $X$: in general, $p_S(y \mid z)$ and $p_T(y \mid z)$ need not coincide, and equality is guaranteed for general domains only if $\Phi$ is invertible. In Section 5, we define a quantity that measures the effect of this discrepancy and how it relates to invertibility, and use it to bound the target risk.

We summarize this section in a statement inspired by Lemma 2 in Bareinboim & Pearl (2013).

If there are two distinct hypotheses for the label that are both consistent with $p_S(x, y)$ and $p_T(x)$ and a set of assumptions $\mathcal{A}$, but result in different predictions on the target domain, the label is not identifiable over $\mathcal{A}$.

Like causal inference (Pearl, 2009), successful domain adaptation is often entirely reliant on making appropriate assumptions about unobservable quantities.

4.2 The cost of domain invariance

Figure 2: Examples illustrating the counter-intuitive effects of using density distance metrics to regularize domain adaptation methods. Panels: (a) Problem A; (b) Problem B; (c) MMD for varying bandwidth $\sigma$. Despite the fact that sufficient support is satisfied in Problem A, typical adaptation bounds (using, e.g., the RBF-kernel MMD, see panel (c)) are smaller for Problem B than for Problem A. In contrast, our proposed support sufficiency divergence (see Section 5) is 0 in Problem A and 0.33 in Problem B.

A desired property of adaptation bounds is that they are as tight as possible when Assumption 2 (sufficient support) holds, since consistent estimation is possible in this setting (Ben-David & Urner, 2012).³ However, bounds based on Theorem 1 do not always have this property, and their looseness is often independent of the observed risk on the source domain. We give an example of how unintuitive this can be below.

³By "consistency", we refer to the convergence of an estimate to a quantity of interest given enough samples.

Example 2.

We illustrate two examples of source and target densities in Figure 2, along with the estimated maximum mean discrepancy (MMD) (Gretton et al., 2012) between domains for a Gaussian RBF kernel with varying bandwidth $\sigma$. The MMD has been used to bound distances between domains and the target risk in Gretton et al. (2009); Long et al. (2015); Pan et al. (2011); Gong et al. (2016), among others. Despite there being a significant lack of overlap between the supports of the source and target domains in Problem B, the MMD is smaller than in Problem A, in which the support of the target domain is completely covered by the source density. However, Problem A satisfies sufficient assumptions for identifiability, whereas Problem B does not. This illustrates a drawback of representation learning methods that penalize distributional distance between domains.
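The bandwidth sweep in panel (c) of Figure 2 can be computed in a few lines. The densities below are hypothetical stand-ins rather than the ones plotted in the figure; the point is only to show how the RBF-kernel MMD estimate of Gretton et al. (2012) is computed and how strongly it depends on $\sigma$.

```python
import numpy as np

def rbf_mmd2(x, y, sigma):
    """Biased squared-MMD estimate between two 1-D samples with an RBF kernel."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
# Problem-"A"-like pair: target support covered by a wider source.
src_a, tgt_a = rng.uniform(-2, 2, 2000), rng.uniform(-1, 1, 2000)
# Problem-"B"-like pair: source and target barely overlap.
src_b, tgt_b = rng.uniform(-2, 0.1, 2000), rng.uniform(-0.1, 2, 2000)

for sigma in [0.1, 0.5, 1.0, 2.0]:
    print(f"sigma={sigma}: A={rbf_mmd2(src_a, tgt_a, sigma):.3f}, "
          f"B={rbf_mmd2(src_b, tgt_b, sigma):.3f}")
```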

The problem illustrated in Example 2 has practical consequences, as we see in Section 6.2. When label marginal distributions differ in a classification task, but domains partially overlap, requiring domain invariance is often too strict. In fact, in our examples, training using only source labels often does better than domain-invariant representation learning.

5 A new support-based bound

We proceed to bound the target risk of an hypothesis in terms of its error on the source domain and the expected lack of sufficient support. This bound is aimed at overcoming limitations of existing bounds by a) explicitly characterizing the risk induced by non-invertible representations and b) avoiding unnecessary side effects of domain invariance.

We say that there is a lack of sufficient support at a point $x$ if the target density is larger than the source density and the source density is small, as defined by $\epsilon$,

$$p_T(x) > p_S(x) \quad \text{and} \quad p_S(x) < \epsilon~. \qquad (10)$$

We let $\mathcal{X}_\epsilon \subseteq \mathcal{X}$ serve as short-hand for the set of points satisfying (10). Below, we define the support sufficiency divergence.

Definition 1.

For distributions $p_S, p_T$ on $\mathcal{X}$, the support sufficiency divergence from $p_S$ to $p_T$ is defined by

$$D_\epsilon(p_S \,\|\, p_T) = \int_{\mathcal{X}_\epsilon} \big(p_T(x) - p_S(x)\big)\, dx~.$$

Note that $D_\epsilon$ is not symmetric in general, but it is for sufficiently large $\epsilon$. Crucially, however, it is 0 also when $p_S \neq p_T$ for some choices of $\epsilon$, provided that $\mathcal{X}_\epsilon$ is empty. Further, it holds that $0 \le D_\epsilon(p_S \,\|\, p_T) \le 1$ and the bounds are tight (see Appendix A.1 for a proof).
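For concreteness, the sketch below evaluates the divergence numerically on a grid for one-dimensional densities, assuming the form given in Definition 1 above. The grid-based integration and the uniform example densities are our own illustration (they echo, but do not reproduce, the densities of Figure 2); a third of the target mass falls outside the source support, giving a value of about 0.33.

```python
import numpy as np

def support_sufficiency_divergence(p_s, p_t, xs, eps):
    """Grid-based evaluation of the support sufficiency divergence: the excess
    target mass, p_T - p_S, accumulated where p_T exceeds p_S and p_S < eps."""
    ps, pt = p_s(xs), p_t(xs)
    lacking = (pt > ps) & (ps < eps)   # points lacking sufficient support, as in (10)
    dx = xs[1] - xs[0]
    return float(np.sum(pt[lacking] - ps[lacking]) * dx)

# Two uniform densities whose supports overlap on two thirds of their width.
xs = np.linspace(-1.0, 3.0, 40001)
p_s = lambda x: np.where((x >= 0.0) & (x <= 1.5), 2 / 3, 0.0)
p_t = lambda x: np.where((x >= 0.5) & (x <= 2.0), 2 / 3, 0.0)
print(support_sufficiency_divergence(p_s, p_t, xs, eps=0.05))  # ~0.33
```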

Our main result builds on the idea that we can expect an hypothesis to be accurate on the target domain in regions where the source density is sufficiently high. First, let $w$ be a weighting function such that

(11)

We may state the following result.

Lemma 1.

Let be densities over . Further, let be a function such that . Then, with ,

Equality holds if or if . The second term is 0 if and only if Assumption 2 holds with , by definition. The proof can be found in Appendix A.2.

Before we state our main result, we define a measure of the impact of non-invertibility in representations.

Definition 2.

Given are domains and , a prediction function , a label , a loss and a representation . Let

Then, the excess target information loss is

We say that the information loss induced by the representation is symmetric if . Both and are always 0 for invertible . Note also that may be negative, although we don’t expect this in practice as we explain later.

By Lemma 1 and Definition 2, we have the following.

Theorem 2.

Consider any feature representation with and prediction function , and define . Further, let and be the two distributions induced by the representation applied to distributed according to . Further, assume that for any hypothesis and a loss function , . For any ,

(12)
  • For any , we have that

    By adding and subtracting ,

    The last two terms equal as by Assumption 1. Note that the marginal density over is equal in both of the last terms. The first term may be decomposed by the support of . With , we get

    Adding and subtracting , we get

    Bounding the second term by and removing the third non-positive term, we obtain the result. For a full proof, see Appendix A.3.∎

Theorem 2 is consistent with our intuition that increasing the sufficiency of the support of for leads to better adaptation. If this overlap is increased without losing information, such as through collection of additional samples, this is usually preferable.

Unlike bounds based on the triangle inequality (Ben-David et al., 2010a; Mansour et al., 2009; Cortes & Mohri, 2011), the bound in Theorem 2 is tight when $p_S = p_T$. On the other hand, when the supports of $p_S$ and $p_T$ are completely disjoint, the bound is non-informative. In Section 5.1 we obtain a tighter bound for the disjoint case by incorporating additional assumptions. In many problems, however, there is partial overlap, such as under label marginal shift.

For domains with common and bounded support, $w$ may be chosen such that minimizing the bound of Theorem 2 reduces to importance sampling. In fact, we may view Theorem 2 as a middle ground between importance sampling estimates and upper bounds on the target risk, using importance sampling where feasible. The choice of $w$ trades off the sizes of the two middle terms in (12): making one term smaller makes the other larger, and vice versa. Additionally, for certain choices of $w$, the first term vanishes entirely. The choice of $w$ also affects the variance of Monte-Carlo estimates of these terms. If $w$ is close to the density ratio $p_T/p_S$, the weights are potentially large, and the variance increases (Cortes et al., 2010).

When $\Phi$ is invertible, the excess target information loss is 0, as no information about the input is discarded. Shalit et al. (2016) gave a bound based on integral probability metrics in the style of Theorem 1, with the additional restriction that $\Phi$ is invertible. However, this is a strong restriction, as such a $\Phi$ cannot increase the sufficiency of support w.r.t. the source and target domains. We conjecture that, under appropriate assumptions of smoothness, the excess loss is larger for less invertible $\Phi$; by encouraging $\Phi$ to be near-invertible, this is mitigated. This would serve as justification for reconstruction losses used by, for example, Bousmalis et al. (2016). Alternatively, the excess loss is 0 if any information lost in $\Phi$ is equally important for predicting labels in the source domain as in the target domain. If this is always true, Assumption 3 is sufficient for identification of the label.

5.1 Incorporating assumptions on the loss

In Theorem 2, the loss at points outside of the overlap between domains is bounded from above by a constant. As a result, the bound is uninformative for disjoint domains. If prior knowledge about the label function is available, we may address this by making assumptions about how the label function extrapolates, akin to Theorem 1. Below, we give an alternative bound based on an assumption that the loss of hypotheses using a representation belongs to a known family of functions $\mathcal{F}$. Critically, this new bound remains qualitatively different from previous work as a) it penalizes extrapolation between domains only in regions where the source density is low and b) it explicitly characterizes the excess target risk due to information lost in the learned representation.

Definition 3.

We define the integral probability metric (IPM) support sufficiency divergence between densities on $\mathcal{X}$ with respect to a class of functions $\mathcal{F}$ by

(13)

where .

Theorem 3.

Assume that for any representation , and any , . Under the conditions of Theorem 2, we have

Remark 1.

Theorem 3 provides a tighter bound than Theorem 2 at the cost of stronger assumptions. With , and ,

For the first inequality to be tight, the maximizer of (13) must be flexible enough to always be equal to when and always equal to when . This is unlikely to be true of the actual loss when the supports of $p_S$ and $p_T$ overlap. Instead, it is common to assume that the loss obeys some smoothness conditions. In Appendix A.4 we show how the divergence may be estimated using kernel evaluations if $\mathcal{F}$ is a reproducing-kernel Hilbert space, following Gretton et al. (2012).

6 Empirical results

Figure 3: A benchmark for domain adaptation, MNIST→MNIST-M (Ganin & Lempitsky, 2015). Panels: (a) a “0” in MNIST; (b) a “0” in MNIST-M.

We revisit previous empirical results in light of our theoretical findings, with emphasis on Domain-Adversarial Neural Networks (DANN) by Ganin et al. (2016).⁴

⁴Our implementation is based on that of https://github.com/pumpikano/tf-dann.

Figure 4: Left: Target error as a function of the marginal label distribution, for MNIST→MNIST (a) and MNIST→MNIST-M (b). For each setting, a DANN model is trained on unlabeled target data and labeled source data. We compare the accuracy of this model to a model tuned on target labels but with a fixed representation given by the first model. Different lines of the same color indicate different values of the penalty strength $\lambda$. Right (c): Embeddings learned by DANN on MNIST→MNIST-M with equal (top) and unequal (bottom) label marginal distributions. In MNIST-M, all images of digits 0, 1, 2 have been removed. Grey digits are from the source domain and black digits from the target domain.

6.1 Plausibility of sufficient assumptions

The most common benchmarks for domain adaptation algorithms are computer vision and natural language processing tasks. One example is the MNIST→MNIST-M task (Ganin & Lempitsky, 2015), in which the goal is to learn to classify handwritten digits overlayed with random photographs (MNIST-M) based on labeled images of digits alone (MNIST) (LeCun et al., 1998) (see Figure 3). For this task, we can immediately rule out Assumption 2 of sufficient support, as MNIST-M images are full-color images that have measure 0 in MNIST. Still, previous work has reported target accuracies well above source-only training when using unlabeled target data, though below what is achievable with labeled target data (Ganin & Lempitsky, 2015; Bousmalis et al., 2016). These results support Assumption 3: that there exists a domain-invariant representation in which the labeling function is approximately realizable.

6.2 Contrasting support and domain variance

When label marginal distributions differ under covariate shift, $p_S(y) \neq p_T(y)$, such as when objects of a certain class appear more often in one domain, a distance between the feature marginals, $p_S(x)$ and $p_T(x)$, is induced. Authors have studied this restricted setting in detail (Zhang et al., 2013; Lipton et al., 2018). If, additionally, the target domain is made up of a subset of the source domain, encouraging domain invariance may cause more harm than good. We study a) the performance of DANN models under domain shift with sufficient support, and b) the realizability of the label in the learned representation.

We create a task in which the source domain is the standard MNIST dataset and the target domain is a version of MNIST for which domain shift is induced by successively removing digit classes from the support of the target domain, leaving the source domain fixed. In this setup, the support of the target domain is contained in the source domain, and empirical risk minimization based on source data alone should be a good baseline. We compare to the case where the target is replaced by MNIST-M, but perturbed in the same way.
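A minimal sketch of this construction: starting from the target data, listed digit classes are removed (or heavily subsampled) to perturb the label marginal while the source is left untouched. The function name, the drop_classes argument and the keep_fraction knob are ours, not the paper's.

```python
import numpy as np

def shift_label_marginal(X, y, drop_classes=(), keep_fraction=0.0, seed=0):
    """Induce label marginal shift in the target domain by removing (or heavily
    subsampling) the listed digit classes, leaving all other classes untouched."""
    rng = np.random.default_rng(seed)
    keep = np.ones(len(y), dtype=bool)
    for c in drop_classes:
        idx = np.where(y == c)[0]
        n_keep = int(keep_fraction * len(idx))
        dropped = rng.permutation(idx)[n_keep:]   # indices of class c to remove
        keep[dropped] = False
    return X[keep], y[keep]

# Example: remove digits 0, 1 and 2 from the target domain entirely.
# X_tgt, y_tgt = shift_label_marginal(X_tgt, y_tgt, drop_classes=(0, 1, 2))
```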

The DANN model optimizes (6), with an adversarial neural network classifying images by domain, and a hyperparameter $\lambda$ controlling the strength of this penalty in the objective. In this way, we interpolate between empirical risk minimization ($\lambda = 0$), the standard DANN formulation (intermediate $\lambda$) and prioritizing domain invariance (large $\lambda$). We compare the error of two different models: 1) the standard DANN estimator, and 2) a model in which the learned representation from 1) is fixed and the prediction function is fit to the target labels (Tuned). The latter serves to give an upper bound on the best-case risk when predicting from the representations learned by DANN.

In Figure 4, we observe that models trained without target supervision (DANN) perform steadily worse on MNIST→MNIST the more the label marginal distribution is perturbed. This holds also for MNIST→MNIST-M, where sufficient support is not satisfied. There, DANN is beneficial for small label shift, but eventually does no better than a model trained using only source data. Learning with a domain-adversarial loss appears to have little impact on the realizability of the target label in the representation; the target-tuned models achieve almost as good performance as the fully target-trained lower bound. In Figure 4(c), we see that the embeddings learned by DANN models under label marginal shift show worse separation between classes than the embeddings learned under equal label marginal distributions (see Appendix C).

7 Discussion

We have studied algorithms for unsupervised domain adaptation based on domain-invariant representation learning and the theoretical arguments used to support them. We find that, despite empirical success, the theoretical justification of these algorithms is flawed in that oft-cited generalization bounds are not minimized by the learned representations. In particular, the literature has failed to characterize conditions under which domain-invariant representations lead to consistent estimation. We have found through examples and experiments on domain adaptation benchmarks that domain invariance is often too strong a requirement for learning, both when there is overlap between domains and when there is not. This stems from the fact that overlapping support is sufficient for domain transfer, and equality in densities is not necessary.

We have proposed alternative bounds that measure distance in support instead of density and that explicitly recognize the loss incurred by non-invertible representations. Our bounds suggest several ways to design new algorithms. First, minimizing the second term in our bound, the support sufficiency divergence, may be achieved by replacing indicator functions with hinge losses (see Appendix B). This increases the looseness of the bound, but makes its derivative informative, as sketched below. In the same spirit, we may design new heuristics that regularize representations only at points where the source density is much smaller than the target density. Second, while the excess adaptation error induced by learning non-invertible transformations is unobservable, it is associated with the information loss of the representation. To avoid this, we may attempt to maintain a small excess by imposing a reconstruction loss on the representation, similar to Bousmalis et al. (2016).
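To illustrate the first suggestion, the snippet below replaces the hard indicator of low source density with a hinge-style upper bound that is loose but has an informative gradient. The particular surrogate and its slack parameter gamma are assumptions on our part; the relaxation actually used is the one in Appendix B, which is not reproduced here.

```python
import numpy as np

def hinge_indicator(p_s, eps, gamma=0.1):
    """Differentiable upper bound on the indicator 1[p_s < eps]:
    at least 1 when p_s <= eps, decaying linearly to 0 at p_s = eps + gamma."""
    return np.maximum(0.0, 1.0 - (p_s - eps) / gamma)

# The hinge dominates the indicator, so substituting it keeps the bound valid
# while making its gradient with respect to the representation informative.
p_s = np.linspace(0.0, 0.5, 6)
print(np.stack([p_s, (p_s < 0.05).astype(float), hinge_indicator(p_s, 0.05)]))
```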

Acknowledgements

We thank Zach Lipton, Alexander D’Amour, Christina X Ji and Hunter Lang for insightful feedback. This work was supported in part by Office of Naval Research Award No. N00014-17-1-2791 and the MIT-IBM Watson AI Lab.

References

  • Baktashmotlagh et al. (2013) Baktashmotlagh, M., Harandi, M.T., Lovell, B.C. & Salzmann, M. (2013). Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, 769–776.
  • Bareinboim & Pearl (2013) Bareinboim, E. & Pearl, J. (2013). A general algorithm for deciding transportability of experimental results. Journal of causal Inference, 1, 107–134.
  • Ben-David & Urner (2012) Ben-David, S. & Urner, R. (2012). On the hardness of domain adaptation and the utility of unlabeled target samples. In International Conference on Algorithmic Learning Theory, 139–153, Springer.
  • Ben-David & Urner (2014) Ben-David, S. & Urner, R. (2014). Domain adaptation – can quantity compensate for quality? Annals of Mathematics and Artificial Intelligence, 70, 185–202.
  • Ben-David et al. (2007) Ben-David, S., Blitzer, J., Crammer, K. & Pereira, F. (2007). Analysis of representations for domain adaptation. In Advances in neural information processing systems, 137–144.
  • Ben-David et al. (2010a) Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F. & Vaughan, J.W. (2010a). A theory of learning from different domains. Machine learning, 79, 151–175.
  • Ben-David et al. (2010b) Ben-David, S., Lu, T., Luu, T. & Pál, D. (2010b). Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics, 129–136.
  • Berisha et al. (2016) Berisha, V., Wisler, A., Hero, A.O. & Spanias, A. (2016). Empirically estimable classification bounds based on a nonparametric divergence measure. IEEE Transactions on Signal Processing, 64, 580–591.
  • Bousmalis et al. (2016) Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D. & Erhan, D. (2016). Domain separation networks. In Advances in Neural Information Processing Systems, 343–351.
  • Comminges et al. (2012) Comminges, L., Dalalyan, A.S. et al. (2012). Tight conditions for consistency of variable selection in the context of high dimensionality. The Annals of Statistics, 40, 2667–2696.
  • Cortes & Mohri (2011) Cortes, C. & Mohri, M. (2011). Domain adaptation in regression. In International Conference on Algorithmic Learning Theory, 308–323, Springer.
  • Cortes et al. (2010) Cortes, C., Mansour, Y. & Mohri, M. (2010). Learning bounds for importance weighting. In Advances in neural information processing systems, 442–450.
  • Courty et al. (2017) Courty, N., Flamary, R., Habrard, A. & Rakotomamonjy, A. (2017). Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, 3733–3742.
  • D’Amour et al. (2017) D’Amour, A., Ding, P., Feller, A., Lei, L. & Sekhon, J. (2017). Overlap in observational studies with high-dimensional covariates. arXiv preprint arXiv:1711.02582.
  • Ganin & Lempitsky (2015) Ganin, Y. & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, 1180–1189.
  • Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M. & Lempitsky, V. (2016). Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17, 2096–2030.
  • Gong et al. (2016) Gong, M., Zhang, K., Liu, T., Tao, D., Glymour, C. & Schölkopf, B. (2016). Domain adaptation with conditional transferable components. In International Conference on Machine Learning, 2839–2848.
  • Gretton et al. (2009) Gretton, A., Smola, A.J., Huang, J., Schmittfull, M., Borgwardt, K.M. & Schölkopf, B. (2009). Covariate shift by kernel mean matching.
  • Gretton et al. (2012) Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B. & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13, 723–773.
  • Kelly et al. (2010) Kelly, B.G., Tularak, T., Wagner, A.B. & Viswanath, P. (2010). Universal hypothesis testing in the learning-limited regime. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, 1478–1482, IEEE.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
  • Lipton et al. (2018) Lipton, Z.C., Wang, Y.X. & Smola, A. (2018). Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916.
  • Long et al. (2015) Long, M., Cao, Y., Wang, J. & Jordan, M.I. (2015). Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791.
  • Long et al. (2016) Long, M., Zhu, H., Wang, J. & Jordan, M.I. (2016). Deep transfer learning with joint adaptation networks. arXiv preprint arXiv:1605.06636.
  • Mansour et al. (2009) Mansour, Y., Mohri, M. & Rostamizadeh, A. (2009). Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430.
  • Muandet et al. (2013) Muandet, K., Balduzzi, D. & Schölkopf, B. (2013). Domain generalization via invariant feature representation. In International Conference on Machine Learning, 10–18.
  • Pan et al. (2010) Pan, S.J., Yang, Q. et al. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22, 1345–1359.
  • Pan et al. (2011) Pan, S.J., Tsang, I.W., Kwok, J.T. & Yang, Q. (2011). Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22, 199–210.
  • Pearl (2009) Pearl, J. (2009). Causality. Cambridge university press.
  • Shalit et al. (2016) Shalit, U., Johansson, F. & Sontag, D. (2016). Estimating individual treatment effect: generalization bounds and algorithms. arXiv preprint arXiv:1606.03976.
  • Shimodaira (2000) Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90, 227–244.
  • Si et al. (2010) Si, S., Tao, D. & Geng, B. (2010). Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22, 929.
  • Tzeng et al. (2017) Tzeng, E., Hoffman, J., Saenko, K. & Darrell, T. (2017). Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), vol. 1, 4.
  • Zhang et al. (2013) Zhang, K., Schölkopf, B., Muandet, K. & Wang, Z. (2013). Domain adaptation under target and conditional shift. In International Conference on Machine Learning, 819–827.

Appendix A Proofs

A.1 Proof of bounds for support sufficiency divergence

Lemma 2.

The support sufficiency divergence is bounded, $0 \le D_\epsilon(p_S \,\|\, p_T) \le 1$, and the bounds are tight.

Proof.

The lower bound holds and is tight because

which is clearly non-negative. Moreover, for , . The upper bound holds trivially as . For tightness, let be discrete densities over two states, and . Then with , . ∎

Recall that

(14)

A.2 Proof of Lemma 1

Lemma 3.

Let be densities over . Further, define . Then,

Proof.

We have,

Further, implies equality when . ∎

A.3 Proof of Theorem 2

Lemma 4.

Assume that . Define and let . Then,

Proof.

Lemma 5.
Proof.

By Lemma A.3

We have that

and by the same argument,

and as a result,

Theorem 2 (Restated).

Consider any feature representation with and prediction function , and define . Further, let and be the two distributions induced by the representation applied to distributed according to . Further, assume that for any hypothesis and a loss function , . Now, with , we have the following result.