On Learning Invariant Representation for Domain Adaptation

01/27/2019 ∙ by Han Zhao, et al. ∙ Carnegie Mellon University ∙ Microsoft

Due to the ability of deep neural nets to learn rich representations, recent advances in unsupervised domain adaptation have focused on learning domain-invariant features that achieve a small error on the source domain. The hope is that the learnt representation, together with the hypothesis learnt from the source domain, can generalize to the target domain. In this paper, we first construct a simple counterexample showing that, contrary to common belief, the above conditions are not sufficient to guarantee successful domain adaptation. In particular, the counterexample (Fig. 1) exhibits conditional shift: the class-conditional distributions of input features change between source and target domains. To give a sufficient condition for domain adaptation, we propose a natural and interpretable generalization upper bound that explicitly takes into account the aforementioned shift. Moreover, we shed new light on the problem by proving an information-theoretic lower bound on the joint error of any domain adaptation method that attempts to learn invariant representations. Our result characterizes a fundamental tradeoff between learning invariant representations and achieving small joint error on both domains when the marginal label distributions differ from source to target. Finally, we conduct experiments on real-world datasets that corroborate our theoretical findings. We believe these insights are helpful in guiding the future design of domain adaptation and representation learning algorithms.


1 Introduction

The recent successes of supervised deep learning methods have been partially attributed to rich datasets and increasing computational power. However, in many critical applications, e.g., self-driving cars or personal healthcare, it is often prohibitively expensive and time-consuming to collect large-scale supervised training data. Unsupervised domain adaptation (DA) focuses on such limitations by trying to transfer knowledge from a labeled source domain to an unlabeled target domain, and a large body of work tries to achieve this by exploring domain-invariant structures and representations to bridge the gap. Theoretical results (Ben-David et al., 2010; Mansour et al., 2009a; Mansour and Schain, 2012) and algorithms (Glorot et al., 2011; Becker et al., 2013; Ajakan et al., 2014; Adel et al., 2017; Pei et al., 2018) under this setting are abundant.

Due to the ability of deep neural nets to learn rich feature representations, recent advances in domain adaptation have focused on using these networks to learn invariant representations, i.e., intermediate features whose distribution is the same in source and target domains, while at the same time achieving small error on the source domain. The hope is that the learnt intermediate representation, together with the hypothesis learnt using labeled data from the source domain, can generalize to the target domain. Nevertheless, from a theoretical standpoint, it is not at all clear whether aligned representations and small source error are sufficient to guarantee good generalization on the target domain. In fact, despite being successfully applied in various applications (Zhang et al., 2017; Hoffman et al., 2017), it has also been reported that such methods fail to generalize in certain closely related source/target pairs, e.g., digit classification from MNIST to SVHN (Ganin et al., 2016).

Figure 1: A counterexample where invariant representations lead to large joint error on source and target domains. Before the transformation g, an interval hypothesis achieves perfect classification on both domains. After the transformation, the source and target distributions are perfectly aligned, but no hypothesis can achieve a small joint error.

Given the wide application of domain adaptation methods based on learning invariant representations, we attempt in this paper to answer the following important and intriguing question: Is finding invariant representations while at the same time achieving a small source error sufficient to guarantee a small target error? If not, under what conditions is it? Contrary to common belief, we give a negative answer to the above question by constructing a simple example showing that these two conditions are not sufficient to guarantee target generalization, even in the case of perfectly aligned representations between the source and target domains. In fact, our example shows that the objective of learning invariant representations while minimizing the source error can actually be hurtful, in the sense that the better the objective, the larger the target error. At a colloquial level, this happens because learning invariant representations can break the originally favorable underlying problem structure, i.e., close labeling functions and conditional distributions. To understand when such methods work, we propose a generalization upper bound as a sufficient condition that explicitly takes into account the conditional shift between source and target domains. The proposed upper bound admits a natural interpretation and decomposition in domain adaptation; we show that it is tighter than existing results in certain cases.

Simultaneously, to understand the necessary conditions for representation-based approaches to work, we prove an information-theoretic lower bound on the joint error of both domains for any algorithm based on learning invariant representations. Our result complements the above upper bound and also extends the constructed example to more general settings. The lower bound sheds new light on this problem by characterizing a fundamental tradeoff between learning invariant representations and achieving small joint error on both domains when the marginal label distributions differ from source to target. Our lower bound directly implies that minimizing source error while achieving invariant representation will only increase the target error. We conduct experiments on real-world datasets that corroborate this theoretical implication. Together with the generalization upper bound, our results suggest that adaptation should be designed to align the label distribution as well when learning an invariant representation (c.f. Sec. 4.3). We believe these insights will be helpful to guide the future design of domain adaptation and representation learning algorithms.

2 Preliminary

We first introduce the notations used throughout this paper and review a theoretical model for domain adaptation (DA) (Kifer et al., 2004; Ben-David et al., 2007; Blitzer et al., 2008; Ben-David et al., 2010).

Notations  We use 𝒳 and 𝒴 to denote the input and output space, respectively. Similarly, 𝒵 stands for the representation space induced from 𝒳 by a feature transformation g : 𝒳 → 𝒵. Accordingly, we use X, Y, Z to denote the random variables which take values in 𝒳, 𝒴, 𝒵, respectively. In this work, a domain corresponds to a pair ⟨D, f⟩ of a distribution D on the input space 𝒳 and a labeling function f : 𝒳 → [0, 1]. In the domain adaptation setting, we use ⟨D_S, f_S⟩ and ⟨D_T, f_T⟩ to denote the source and target domains, respectively. A hypothesis is a function h : 𝒳 → {0, 1}. The error of a hypothesis h w.r.t. the labeling function f under distribution D is defined as ε_D(h, f) := E_{x∼D}[|h(x) − f(x)|]. When f and h are binary classification functions, this definition reduces to the probability that h disagrees with f under D: E_{x∼D}[|h(x) − f(x)|] = Pr_{x∼D}(h(x) ≠ f(x)). In this work, we focus on the deterministic setting where the output Y = f(X) is given by a deterministic labeling function defined on the corresponding domain. For two functions g and h with compatible domains and ranges, we use h ∘ g to denote the function composition h(g(·)). Other notations will be introduced in context when necessary.
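To make the definition concrete, the following numpy sketch (ours, purely illustrative; the particular h, f, and distribution are made up) estimates ε_D(h, f) from a finite sample drawn from D.

```python
import numpy as np

def empirical_error(h, f, xs):
    """Monte Carlo estimate of eps_D(h, f) = E_{x~D}[|h(x) - f(x)|]
    from a sample xs drawn i.i.d. from D."""
    return np.mean(np.abs(h(xs) - f(xs)))

# Example: D = U(0, 1), labeling function f(x) = 1 iff x > 0.3,
# hypothesis h(x) = 1 iff x > 0.5; the true error is 0.2.
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=100_000)
f = lambda x: (x > 0.3).astype(float)
h = lambda x: (x > 0.5).astype(float)
print(empirical_error(h, f, xs))  # ~0.2
```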

2.1 Problem Setup

We consider the unsupervised domain adaptation problem where the learning algorithm has access to a set of n labeled points {(x_i, y_i)}_{i=1}^n sampled i.i.d. from the source domain and a set of m unlabeled points {x_j}_{j=1}^m sampled i.i.d. from the target domain. At a colloquial level, the goal of an unsupervised domain adaptation algorithm is to generalize well on the target domain by learning from labeled samples from the source domain as well as unlabeled samples from the target domain. Formally, let the risk of hypothesis h be the error of h w.r.t. the true labeling function under domain D_S, i.e., ε_S(h) := ε_S(h, f_S). As commonly used in computational learning theory, we denote by ε̂_S(h) the empirical risk of h on the source domain. Similarly, we use ε_T(h) and ε̂_T(h) to mean the true risk and the empirical risk on the target domain. The problem of domain adaptation considered in this work can be stated as: under what conditions and by what algorithms can we guarantee that a small training error ε̂_S(h) implies a small test error ε_T(h)? Clearly, this goal is not always achievable if the source and target domains are far away from each other.

2.2 A Theoretical Model for Domain Adaptation

To measure the similarity between two domains, it is crucial to define a discrepancy measure between them. To this end, Ben-David et al. (2010) proposed the H-divergence to measure the distance between two distributions:

Definition 2.1 (H-divergence).

Let H be a hypothesis class on input space 𝒳, and let A_H be the collection of subsets of 𝒳 that are the support of some hypothesis in H, i.e., A_H := {h^{−1}(1) | h ∈ H}. The distance between two distributions D and D′ based on H is defined as d_H(D, D′) := sup_{A ∈ A_H} |Pr_D(A) − Pr_{D′}(A)|.¹ (¹To be precise, Ben-David et al. (2007)'s original definition of the H-divergence has a factor of 2; we choose the current definition as the constant factor is inessential.)
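As an illustration (ours, not from the paper), d_H can be computed exactly for simple hypothesis classes. The sketch below evaluates it between two empirical samples when H is the class of one-dimensional threshold functions h_a(x) = 1[x ≥ a], so that A_H consists of half-lines.

```python
import numpy as np

def h_divergence_thresholds(xs_p, xs_q):
    """d_H(P_hat, Q_hat) for H = {x -> 1[x >= a] : a in R}: the largest
    difference in empirical mass that any half-line [a, inf) assigns
    to the two samples."""
    # The empirical masses only change at sample points, so it suffices
    # to check those as candidate thresholds.
    cands = np.concatenate([xs_p, xs_q])
    best = 0.0
    for a in cands:
        diff = abs(np.mean(xs_p >= a) - np.mean(xs_q >= a))
        best = max(best, diff)
    return best

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, 2000)   # sample from P
q = rng.normal(1.0, 1.0, 2000)   # sample from Q, shifted mean
print(h_divergence_thresholds(p, q))  # ~0.38, i.e. sup_a |Phi(a) - Phi(a - 1)|
```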

The H-divergence is particularly favorable in the analysis of domain adaptation with binary classification problems, and it has also been generalized to the discrepancy distance (Cortes et al., 2008; Mansour et al., 2009a, b; Cortes and Mohri, 2014) for general loss functions, including those for regression problems. Both the H-divergence and the discrepancy distance can be estimated using finite unlabeled samples from both domains when H has a finite VC-dimension.

One flexibility of the H-divergence is that its power in measuring the distance between two distributions can be controlled by the richness of the hypothesis class H. To see this, first consider the situation where H is so restrictive that it only contains the constant functions h ≡ 0 and h ≡ 1. In this case A_H = {∅, 𝒳}, and it can be readily verified from the definition that d_H(D, D′) = 0 for any pair of distributions D, D′. On the other extreme, if H contains all the measurable binary functions, then d_H(D, D′) = 0 iff D = D′ almost surely. In this case the H-divergence reduces to the total variation distance (equivalently, half the L1 distance) between the two distributions.

Given a hypothesis class H, we define its symmetric difference w.r.t. itself as HΔH := {h(x) ⊕ h′(x) | h, h′ ∈ H}, where ⊕ is the xor operation. Let h* be the optimal hypothesis that achieves the minimum joint risk on both the source and target domains, h* := argmin_{h ∈ H} ε_S(h) + ε_T(h), and let λ* denote the joint risk of the optimal hypothesis h*: λ* := ε_S(h*) + ε_T(h*). Ben-David et al. (2010) proved the following generalization bound on the target risk in terms of the source risk and the discrepancy between the source and target domains:

Theorem 2.1.

(Ben-David et al., 2010) Let H be a hypothesis space of VC-dimension d, and let D̂_S (resp. D̂_T) be the empirical distribution induced by a sample of size m drawn from D_S (resp. D_T). Then w.p.b. at least 1 − δ, ∀h ∈ H,

ε_T(h) ≤ ε̂_S(h) + d_{HΔH}(D̂_S, D̂_T) + λ* + O(√((d log m + log(1/δ)) / m)).   (1)

The bound depends on λ*, the optimal joint risk that can be achieved by the hypotheses in H. The intuition is the following: if λ* is large, we cannot hope for a successful domain adaptation. Later in Sec. 4.3, we shall get back to this term to show an information-theoretic lower bound on it for any approach based on learning invariant representations.

Theorem 2.1 is the foundation of many recent works on unsupervised domain adaptation via learning invariant representations (Ajakan et al., 2014; Ganin et al., 2016; Zhao et al., 2018b; Pei et al., 2018; Zhao et al., 2018a). It has also inspired various applications of domain adaptation with adversarial learning, e.g., video analysis (Hoffman et al., 2016; Shrivastava et al., 2016; Hoffman et al., 2017; Tzeng et al., 2017), natural language understanding (Zhang et al., 2017; Fu et al., 2017), speech recognition (Zhao et al., 2017; Hosseini-Asl et al., 2018), to name a few.

At a high level, the key idea is to learn a rich and parametrized feature transformation g : 𝒳 → 𝒵 such that the induced source and target distributions (on 𝒵) are close, as measured by the H-divergence. We call g an invariant representation w.r.t. H if d_H(D_S^g, D_T^g) = 0, where D_S^g (resp. D_T^g) is the induced source (resp. target) distribution over 𝒵. At the same time, these algorithms also try to find a new hypothesis h (on the representation space 𝒵) that achieves a small empirical error on the source domain. As a whole, these two procedures correspond to simultaneously finding invariant representations and hypotheses that minimize the first two terms in the generalization upper bound of Theorem 2.1.

3 Related Work

A number of adaptation approaches based on learning invariant representations have been proposed in recent years. Although in this paper we mainly focus on using the -divergence to characterize the discrepancy between two distributions, other distance measures can be used as well, e.g., the maximum mean discrepancy (MMD) (Long et al., 2014, 2015, 2016), the Wasserstein distance (Courty et al., 2017b, a; Shen et al., 2018; Lee and Raginsky, 2018), etc.

Under the theoretical framework of the H-divergence, Ganin et al. (2016) propose a domain adversarial neural network (DANN) to learn domain-invariant features. Adversarial training techniques that aim to build feature representations that are indistinguishable between source and target domains have been proposed in the last few years (Ajakan et al., 2014; Ganin et al., 2016). Specifically, one of the central ideas is to use neural networks, which are powerful function approximators, to approximate the H-divergence between two domains (Kifer et al., 2004; Ben-David et al., 2007, 2010). The overall algorithm can be viewed as a zero-sum two-player game: one network tries to learn feature representations that can fool the other network, whose goal is to distinguish the representations generated on the source domain from those generated on the target domain. In this work, by proving an information-theoretic lower bound on the joint error of these methods, we show that although an invariant representation can be achieved, it does not necessarily translate to good generalization on the target domain, in particular when the label distributions of the two domains differ significantly.
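To make the zero-sum game concrete, here is a minimal PyTorch sketch (ours, with toy placeholder networks and sizes; not the authors' implementation) of the gradient-reversal trick at the heart of DANN: the discriminator d is trained to tell source from target features, while the reversed gradient trains the feature transformation g to make them indistinguishable, and the classifier h keeps the source error small.

```python
import torch
from torch import nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the feature extractor is trained to fool the domain critic."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Toy placeholder networks: g (feature transformation), h (label classifier),
# d (domain discriminator). Sizes are arbitrary for illustration.
g = nn.Sequential(nn.Linear(2, 16), nn.ReLU())
h = nn.Linear(16, 2)
d = nn.Linear(16, 1)

params = list(g.parameters()) + list(h.parameters()) + list(d.parameters())
opt = torch.optim.SGD(params, lr=0.1)
cls_loss, dom_loss = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()

def dann_step(x_s, y_s, x_t, lambd=1.0):
    z_s, z_t = g(x_s), g(x_t)
    # (i) keep the source error of h o g small
    loss = cls_loss(h(z_s), y_s)
    # (ii) adversarial term: d tries to tell source (0) from target (1) features,
    # while the reversed gradient pushes g to make them indistinguishable
    z = torch.cat([grad_reverse(z_s, lambd), grad_reverse(z_t, lambd)])
    domains = torch.cat([torch.zeros(len(x_s), 1), torch.ones(len(x_t), 1)])
    loss = loss + dom_loss(d(z), domains)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x_s, y_s = torch.randn(32, 2), torch.randint(0, 2, (32,))
x_t = torch.randn(32, 2) + 1.0   # shifted "target" inputs
print(dann_step(x_s, y_s, x_t))
```

In practice the coefficient lambd is typically annealed from 0 to its final value during training; the sketch keeps it fixed for brevity.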

(Anti-)causal approaches based on conditional and label shift for domain adaptation also exist (Zhang et al., 2013; Gong et al., 2016; Lipton et al., 2018; Azizzadenesheli et al., 2018). One typical assumption made to simplify the analysis in this line of work is that the source and target domains share the same generative distribution D_{X|Y} and only differ in the marginal label distributions D_Y. Note that this contrasts with the classic covariate shift assumption, which instead assumes that the source and target domains share the same conditional distribution D_{Y|X} and differ in the marginal data distribution D_X. It is worth noting that Zhang et al. (2013) showed that conditional shift can be successfully corrected when the changes in D_{X|Y} follow certain parametric families. In this work we focus on representation learning and do not make such explicit assumptions.

4 Theoretical Analysis

Is finding invariant representations alone a sufficient condition for the success of domain adaptation? Clearly it is not. Consider the following simple counterexample: let g ≡ c be a constant function, where c ∈ 𝒵. Then for any discrepancy distance d(·, ·) over two distributions, including the H-divergence, MMD, and the Wasserstein distance, and for any distributions D_S, D_T over the input space 𝒳, we have d(D_S^g, D_T^g) = 0, where we use D_S^g (resp. D_T^g) to mean the source (resp. target) distribution induced by the transformation g over the representation space 𝒵. Furthermore, it is fairly easy to construct source and target domains ⟨D_S, f_S⟩, ⟨D_T, f_T⟩ such that for any hypothesis h : 𝒵 → {0, 1}, the target error ε_T(h ∘ g) is at least 1/2, while there exists a classification function that achieves small error, e.g., the labeling function f_T.
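A quick numerical illustration of this point (ours, with a made-up target domain): under a constant feature map, every distributional distance on the representations is zero, yet any predictor built on top of them is constant and cannot beat chance on a balanced target domain.

```python
import numpy as np

rng = np.random.default_rng(0)
x_t = rng.uniform(0.0, 1.0, 10_000)        # target inputs
y_t = (x_t > 0.5).astype(int)              # target labels: balanced
g = lambda x: np.zeros_like(x)             # constant feature map g(x) = c = 0
z_t = g(x_t)                               # all representations identical

# Every hypothesis h on the representation space is constant on g(X), so its
# target error is at best the better of the two constant predictions:
err_const_0 = np.mean(y_t != 0)
err_const_1 = np.mean(y_t != 1)
print(min(err_const_0, err_const_1))       # ~0.5, while f_T itself has error 0
```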

One may argue, with good reason, that in the counterexample above, the empirical source error is also large with high probability. Intuitively, this is because the simple constant transformation function fails to retain the discriminative information about the classification task at hand, despite the fact that it can construct invariant representations.

Is finding invariant representations and achieving a small source error sufficient to guarantee a small target error? In this section we first give a negative answer to this question by constructing a counterexample where there exist a nontrivial transformation function g : 𝒳 → 𝒵 and a hypothesis h : 𝒵 → {0, 1} such that both ε_S(h ∘ g) and d(D_S^g, D_T^g) are small, while at the same time the target error ε_T(h ∘ g) is large. Motivated by this negative result, we proceed to prove a generalization upper bound that explicitly characterizes a sufficient condition for the success of domain adaptation. We then complement the upper bound by showing an information-theoretic lower bound on the joint error of any domain adaptation approach based on learning invariant representations.

4.1 Invariant Representation and Small Source Risk are Not Sufficient

In this section, we shall construct a simple 1-dimensional example where there exists a function that achieves zero error on both source and target domains. Simultaneously, we show that there exists a continuous transformation function under which the induced source and target distributions are perfectly aligned, but every hypothesis incurs a large joint error on the induced source and target domains. The latter further implies that if we find a hypothesis that achieves small error on the source domain, then it has to incur a large error on the target domain. We illustrate this example in Fig. 1.

Let 𝒳 = ℝ and 𝒴 = {0, 1}. For a < b, we use U(a, b) to denote the uniform distribution over [a, b]. Consider the following source and target domains:

D_S = U(−1, 0),  f_S(x) = 1 iff x ∈ (−1/2, 0],
D_T = U(1, 2),  f_T(x) = 1 iff x ∈ [1, 3/2).

In the above example, it is easy to verify that the interval hypothesis h*(x) = 1 iff x ∈ (−1/2, 3/2) achieves perfect classification on both domains.

Now consider the following continuous transformation:

g(x) := x + 1 if x ≤ 0,  g(x) := 1 − x if 0 < x < 1,  g(x) := x − 1 if x ≥ 1.

Since g is a piecewise linear function, it follows that D_S^g = D_T^g = U(0, 1), and for any distance metric d(·, ·) over distributions, we have d(D_S^g, D_T^g) = 0. But now for any hypothesis h : [0, 1] → {0, 1} and any z ∈ [0, 1], h(z) will make an error in exactly one of the two domains, hence

ε_S(h ∘ g) + ε_T(h ∘ g) = 1.

In other words, under the above invariant transformation g, the smaller the source error, the larger the target error.

One may argue that this example seems to contradict the generalization upper bound from Theorem 2.1, where the first two terms correspond exactly to a small source error and an invariant representation. The key to explain this apparent contradiction lies in the third term of the upper bound, λ*, i.e., the optimal joint error achievable on both domains. In our example, when there is no transformation applied to the input space, we show that h* achieves 0 error on both domains, hence λ* = 0. However, when the transformation g is applied to the original input space, we prove that every hypothesis has joint error 1 on the representation space, hence λ* = 1. Since we usually do not have access to the optimal hypothesis on both domains, although the generalization bound still holds on the representation space, it becomes vacuous in our example.

An alternative way to interpret the failure of the constructed example is that the labeling functions (or conditional distributions in the stochastic setting) of the source and target domains are far away from each other in the representation space. Specifically, in the induced representation space, the optimal labeling functions of the source and target domains are

f_S^g(z) = 1 iff z ∈ (1/2, 1],  f_T^g(z) = 1 iff z ∈ [0, 1/2),

and we have f_S^g(z) ≠ f_T^g(z) for almost every z ∈ [0, 1].
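The following numpy sketch (ours) checks the construction above numerically: after applying g, both induced distributions coincide with U(0, 1), while the source and target errors of any threshold hypothesis on the representation space sum to (approximately) one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x_s = rng.uniform(-1.0, 0.0, n)                 # D_S = U(-1, 0)
x_t = rng.uniform(1.0, 2.0, n)                  # D_T = U(1, 2)
f_s = (x_s > -0.5).astype(int)                  # f_S(x) = 1 iff x in (-1/2, 0]
f_t = (x_t < 1.5).astype(int)                   # f_T(x) = 1 iff x in [1, 3/2)

def g(x):                                       # the piecewise linear transformation
    return np.where(x <= 0, x + 1, np.where(x >= 1, x - 1, 1 - x))

z_s, z_t = g(x_s), g(x_t)                       # both ~ U(0, 1): perfectly aligned
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.abs(np.quantile(z_s, qs) - np.quantile(z_t, qs)).max())  # ~0

# Joint error of any threshold hypothesis h_a(z) = 1[z >= a] is ~1 for every a.
for a in [0.0, 0.25, 0.5, 0.75, 1.0]:
    h_s, h_t = (z_s >= a).astype(int), (z_t >= a).astype(int)
    print(a, np.mean(h_s != f_s) + np.mean(h_t != f_t))            # ~1.0 each
```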

4.2 A Generalization Upper Bound

For most practical hypothesis spaces H, e.g., half spaces, it is usually intractable to compute the optimal joint error λ* from Theorem 2.1. Furthermore, the fact that λ* contains errors from both domains makes the bound very conservative and loose in many cases. In this section, inspired by the constructed example from Sec. 4.1, we aim to provide a general, intuitive, and interpretable generalization upper bound for domain adaptation that is free of the pessimistic λ* term. Ideally, the bound should also explicitly characterize how the shift between the labeling functions of the two domains affects domain adaptation. Due to space constraints, we refer the interested reader to the Appendix for the proofs of our technical lemmas, and mainly focus in the following on interpretations and results.

Because of its flexibility in choosing the witness function class and its natural interpretation as adversarial binary classification, we still adopt the H-divergence to measure the discrepancy between two distributions. For any hypothesis space H, it can be readily verified that d_H(·, ·) satisfies the triangle inequality

d_H(D, D′) ≤ d_H(D, D″) + d_H(D″, D′),

where D, D′, D″ are any distributions over the same space. We now introduce a technical lemma that will be helpful in proving results related to the H-divergence:

Lemma 4.1.

Let D and D′ be two distributions over 𝒳. Then for any h, h′ ∈ H, |ε_D(h, h′) − ε_{D′}(h, h′)| ≤ d_{H̃}(D, D′), where H̃ := {sgn(|h(x) − h′(x)| − t) | h, h′ ∈ H, t ∈ [0, 1]}.

As a matter of fact, the above lemma also holds for any function class H (not necessarily a hypothesis class) for which there exists a constant M > 0 such that ‖h‖_∞ ≤ M for all h ∈ H. Another useful lemma is the following triangle inequality:

Lemma 4.2.

Let D be any distribution over 𝒳. For any functions f, f′, f″ : 𝒳 → [0, 1], we have ε_D(f, f′) ≤ ε_D(f, f″) + ε_D(f″, f′).

Let f_S : 𝒳 → [0, 1] and f_T : 𝒳 → [0, 1] be the optimal labeling functions on the source and target domains, respectively. In the stochastic setting, f_S(x) = Pr_{D_S}(y = 1 | x) corresponds to the optimal Bayes classifier. With these notations, the following theorem holds:

Theorem 4.1.

Let ⟨D_S, f_S⟩ and ⟨D_T, f_T⟩ be the source and target domains, respectively. For any function class H ⊆ [0, 1]^𝒳 and ∀h ∈ H, the following inequality holds:

ε_T(h) ≤ ε_S(h) + d_{H̃}(D_S, D_T) + min{E_{D_S}[|f_S − f_T|], E_{D_T}[|f_S − f_T|]}.

Remark  The three terms in the upper bound have natural interpretations: the first term is the source error, the second one corresponds to the discrepancy between the marginal distributions, and the third measures the distance between the labeling functions from the source and target domains. Altogether, they form a sufficient condition for the success of domain adaptation: besides a small source error, not only do the marginal distributions need to be close, but so do the labeling functions.

Comparison with Theorem 2.1. It is instructive to compare the bound in Theorem 4.1 with the one in Theorem 2.1. The main difference lies in the λ* term in Theorem 2.1 and the min{E_{D_S}[|f_S − f_T|], E_{D_T}[|f_S − f_T|]} term in Theorem 4.1. λ* depends on the choice of the hypothesis class H, while our term does not. In fact, our quantity reflects the underlying structure of the problem, i.e., the conditional shift. Finally, consider the example given in the left panel of Fig. 1: for a natural class of hypotheses, e.g., threshold functions, one can verify that our bound is tighter than the one in Theorem 2.1 in that case.

In the covariate shift setting, where we assume the conditional distributions of Y | X in the source and target domains are the same, the third term in the upper bound vanishes. In that case the above theorem says that, to guarantee successful domain adaptation, it suffices to match the marginal distributions while achieving a small error on the source domain. In general settings where the optimal labeling functions of the source and target domains differ, the above bound says that it is not sufficient to simply match the marginal distributions and achieve a small error on the source domain. At the same time, we should also guarantee that the optimal labeling functions (or the conditional distributions of both domains) are not too far away from each other. As a side note, it is easy to see that E_{D_S}[|f_S − f_T|] = ε_S(f_T, f_S) and E_{D_T}[|f_S − f_T|] = ε_T(f_S, f_T). In other words, these are essentially the cross-domain errors. When the cross-domain error is small, it implies that the optimal source (resp. target) labeling function generalizes well on the target (resp. source) domain.

Both the error term ε_S(h) and the divergence d_{H̃}(D_S, D_T) in Theorem 4.1 are with respect to the true underlying distributions D_S and D_T, which are not available to us during training. In the following, we shall use the empirical Rademacher complexity to provide for both terms a data-dependent bound computed from empirical samples drawn from D_S and D_T.

Definition 4.1 (Empirical Rademacher Complexity).

Let H be a family of functions mapping from 𝒳 to [a, b] and S = {x_1, …, x_n} a fixed sample of size n with elements in 𝒳. Then, the empirical Rademacher complexity of H with respect to the sample S is defined as

Rad_S(H) := E_σ[ sup_{h ∈ H} (1/n) Σ_{i=1}^n σ_i h(x_i) ],

where σ = (σ_1, …, σ_n) and the σ_i are i.i.d. uniform random variables taking values in {+1, −1}.
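As an illustration (ours, not part of the paper), the empirical Rademacher complexity of a simple class can be approximated by Monte Carlo over the random signs; for one-dimensional threshold functions the inner supremum reduces to a maximum over suffix sums of the sorted sample.

```python
import numpy as np

def rademacher_thresholds(xs, n_draws=2000, seed=0):
    """Monte Carlo estimate of Rad_S(H) for H = {x -> 1[x >= a]}:
    E_sigma[ sup_h (1/n) sum_i sigma_i h(x_i) ]."""
    rng = np.random.default_rng(seed)
    n = len(xs)
    order = np.argsort(xs)               # {x_i >= a} is a suffix of the sorted sample
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        s = sigma[order]
        # sup over thresholds a of sum_i sigma_i * 1[x_i >= a] = best suffix sum
        suffix = np.concatenate([np.cumsum(s[::-1])[::-1], [0.0]])
        vals.append(suffix.max() / n)
    return float(np.mean(vals))

xs = np.random.default_rng(1).uniform(0, 1, 200)
print(rademacher_thresholds(xs))         # ~0.05: thresholds form a very simple class
```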

With the empirical Rademacher complexity, we can show that, w.h.p., the empirical source error cannot be too far away from the population error for all h ∈ H:

Lemma 4.3.

Let H ⊆ [0, 1]^𝒳. Then for all δ > 0, w.p.b. at least 1 − δ, the following inequality holds for all h ∈ H: ε(h) ≤ ε̂(h) + 2 Rad_S(H) + 3√(log(2/δ) / (2n)).

Similarly, for any distribution D over 𝒳, let D̂ be its empirical distribution constructed from a sample S of size n drawn i.i.d. from D. Then for the two distributions D and D̂, we can also use the empirical Rademacher complexity to provide a data-dependent bound for the perturbation between Pr_D(h = 1) and Pr_{D̂}(h = 1):

Lemma 4.4.

Let D, D̂, and H be defined as above. Then for all δ > 0, w.p.b. at least 1 − δ, the following inequality holds for all h ∈ H: |Pr_D(h = 1) − Pr_{D̂}(h = 1)| ≤ 2 Rad_S(H) + 3√(log(2/δ) / (2n)).

Since H is a hypothesis class, i.e., every h ∈ H is binary valued, by definition we have:

d_H(D, D̂) = sup_{A ∈ A_H} |Pr_D(A) − Pr_{D̂}(A)| = sup_{h ∈ H} |Pr_D(h = 1) − Pr_{D̂}(h = 1)|.

Hence combining the above identity with Lemma 4.4, we immediately have, w.p.b. at least 1 − δ:

d_H(D, D̂) ≤ 2 Rad_S(H) + 3√(log(2/δ) / (2n)).   (2)

Now, using a union bound and the fact that d_H(·, ·) satisfies the triangle inequality, we have:

Lemma 4.5.

Let D_S, D_T, D̂_S, D̂_T, and H be defined as above. Then for δ > 0, w.p.b. at least 1 − δ:

d_H(D_S, D_T) ≤ d_H(D̂_S, D̂_T) + 2 Rad_S(H) + 2 Rad_T(H) + 6√(log(4/δ) / (2n)),

where S and T denote the samples of size n drawn from the source and target domains, respectively.

Combining Lemma 4.3, Lemma 4.5, and Theorem 4.1 with a union bound argument, we get the following main theorem that characterizes an upper bound for domain adaptation:

Theorem 4.2.

Let ⟨D_S, f_S⟩ and ⟨D_T, f_T⟩ be the source and target domains, and let D̂_S, D̂_T be the empirical source and target distributions constructed from samples S and T, each of size n. Then for any function class H ⊆ [0, 1]^𝒳, ∀h ∈ H, w.p.b. at least 1 − δ:

ε_T(h) ≤ ε̂_S(h) + d_{H̃}(D̂_S, D̂_T) + min{E_{D_S}[|f_S − f_T|], E_{D_T}[|f_S − f_T|]} + O(Rad_S(H) + Rad_S(H̃) + Rad_T(H̃) + √(log(1/δ) / n)),

where H̃ is the auxiliary function class defined in Lemma 4.1.

Essentially, the generalization upper bound can be decomposed into three parts: the first part comes from the domain adaptation setting, including the empirical source error, the empirical H̃-divergence, and the shift between labeling functions. The second part corresponds to complexity measures of our hypothesis space H and the auxiliary class H̃, and the last part describes the error caused by finite samples.

4.3 An Information-Theoretic Lower Bound

In Sec. 4.1, we constructed an example to demonstrate that learning invariant representations could lead to a feature space where the joint error on both domains is large. In this section, we extend the example by showing that a similar result holds in more general settings. Specifically, we shall prove that for any approach based on learning invariant representations, there is an intrinsic lower bound on the joint error of source and target domains, due to the discrepancy between their marginal label distributions. Our result hence highlights the need to take into account task related information when designing domain adaptation algorithms based on learning invariant representations.

Before we proceed to the lower bound, we first define several information-theoretic concepts that will be used in the analysis. For two distributions D and D′, the Jensen–Shannon (JS) divergence D_JS(D ‖ D′) is defined as:

D_JS(D ‖ D′) := (1/2) D_KL(D ‖ D_M) + (1/2) D_KL(D′ ‖ D_M),

where D_KL(· ‖ ·) is the Kullback–Leibler (KL) divergence and D_M := (D + D′)/2. The JS divergence can be viewed as a symmetrized and smoothed version of the KL divergence, and it is closely related to the L1 distance (total variation) between two distributions through Lemma B.3 in the Appendix.

Unlike the KL divergence, the JS divergence is bounded: 0 ≤ D_JS(D ‖ D′) ≤ 1 (with the base-2 logarithm). Additionally, from the JS divergence, we can define a distance metric between two distributions as well, known as the JS distance (Endres and Schindelin, 2003):

d_JS(D, D′) := √(D_JS(D ‖ D′)).
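For concreteness, the sketch below (ours) computes the JS divergence and JS distance for two discrete label distributions, using base-2 logarithms so that D_JS lies in [0, 1]; the two distributions are hypothetical.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_distance(p, q):
    return np.sqrt(js_divergence(p, q))

# Two hypothetical marginal label distributions over 10 classes.
p = np.full(10, 0.1)                        # uniform
q = np.array([0.19, 0.13, 0.11, 0.10, 0.09, 0.09, 0.08, 0.08, 0.07, 0.06])
print(js_divergence(p, q), js_distance(p, q))
```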

With respect to the JS distance, for any deterministic mapping h, we can prove the following lemma via the data processing inequality:

Lemma 4.6.

Let D_Z and D′_Z be two distributions over 𝒵 and let D_Y and D′_Y be the induced distributions over 𝒴 by a deterministic function h : 𝒵 → 𝒴. Then

d_JS(D_Y, D′_Y) ≤ d_JS(D_Z, D′_Z).   (3)

For methods that aim to learn invariant representations for domain adaptation, an intermediate representation space 𝒵 is found through a feature transformation g : 𝒳 → 𝒵, based on which a common hypothesis h : 𝒵 → 𝒴 is shared between both domains (Ganin et al., 2016; Tzeng et al., 2017; Zhao et al., 2018b). Through this process, the following Markov chain holds:

X → Z → Ŷ,   (4)

where Z = g(X) and Ŷ = h(Z) is the predicted random variable of interest. Hence for any distribution D over 𝒳, this Markov chain also induces a distribution D_Z over 𝒵 and a distribution D_Ŷ over 𝒴. By Lemma 4.6, we know that d_JS(D_Ŷ^S, D_Ŷ^T) ≤ d_JS(D_Z^S, D_Z^T). With these notations, noting that the JS distance is a metric, the following inequality holds:

d_JS(D_Y^S, D_Y^T) ≤ d_JS(D_Y^S, D_Ŷ^S) + d_JS(D_Ŷ^S, D_Ŷ^T) + d_JS(D_Ŷ^T, D_Y^T).

Combining the above inequality with Lemma 4.6, we immediately have:

d_JS(D_Y^S, D_Y^T) ≤ d_JS(D_Y^S, D_Ŷ^S) + d_JS(D_Z^S, D_Z^T) + d_JS(D_Ŷ^T, D_Y^T).   (5)

Intuitively, d_JS(D_Y^S, D_Ŷ^S) and d_JS(D_Y^T, D_Ŷ^T) measure the distance between the predicted label distribution and the ground truth label distribution on the source and target domain, respectively. With the help of Lemma B.3, the following result establishes a relationship between d_JS(D_Y, D_Ŷ) and the accuracy of the prediction function:

Lemma 4.7.

Let Y = f(X) ∈ {0, 1}, where f is the labeling function, and let Ŷ = h(g(X)) ∈ {0, 1} be the prediction function. Then d_JS(D_Y, D_Ŷ) ≤ √(ε(h ∘ g)).

We are now ready to present the main result of the section:

Theorem 4.3.

Suppose the Markov chain X → Z → Ŷ holds. Then

d_JS(D_Y^S, D_Y^T) ≤ d_JS(D_Z^S, D_Z^T) + √(ε_S(h ∘ g)) + √(ε_T(h ∘ g)).

Remark  This theorem shows that if the marginal label distributions are significantly different between the source and target domains, then in order to achieve a small joint error, the induced distributions over 𝒵 from the source and target domains have to be significantly different as well. Put another way, if we are able to find an invariant representation such that d_JS(D_Z^S, D_Z^T) = 0, then the joint error of the composition function h ∘ g has to be large:

Corollary 4.1.

Suppose the condition in Theorem 4.3 holds and d_JS(D_Y^S, D_Y^T) ≥ d_JS(D_Z^S, D_Z^T). Then:

ε_S(h ∘ g) + ε_T(h ∘ g) ≥ (1/2) (d_JS(D_Y^S, D_Y^T) − d_JS(D_Z^S, D_Z^T))².

Remark  The lower bound gives us a necessary condition for the success of any domain adaptation approach based on learning invariant representations: if the marginal label distributions are significantly different between the source and target domains, then minimizing d_JS(D_Z^S, D_Z^T) and the source error ε_S(h ∘ g) will only increase the target error. In particular, if the transformation g is rich enough such that d_JS(D_Z^S, D_Z^T) = 0 and ε_S(h ∘ g) = 0, then ε_T(h ∘ g) ≥ (1/2) d_JS²(D_Y^S, D_Y^T).
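To see how the bound can be evaluated in practice, the sketch below (ours, with hypothetical label distributions rather than the ones measured in Fig. 2) computes the right-hand side of Corollary 4.1 in the extreme case d_JS(D_Z^S, D_Z^T) = 0 and ε_S(h ∘ g) = 0.

```python
import numpy as np

def js_distance(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Hypothetical marginal label distributions for source and target domains.
p_src = np.full(10, 0.1)
p_tgt = np.array([0.07, 0.19, 0.15, 0.12, 0.08, 0.07, 0.08, 0.08, 0.09, 0.07])

# With d_JS(D_Z^S, D_Z^T) = 0 (invariant representation) and eps_S(h o g) = 0,
# the remark above gives eps_T(h o g) >= 0.5 * d_JS(D_Y^S, D_Y^T)^2.
d = js_distance(p_src, p_tgt)
print("lower bound on target error:", 0.5 * d ** 2)
```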

We conclude this section by noting that our bound on the joint error of both domains is not necessarily the tightest one. This can be seen from the example in Sec. 4.1: there the marginal label distributions of the two domains coincide, so d_JS(D_Y^S, D_Y^T) = 0, and we have ε_S(h ∘ g) + ε_T(h ∘ g) = 1, but in this case our result gives a trivial lower bound of 0. Nevertheless, our result still sheds new light on the importance of matching the marginal label distributions when learning invariant representations for domain adaptation, which we believe to be a promising direction for the design of better adaptation algorithms.

5 Experiments

Our theoretical results on the lower bound of the joint error imply that over-training the feature transformation function and the discriminator may hurt generalization on the target domain. In this section, we conduct experiments on real-world datasets to verify our theoretical findings. The task is digit classification on three datasets of 10 classes: MNIST (LeCun et al., 1998), USPS (Dheeru and Karra Taniskidou, 2017) and SVHN (Netzer et al., 2011). MNIST contains 60,000/10,000 train/test instances; USPS contains 7,291/2,007 train/test instances, and SVHN contains 73,257/26,032 train/test instances. We show the label distribution of these three datasets in Fig. 2. Among them, only MNIST has close to uniform label distributions.

Figure 2: The label distributions of MNIST, USPS and SVHN.
(a) USPS → MNIST
(b) USPS → SVHN
(c) SVHN → MNIST
(d) SVHN → USPS
Figure 3: Digit classification on MNIST, USPS and SVHN. The horizontal solid line corresponds to the target domain test accuracy without adaptation. The green solid line is the target domain test accuracy under domain adaptation with DANN. We also plot the least square fit (dashed line) of the DANN adaptation results to emphasize the negative slope.

Before training, we preprocess all the samples into grayscale single-channel images of size 16 × 16, so they can be used by the same network. In our experiments, to ensure a fair comparison, we use the same network structure for all the experiments: 2 convolutional layers, one fully connected hidden layer, followed by a softmax output layer with 10 units. The convolution kernels in both layers are of size 5 × 5, with 10 and 20 channels, respectively. The hidden layer has 1280 units connected to 100 units before classification. For domain adaptation, we use the original DANN (Ganin et al., 2016) with the gradient reversal implementation. The discriminator in DANN takes the output of the convolutional layers as its feature input, followed by a fully connected layer and a one-unit binary classification output. For all the experiments, we use the Adadelta method for optimization.
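For concreteness, here is a minimal PyTorch sketch of the network just described (our reconstruction, not the authors' code; the activation functions and the discriminator's hidden width are assumptions, and the gradient reversal wiring from Sec. 3 is omitted here):

```python
import torch
from torch import nn

class Features(nn.Module):
    """Two 5x5 conv layers (10 and 20 channels) on 16x16 grayscale input:
    16 -> 12 -> 8 spatially, flattened to 20 * 8 * 8 = 1280 features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5), nn.ReLU(),   # 16x16 -> 12x12
            nn.Conv2d(10, 20, kernel_size=5), nn.ReLU(),  # 12x12 -> 8x8
            nn.Flatten(),                                 # 20 * 8 * 8 = 1280
        )
    def forward(self, x):
        return self.net(x)

# Label classifier: 1280 -> 100 hidden units -> 10-way softmax output.
classifier = nn.Sequential(nn.Linear(1280, 100), nn.ReLU(), nn.Linear(100, 10))
# Domain discriminator on the conv features: one FC layer, then a one-unit output.
discriminator = nn.Sequential(nn.Linear(1280, 100), nn.ReLU(), nn.Linear(100, 1))

g = Features()
x = torch.randn(8, 1, 16, 16)                  # a batch of preprocessed digits
z = g(x)
print(z.shape, classifier(z).shape, discriminator(z).shape)
# torch.Size([8, 1280]) torch.Size([8, 10]) torch.Size([8, 1])
optimizer = torch.optim.Adadelta(
    list(g.parameters()) + list(classifier.parameters()) + list(discriminator.parameters()))
```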

We plot four adaptation trajectories in Fig. 3. Among the four adaptation tasks, we can observe two phases in the adaptation accuracy. In the first phase, the test set accuracy rapidly grows, in less than 10 iterations. In the second phase, it gradually decreases after reaching its peak, despite the fact that the source training accuracy keeps increasing smoothly. These phase transitions can be verified from the negative slopes of the least squares fit of the adaptation curves (dashed lines in Fig. 3). We observe similar phenomena in additional experiments using artificially unbalanced datasets trained on more powerful networks in Appendix C. The above experimental results imply that over-training the feature transformation and discriminator does not help generalization on the target domain, but can instead hurt it when the label distributions differ (as shown in Fig. 2). These experimental results are consistent with our theoretical findings.

6 Conclusion and Future Work

In this paper we theoretically and empirically study the important problem of learning invariant representations for domain adaptation. We show that learning an invariant representation and achieving a small source error is not enough to guarantee target generalization. We then prove both upper and lower bounds for the target and joint errors, which directly translate to sufficient and necessary conditions for the success of domain adaptation. We believe our results take an important step towards understanding deep domain adaptation, and also stimulate future work on the design of stronger deep domain adaptation algorithms that align conditional distributions. Another interesting direction for future work is to characterize what properties the feature transformation function should have in order to decrease the conditional shift between source and target domains. It is also worth investigating under which conditions the label distributions can be aligned without explicit labeled data from the target domain.

References

Appendix A Missing Proofs

See Lemma 4.1

Proof.

By definition, for any h, h′ ∈ H, we have:

|ε_D(h, h′) − ε_{D′}(h, h′)| = |E_{x∼D}[|h(x) − h′(x)|] − E_{x∼D′}[|h(x) − h′(x)|]|.   (6)

Since 0 ≤ |h(x) − h′(x)| ≤ 1, for every t ∈ [0, 1] the set {x : |h(x) − h′(x)| > t} is the support of a function in H̃. We now use Fubini's theorem to bound E_{x∼D}[|h(x) − h′(x)|]:

E_{x∼D}[|h(x) − h′(x)|] = ∫_0^1 Pr_D(|h(x) − h′(x)| > t) dt.

Now in view of (6) and the definition of H̃, we have:

|ε_D(h, h′) − ε_{D′}(h, h′)| ≤ ∫_0^1 |Pr_D(|h(x) − h′(x)| > t) − Pr_{D′}(|h(x) − h′(x)| > t)| dt ≤ sup_{t ∈ [0, 1]} |Pr_D(|h(x) − h′(x)| > t) − Pr_{D′}(|h(x) − h′(x)| > t)| ≤ d_{H̃}(D, D′).

Combining all the inequalities above finishes the proof. ∎

See Lemma 4.2

Proof.

The result follows from the triangle inequality for the absolute value: for every x, |f(x) − f′(x)| ≤ |f(x) − f″(x)| + |f″(x) − f′(x)|. Taking expectations over x ∼ D on both sides completes the proof. ∎

See Theorem 4.1

Proof.

On one hand, with Lemma 4.1 and Lemma 4.2, we have:

ε_T(h) = ε_T(h, f_T) ≤ ε_T(h, f_S) + ε_T(f_S, f_T) ≤ ε_S(h, f_S) + d_{H̃}(D_S, D_T) + ε_T(f_S, f_T).

On the other hand, by changing the order of the two triangle inequalities, we also have:

ε_T(h) ≤ ε_S(h, f_S) + d_{H̃}(D_S, D_T) + ε_S(f_S, f_T).

Realize that by definition ε_T(f_S, f_T) = E_{D_T}[|f_S − f_T|] and ε_S(f_S, f_T) = E_{D_S}[|f_S − f_T|]. Combining the above two inequalities completes the proof. ∎

See Lemma 4.3

Proof.

Consider the source domain D_S. For h ∈ H, define the loss function ℓ(h, x) := |h(x) − f_S(x)|, so that ε_S(h) = E_{x∼D_S}[ℓ(h, x)]. First, we apply the standard Rademacher-complexity generalization bound to this family of losses, where we slightly abuse the notation H to mean the family of functions {x ↦ |h(x) − f_S(x)| | h ∈ H}: