Bridging Theory and Algorithm for Domain Adaptation

04/11/2019 · by Yuchen Zhang, et al.

This paper addresses the problem of unsupervised domain adaptation from theoretical and algorithmic perspectives. Existing domain adaptation theories naturally imply minimax optimization algorithms, which connect well with adversarial-learning-based domain adaptation methods. However, several disconnections still form the gap between theory and algorithm. We extend previous theories (Ben-David et al., 2010; Mansour et al., 2009c) to multiclass classification in domain adaptation, where classifiers based on scoring functions and margin loss are standard algorithmic choices. We introduce a novel measurement, margin disparity discrepancy, that is tailored both to distribution comparison with asymmetric margin loss and to minimax optimization for easier training. Using this discrepancy, we derive new generalization bounds in terms of Rademacher complexity. Our theory can be seamlessly transformed into an adversarial learning algorithm for domain adaptation, successfully bridging the gap between theory and algorithm. A series of empirical studies show that our algorithm achieves state-of-the-art accuracies on challenging domain adaptation tasks.


1 Introduction

It is a common assumption in learning theory that training and test data are drawn from an identical distribution. Thus, if the source domain on which a supervised learner is trained differs substantially from the target domain on which the learner is applied, there is little hope of good generalization across domains. Still, we may expect to train a model by leveraging labeled data from similar domains. Domain adaptation is the machine learning task that deals with the setting where the distributions of training and test data are distinct from each other (Quionero-Candela et al., 2009; Pan & Yang, 2010).

Remarkable theoretical advances have been achieved in domain adaptation. Ben-David et al. (2010) and Mansour et al. (2009c) provide rigorous learning bounds for unsupervised domain adaptation, the most challenging scenario in this field. These early theories have since been extended in many ways, from general loss functions to Bayesian settings and regression problems (Mohri & Medina, 2012; Germain et al., 2013; Cortes et al., 2015). In addition, theories based on weighted combinations of hypotheses have been developed for multiple-source domain adaptation (Crammer et al., 2008; Mansour et al., 2009a, b; Hoffman et al., 2018a).

On par with the theoretical findings, there are rich advances in domain adaptation algorithms. Previous work explored various techniques for statistics matching (Pan et al., 2011; Tzeng et al., 2014; Long et al., 2015, 2017) and discrepancy minimization (Ganin & Lempitsky, 2015; Ganin et al., 2016). Among them, adversarial-learning methods come with relatively strong theoretical guarantees. Inspired by Goodfellow et al. (2014), these methods build on a two-player game between a domain discriminator and a feature extractor. Recent works explored adversarial learning in diverse ways, yielding state-of-the-art results on many tasks (Tzeng et al., 2017; Saito et al., 2018; Long et al., 2018).

While many domain adaptation algorithms can be roughly interpreted as minimizing the distribution discrepancy appearing in domain adaptation theories, several disconnections form a non-negligible gap between the theories and algorithms. When designing a domain adaptation algorithm using scoring functions, we risk losing the theoretical guarantee, since there is a gap between the loss functions used in the theories and in the algorithms. Furthermore, there is another gap between the hypothesis-induced discrepancies in theories and the widely used divergences in algorithms, including Jensen-Shannon divergence (Ganin & Lempitsky, 2015), Maximum Mean Discrepancy (Gretton et al., 2012; Long et al., 2015), and Wasserstein distance (Courty et al., 2017).

This work aims to bridge the gaps between the theories and algorithms for domain adaptation. We present a novel theoretical analysis of the classification task in domain adaptation towards explicit guidance for algorithm design. We extend existing theories to classifiers based on scoring functions and margin loss, which are standard choices for real tasks. We define a new divergence, margin disparity discrepancy, and provide margin-aware generalization bounds based on Rademacher complexity. These bounds reveal a trade-off between the generalization error and the choice of margin. Our theory can be seamlessly transformed into an adversarial learning algorithm for domain adaptation. A series of empirical studies show that our algorithm achieves state-of-the-art accuracies on challenging tasks.

2 Preliminaries

In this section we introduce some basic notations and assumptions for classification problems in domain adaptation.

2.1 Learning Setup

In supervised learning setting, the learner receives a sample of

labeled points from , where is an input space and is an output space, which is in binary classification and in multiclass classification. The sample is denoted by if independently drawn according to the distribution .

In unsupervised domain adaptation, there are two different distributions, the source and the target . The learner is trained on a set consisting of a labeled sample drawn from the source distribution and an unlabeled sample drawn from the target distribution.

Following the notations of Mohri et al. (2012), we consider multiclass classification with hypothesis space of scoring functions , where the outputs on each dimension indicate the confidence of prediction. The predicted label associated to point is the one resulting in the largest score . Thus it induces a labeling function space containing from to :

(1)

The (expected) error rate and the empirical error rate of a classifier $h \in \mathcal{H}$ with respect to a distribution $D$ are given by

$$\mathrm{err}_{D}(h) \triangleq \mathbb{E}_{(x, y) \sim D}\, \mathbf{1}\big[h(x) \ne y\big], \qquad \mathrm{err}_{\widehat{D}}(h) \triangleq \mathbb{E}_{(x, y) \sim \widehat{D}}\, \mathbf{1}\big[h(x) \ne y\big], \qquad (2)$$

where $\mathbf{1}[\cdot]$ is the indicator function.

Before further discussion, we assume that the constant classifier is contained in $\mathcal{H}$ and that $\mathcal{F}$ is closed under permutations of $\mathcal{Y}$. For binary classification, this is equivalent to the assumption that for any $f \in \mathcal{F}$, we also have $-f \in \mathcal{F}$.

2.2 Margin Loss

In practice, the margin between data points and the classification surface plays a significant role in achieving strong generalization performance. Thus a margin theory for classification was developed by Koltchinskii et al. (2002), where the 0-1 loss is replaced by the margin loss.

Define the margin of a hypothesis $f$ at a labeled example $(x, y)$ as

$$\rho_f(x, y) \triangleq \tfrac{1}{2}\Big(f(x, y) - \max_{y' \ne y} f(x, y')\Big). \qquad (3)$$

The corresponding margin loss and empirical margin loss of a hypothesis $f$ are

$$\mathrm{err}^{(\rho)}_{D}(f) \triangleq \mathbb{E}_{(x, y) \sim D}\big[\Phi_{\rho} \circ \rho_f(x, y)\big], \qquad \mathrm{err}^{(\rho)}_{\widehat{D}}(f) \triangleq \mathbb{E}_{(x, y) \sim \widehat{D}}\big[\Phi_{\rho} \circ \rho_f(x, y)\big], \qquad (4)$$

where $\circ$ denotes function composition, and $\Phi_{\rho}$ is

$$\Phi_{\rho}(x) \triangleq \begin{cases} 0 & \rho \le x, \\ 1 - x/\rho & 0 \le x \le \rho, \\ 1 & x \le 0. \end{cases} \qquad (5)$$

An important property is that $\mathrm{err}_{D}(h_f) \le \mathrm{err}^{(\rho)}_{D}(f)$ for any $\rho > 0$ and $f \in \mathcal{F}$. Koltchinskii et al. (2002) showed that the margin loss leads to an informative generalization bound for classification. Based on this seminal work, we shall develop margin bounds for classification in domain adaptation.
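For illustration, here is a minimal NumPy sketch of the margin and the $\rho$-margin loss defined above. The function and variable names are ours, and the snippet only serves to make the definitions concrete, assuming the margin is taken with the $\tfrac{1}{2}$ factor as in Eq. (3).

```python
import numpy as np

def margin(scores, y):
    """Margin of a scoring function at a labeled example (x, y):
    (1/2) * (f(x, y) - max_{y' != y} f(x, y'))."""
    scores = np.asarray(scores, dtype=float)
    others = np.delete(scores, y)          # scores of all labels except y
    return 0.5 * (scores[y] - others.max())

def phi_rho(x, rho):
    """rho-margin loss: 1 if x <= 0, 1 - x/rho if 0 <= x <= rho, 0 if x >= rho."""
    return float(np.clip(1.0 - x / rho, 0.0, 1.0))

def margin_loss(scores, y, rho):
    """Phi_rho composed with the margin; upper-bounds the 0-1 loss of h_f."""
    return phi_rho(margin(scores, y), rho)

# Example: three class scores, true label 0, margin 0.5, rho = 1 -> loss 0.5.
print(margin_loss([2.0, 1.0, 0.5], y=0, rho=1.0))
```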

3 Theoretical Guarantees

In this section, we give theoretical guarantees for domain adaptation. All proofs can be found in the Appendix.

To reduce the error rate on the target domain with labeled training data only on the source domain, the distributions $P$ and $Q$ should not be substantially dissimilar. Thus a measurement of their discrepancy is crucial in domain adaptation theory.

In the seminal work of Ben-David et al. (2010), the $\mathcal{H}\Delta\mathcal{H}$-divergence was proposed to measure such discrepancy:

$$d_{\mathcal{H}\Delta\mathcal{H}}(P, Q) \triangleq \sup_{h, h' \in \mathcal{H}} \Big| \mathbb{E}_{Q}\,\mathbf{1}\big[h \ne h'\big] - \mathbb{E}_{P}\,\mathbf{1}\big[h \ne h'\big] \Big|. \qquad (6)$$

Mansour et al. (2009c) extended this divergence to general loss functions, leading to the discrepancy distance:

$$\mathrm{disc}_{L}(P, Q) \triangleq \sup_{h, h' \in \mathcal{H}} \Big| \mathbb{E}_{Q}\, L(h', h) - \mathbb{E}_{P}\, L(h', h) \Big|, \qquad (7)$$

where $L$ should be a bounded function satisfying symmetry and the triangle inequality. Note that many widely used losses, e.g. the margin loss, do not satisfy these requirements.

With these discrepancy measures, generalization bounds based on VC-dimension and Rademacher complexity were rigorously derived for domain adaptation. While these theories have had an influential impact on algorithm design, there are two crucial directions for improvement:

  1. Generalization bounds for classification with scoring functions have not been formally studied in the domain adaptation setting. As scoring functions with margin loss provide an informative generalization bound in standard classification, there is strong motivation to develop a margin theory for domain adaptation.

  2. The hypothesis-induced discrepancies require taking suprema over the hypothesis space $\mathcal{H}$, while achieving a lower generalization bound requires minimizing these discrepancies. This naturally results in minimax optimization problems in which equilibrium is hard to reach, since $\mathcal{H}$ is generally overly large.

These directions are the pain points in practical algorithm design. When a practitioner designs a domain adaptation algorithm using scoring functions, one may doubt whether the algorithm is theoretically guaranteed, since there is a gap between the loss functions used in the theories and in the algorithms. Furthermore, there is another gap between the hypothesis-induced discrepancies in theories and the widely used divergences in domain adaptation algorithms, including Jensen-Shannon divergence (Ganin & Lempitsky, 2015), Maximum Mean Discrepancy (Gretton et al., 2012; Long et al., 2015), and Wasserstein distance (Courty et al., 2017). In this work, we aim to bridge the gaps between the theories and algorithms for domain adaptation. Our margin theory is developed based on a novel margin disparity discrepancy.

3.1 Margin Disparity Discrepancy

First, we give an improved discrepancy for measuring the distribution difference by restricting the hypothesis space.

Given two hypotheses $h, h' \in \mathcal{H}$, we define the (expected) 0-1 disparity between them as

$$\mathrm{disp}_{D}(h', h) \triangleq \mathbb{E}_{D}\, \mathbf{1}\big[h' \ne h\big], \qquad (8)$$

and the empirical 0-1 disparity as

$$\mathrm{disp}_{\widehat{D}}(h', h) \triangleq \mathbb{E}_{\widehat{D}}\, \mathbf{1}\big[h' \ne h\big]. \qquad (9)$$

Definition 3.1 (Disparity Discrepancy, DD).

Given a hypothesis space $\mathcal{H}$ and a specific classifier $h \in \mathcal{H}$, the Disparity Discrepancy induced by $h$ is defined by

$$d_{h, \mathcal{H}}(P, Q) \triangleq \sup_{h' \in \mathcal{H}}\big(\mathrm{disp}_{Q}(h', h) - \mathrm{disp}_{P}(h', h)\big). \qquad (10)$$

Similarly, the empirical disparity discrepancy is

$$d_{h, \mathcal{H}}(\widehat{P}, \widehat{Q}) \triangleq \sup_{h' \in \mathcal{H}}\big(\mathrm{disp}_{\widehat{Q}}(h', h) - \mathrm{disp}_{\widehat{P}}(h', h)\big). \qquad (11)$$

Note that the disparity discrepancy depends not only on the hypothesis space $\mathcal{H}$, but also on a chosen classifier $h$. In the Appendix we shall prove that this discrepancy can well measure the difference between distributions (it is actually a pseudo-metric in the binary case). Furthermore, for binary classification the target expected risk can be controlled by a VC-dimension generalization bound via the disparity discrepancy. Compared with the $\mathcal{H}\Delta\mathcal{H}$-divergence, the supremum in the disparity discrepancy is taken only over the hypothesis space $\mathcal{H}$ and thus can be optimized more easily. This will significantly ease the minimax optimization widely used in many domain adaptation algorithms.

In the case of multiclass classification, the margin of scoring functions becomes an important factor for an informative generalization bound, as envisioned by Koltchinskii et al. (2002). Existing domain adaptation theories (Ben-David et al., 2007, 2010; Blitzer et al., 2008; Mansour et al., 2009c) do not give a formal analysis of generalization bounds with scoring functions and margin loss. Hence, to bridge the gap between theories, which typically analyze labeling functions and losses satisfying symmetry and subadditivity, and algorithms, which widely adopt scoring functions and margin losses, we propose a margin-based disparity discrepancy.

The margin disparity, i.e. the disparity obtained by replacing the 0-1 loss with the margin loss, and its empirical version, from hypothesis $f$ to $f'$, are defined as

$$\mathrm{disp}^{(\rho)}_{D}(f', f) \triangleq \mathbb{E}_{x \sim D}\Big[\Phi_{\rho}\big(\rho_{f'}(x, h_f(x))\big)\Big], \qquad \mathrm{disp}^{(\rho)}_{\widehat{D}}(f', f) \triangleq \mathbb{E}_{x \sim \widehat{D}}\Big[\Phi_{\rho}\big(\rho_{f'}(x, h_f(x))\big)\Big]. \qquad (12)$$

Note that here $f$ and $f'$ are scoring functions while $h_f$ and $h_{f'}$ are their induced labeling functions. Note also that the margin disparity is not a symmetric function of $f$ and $f'$, and the generalization theory w.r.t. this loss can be quite different from that for the discrepancy distance (Mansour et al., 2009c), which requires symmetry and subadditivity.

Definition 3.2 (Margin Disparity Discrepancy, MDD).

With the definition of margin disparity, we define the Margin Disparity Discrepancy and its empirical version by

$$d^{(\rho)}_{f,\mathcal{F}}(P, Q) \triangleq \sup_{f' \in \mathcal{F}} \big(\mathrm{disp}^{(\rho)}_{Q}(f', f) - \mathrm{disp}^{(\rho)}_{P}(f', f)\big), \qquad d^{(\rho)}_{f,\mathcal{F}}(\widehat{P}, \widehat{Q}) \triangleq \sup_{f' \in \mathcal{F}} \big(\mathrm{disp}^{(\rho)}_{\widehat{Q}}(f', f) - \mathrm{disp}^{(\rho)}_{\widehat{P}}(f', f)\big). \qquad (13)$$

We demonstrate several properties showing that the Margin Disparity Discrepancy (MDD) is well defined and able to measure the difference between distributions. First, it is straightforward to verify that $d^{(\rho)}_{f,\mathcal{F}}(P, Q)$ and $d^{(\rho)}_{f,\mathcal{F}}(\widehat{P}, \widehat{Q})$ satisfy nonnegativity and subadditivity. Despite its asymmetry, MDD is a well-defined discrepancy for domain adaptation due to the following proposition.

Proposition 3.3.

For any scoring function $f \in \mathcal{F}$,

$$\mathrm{err}_{Q}(h_f) \le \mathrm{err}^{(\rho)}_{P}(f) + d^{(\rho)}_{f,\mathcal{F}}(P, Q) + \lambda, \qquad (14)$$

where $\lambda$ is a constant independent of $f$.

This upper bound has a similar form to the learning bound proposed by Ben-David et al. (2010). The ideal error $\lambda$ is determined by the learning problem and can be reduced to a rather small value if the hypothesis space is rich enough. The term $\mathrm{err}^{(\rho)}_{P}(f)$ depicts the performance of $f$ on the source domain, and MDD bounds the performance gap caused by domain shift. This margin bound gives a new perspective for analyzing domain adaptation with respect to scoring functions and margin loss.
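For intuition, the empirical quantities in Definition 3.2 can be sketched directly from class scores. The NumPy snippet below (with our own illustrative names) evaluates the empirical margin disparities and their gap for one fixed candidate adversary $f'$; the MDD itself would take the supremum of this gap over $f'$, e.g. by training $f'$ to maximize it.

```python
import numpy as np

def margin_disparity(scores_fprime, scores_f, rho):
    """Empirical margin disparity disp^(rho)(f', f) on one sample:
    average rho-margin loss of f' measured against the labels predicted by f.
    scores_* have shape (num_examples, num_classes)."""
    preds_f = scores_f.argmax(axis=1)                        # h_f(x)
    idx = np.arange(len(preds_f))
    top = scores_fprime[idx, preds_f]                        # f'(x, h_f(x))
    masked = scores_fprime.copy()
    masked[idx, preds_f] = -np.inf                           # exclude h_f(x)
    margins = 0.5 * (top - masked.max(axis=1))               # rho_{f'}(x, h_f(x))
    return np.clip(1.0 - margins / rho, 0.0, 1.0).mean()     # Phi_rho, averaged

def mdd_gap(fprime_src, f_src, fprime_tgt, f_tgt, rho):
    """Target-minus-source disparity gap for a *fixed* f'; the empirical MDD
    is the supremum of this quantity over the hypothesis space."""
    return (margin_disparity(fprime_tgt, f_tgt, rho)
            - margin_disparity(fprime_src, f_src, rho))
```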

3.2 Domain Adaptation: Generalization Bounds

In this subsection, we provide several generalization bounds for multiclass domain adaptation based on margin loss and disparity discrepancy. First, we present a Rademacher complexity bound for the difference between MDD and its empirical version. Then, we combine the Rademacher bound for MDD and Proposition 3.3 to derive the final generalization bound. In addition, we bound the empirical Rademacher term by the notions of covering number and fat-shattering dimension for further interpretation.

To begin with, we introduce a new function class $\Pi_{\mathcal{H}}\mathcal{F}$ serving as a "scoring" version of the symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$ in Ben-David et al. (2010).

Definition 3.4.

Given a class of scoring functions $\mathcal{F}$ and the class of induced classifiers $\mathcal{H}$, we define $\Pi_{\mathcal{H}}\mathcal{F}$ as

$$\Pi_{\mathcal{H}}\mathcal{F} \triangleq \big\{\, x \mapsto f(x, h(x)) \;\big|\; h \in \mathcal{H},\ f \in \mathcal{F} \,\big\}. \qquad (15)$$

There is a geometric interpretation of the set $\Pi_{\mathcal{H}}\mathcal{F}$ (Galbis & Maestre, 2012). Assuming $\mathcal{X}$ is a manifold, assigning a vector space $\mathbb{R}^{k}$ to each point in $\mathcal{X}$ yields a vector bundle over $\mathcal{X}$. Now regarding the values of $h \in \mathcal{H}$ as one-hot vectors in $\mathbb{R}^{k}$, both $\mathcal{F}$ and $\mathcal{H}$ are sets of sections of this bundle containing (possibly piecewise continuous) vector fields. $\Pi_{\mathcal{H}}\mathcal{F}$ can then be seen as the space of pointwise inner products of vector fields from $\mathcal{F}$ and $\mathcal{H}$:

$$\Pi_{\mathcal{H}}\mathcal{F} = \big\{ \langle f, h \rangle \;\big|\; f \in \mathcal{F},\ h \in \mathcal{H} \big\}. \qquad (16)$$

Now we introduce the Rademacher complexity, commonly used in generalization theory as a measurement of the richness of a particular hypothesis space (Mohri et al., 2012).

Definition 3.5 (Rademacher Complexity).

Let $\mathcal{G}$ be a family of functions mapping from $\mathcal{Z}$ to $[a, b]$ and $S = \{z_1, \ldots, z_n\}$ a fixed sample of size $n$ drawn from the distribution $D$ over $\mathcal{Z}$. Then, the empirical Rademacher complexity of $\mathcal{G}$ with respect to the sample $S$ is defined as

$$\widehat{\mathfrak{R}}_{S}(\mathcal{G}) \triangleq \mathbb{E}_{\sigma}\, \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, g(z_i), \qquad (17)$$

where the $\sigma_i$'s are independent uniform random variables taking values in $\{-1, +1\}$. The Rademacher complexity is

$$\mathfrak{R}_{n, D}(\mathcal{G}) \triangleq \mathbb{E}_{S \sim D^{n}}\, \widehat{\mathfrak{R}}_{S}(\mathcal{G}). \qquad (18)$$
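Since the supremum in Eq. (17) is rarely computable in closed form, it may help to see the definition numerically on a toy finite class. The sketch below (our own, illustrative names) Monte Carlo approximates the expectation over the Rademacher variables and takes the supremum as an exact max over the finite class.

```python
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    *finite* function class.  `values` has shape (num_functions, n) with
    values[j, i] = g_j(z_i) on the fixed sample z_1, ..., z_n.  The expectation
    over sigma_i in {-1, +1} is approximated by random draws; the supremum is
    an exact max because the class is finite."""
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += (values @ sigma).max() / n   # sup_g (1/n) sum_i sigma_i g(z_i)
    return total / n_draws

# Example: two functions evaluated on a sample of size 4.
vals = np.array([[1.0, -1.0, 1.0, -1.0],
                 [0.5,  0.5, 0.5,  0.5]])
print(empirical_rademacher(vals))
```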

With the Rademacher complexity, we proceed to show that MDD can be well estimated through finite samples.

Lemma 3.6.

For any , with probability ,

(19)

Combining Proposition 3.3 and Lemma 3.6, we obtain a Rademacher complexity based generalization bound of the expected target error through the empirical MDD.

Theorem 3.7 (Generalization Bound).

For all , with probability ,

(20)

Compared with the bounds based on the 0-1 loss and the $\mathcal{H}\Delta\mathcal{H}$-divergence in Ben-David et al. (2010) and Mansour et al. (2009c), this generalization bound is sharper and more informative. By choosing a better margin $\rho$, we can obtain better generalizability on the target domain. Moreover, we point out that there is a trade-off between generalization and optimization in the choice of $\rho$. For a relatively small $\rho$ and a rich hypothesis space, the first two terms do not differ much as $\rho$ varies, so the right-hand side becomes smaller as $\rho$ increases. However, when $\rho$ is too large, these terms cannot be optimized to reach an acceptably small value.

Although we have developed the margin bound, the value of the empirical Rademacher complexity in Theorem 3.7 is still not explicit enough. We need to characterize how this term varies with the growth of the sample size. To this end, we describe the notion of covering number from Zhou (2002); Anthony & Bartlett (2009); Talagrand (2014).

Intuitively, a covering number is the minimal number of balls of radius $\epsilon$ needed to cover a class of bounded functions, and can be interpreted as a measure of the richness of the class at scale $\epsilon$. A rigorous definition is given in the Appendix together with a proof of the following covering number bound for MDD.

Theorem 3.8 (Generalization Bound by Covering Number).

Suppose is bounded in by . For , with probability ,

(21)

Although we have shown generalization bounds for multiclass classification via the covering number, a natural question arises: what is the relationship between the covering number bound and the traditional VC-dimension bound, especially when $k = 2$, i.e., when multiclass classification degenerates into binary classification? To answer this, we need the notion of fat-shattering dimension (Mendelson & Vershynin, 2003; Rakhlin & Sridharan, 2014). For concision, we only present the final result here and leave the definition and proof to the Appendix.

Corollary 3.9.

Let be the fat-shattering dimension of with scale and be the VC-dimension of . Then there exist constants independent of and such that for , with probability ,

(22)

Note that when the class number $k = 2$, the bound is stated in terms of the VC-dimension of the hypothesis space. Thus this term coincides with Ben-David et al. (2010) in the order of sample complexity.

In summary, our theory is a bold attempt towards filling the gaps mentioned at the beginning of this section. Firstly, we provide a thorough analysis of multiclass classification in domain adaptation. Secondly, our bound is based on scoring functions and margin loss. Thirdly, as the measure of distribution shift, MDD is defined by taking the supremum over a single hypothesis space $\mathcal{F}$, which can be much smaller than the space of hypothesis pairs required by previous discrepancies, making the minimax optimization problem easier to solve.

4 Algorithm

According to the above theory, we propose an adversarial representation learning method for domain adaptation. Note that MDD not only sheds light on understanding the role of margin in domain adaptation, but also leads to a strong and efficient algorithm with theoretical guarantees.

4.1 Minimax Optimization Problem

Recall that the expected error on the target domain is bounded by the sum of four terms: the empirical margin error on the source domain $\mathrm{err}^{(\rho)}_{\widehat{P}}(f)$, the empirical MDD $d^{(\rho)}_{f,\mathcal{F}}(\widehat{P}, \widehat{Q})$, the ideal error $\lambda$, and complexity terms. We need to solve the following minimization problem in the hypothesis space $\mathcal{F}$:

$$\min_{f \in \mathcal{F}}\; \mathrm{err}^{(\rho)}_{\widehat{P}}(f) + d^{(\rho)}_{f,\mathcal{F}}(\widehat{P}, \widehat{Q}). \qquad (23)$$

Minimizing the margin disparity discrepancy is a minimax problem, since MDD is defined as a supremum over the hypothesis space $\mathcal{F}$. Denote the feature extractor by $\psi$. Applying $\psi$ to the source and target distributions, the overall optimization problem can be written as

$$\min_{f, \psi}\; \mathrm{err}^{(\rho)}_{\widehat{P}^{\psi}}(f) + \sup_{f' \in \mathcal{F}}\Big(\mathrm{disp}^{(\rho)}_{\widehat{Q}^{\psi}}(f', f) - \mathrm{disp}^{(\rho)}_{\widehat{P}^{\psi}}(f', f)\Big). \qquad (24)$$

To enable representation-based domain adaptation, we need to learn new representations such that MDD is minimized.

Now we design an adversarial learning algorithm to solve this problem by introducing an auxiliary classifier $f'$ sharing the same hypothesis space $\mathcal{F}$ with $f$. Also, since the margin loss is hard to optimize via stochastic gradient descent in practice, we use a combination of loss functions $L$ and $L'$ in substitution for the margin loss, which well preserves the key property of the margin. The practical optimization problem in adversarial learning can be stated as

$$\min_{f, \psi}\; E(\widehat{P}^{\psi}) + \eta\, D_{\gamma}(\widehat{P}^{\psi}, \widehat{Q}^{\psi}; f, f^{*}), \qquad f^{*} = \operatorname*{argmax}_{f'}\, D_{\gamma}(\widehat{P}^{\psi}, \widehat{Q}^{\psi}; f, f'), \qquad (25)$$

where $\eta$ is the trade-off coefficient between the source error $E(\widehat{P}^{\psi})$ and MDD, $\gamma$ is designed to attain the margin (detailed in the next subsection), and

$$E(\widehat{P}^{\psi}) \triangleq \mathbb{E}_{(x, y) \sim \widehat{P}}\, L_{\mathrm{CE}}\big(f(\psi(x)), y\big). \qquad (26)$$

4.2 Combined Cross-Entropy Loss

As we have pointed out above, the multiclass margin loss (or hinge loss) suffers from vanishing gradients in stochastic gradient descent and thus cannot be optimized efficiently, especially for representation learning, which relies significantly on gradient propagation. To overcome this common issue, we choose different loss functions on the source and target domains and reweight them to approximate MDD.

Denote the softmax function by $\sigma$, i.e., for $z \in \mathbb{R}^{k}$,

$$\sigma_j(z) \triangleq \frac{e^{z_j}}{\sum_{i=1}^{k} e^{z_i}}, \qquad j = 1, \ldots, k. \qquad (27)$$

On the source domain, $\Phi_{\rho}$ and $\rho_{f'}$ are replaced by the cross-entropy loss

$$L\big(f'(\psi(x)), h_f(\psi(x))\big) \triangleq -\log\Big[\sigma_{h_f(\psi(x))}\big(f'(\psi(x))\big)\Big]. \qquad (28)$$

On the target domain, we use a modified cross-entropy loss

$$L'\big(f'(\psi(x)), h_f(\psi(x))\big) \triangleq \log\Big[1 - \sigma_{h_f(\psi(x))}\big(f'(\psi(x))\big)\Big]. \qquad (29)$$

Note that this modification was introduced in Goodfellow et al. (2014) to mitigate the burden of exploding or vanishing gradients when performing adversarial learning.

Combining the above two terms with a coefficient $\gamma$, the objective of the auxiliary classifier $f'$ can be formulated as

$$D_{\gamma}(\widehat{P}^{\psi}, \widehat{Q}^{\psi}; f, f') \triangleq \gamma\, \mathbb{E}_{(x, y) \sim \widehat{P}} \log\Big[\sigma_{h_f(\psi(x))}\big(f'(\psi(x))\big)\Big] + \mathbb{E}_{x \sim \widehat{Q}} \log\Big[1 - \sigma_{h_f(\psi(x))}\big(f'(\psi(x))\big)\Big]. \qquad (30)$$

We shall see that training the feature extractor $\psi$ to minimize the loss function (30), while the auxiliary classifier maximizes it, drives the empirical MDD down. When the source and target feature distributions are aligned, the optimal auxiliary classifier maximizes $\gamma \log p + \log(1 - p)$ pointwise, where $p = \sigma_{h_f(\psi(x))}(f'(\psi(x)))$; this is attained at $p = \gamma/(1 + \gamma)$, so at equilibrium this value is reached and the margin of $f'$ grows with $\gamma$. Thus we call $\gamma$ the margin factor. A detailed explanation is given in the Appendix.
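As a concrete illustration of Eqs. (28)-(30), the PyTorch sketch below shows one way the combined objective could be computed. All names (combined_mdd_loss, adv_src_logits, etc.) are ours, and this is an illustrative reading of the equations rather than the released implementation. The function returns $-D_{\gamma}$, so the auxiliary classifier can decrease it with a standard optimizer (thereby maximizing $D_{\gamma}$), while a gradient reversal layer placed before the auxiliary classifier makes the feature extractor do the opposite.

```python
import torch
import torch.nn.functional as F

def combined_mdd_loss(adv_src_logits, adv_tgt_logits, src_preds, tgt_preds, gamma):
    """Negative of D_gamma in Eq. (30): gamma-weighted cross-entropy between the
    auxiliary classifier f' and the labels predicted by the main classifier f on
    the source, plus the modified -log(1 - p) term on the target.  Minimizing this
    w.r.t. f' maximizes D_gamma; reversing its gradient into the feature extractor
    makes the features minimize D_gamma (the MDD surrogate)."""
    # Source term: gamma * E[-log sigma_{h_f(x)}(f'(x))]
    src_term = gamma * F.cross_entropy(adv_src_logits, src_preds)
    # Target term: E[-log(1 - sigma_{h_f(x)}(f'(x)))]
    tgt_prob = F.softmax(adv_tgt_logits, dim=1)
    p = tgt_prob.gather(1, tgt_preds.unsqueeze(1)).squeeze(1).clamp(max=1 - 1e-6)
    tgt_term = -torch.log1p(-p).mean()
    return src_term + tgt_term

# src_preds / tgt_preds play the role of h_f(x): the argmax of the main
# classifier's logits, e.g. src_preds = main_logits_src.argmax(dim=1).detach()
```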

On the other hand, as we have explained in Section 3, we cannot let $\rho$ (or correspondingly $\gamma$) go to infinity. In fact, from an empirical view, $\gamma$ can only be chosen far below its theoretically optimal value, since performing SGD with a large $\gamma$ may lead to exploding gradients. In summary, the choice of $\gamma$ is crucial in our method, and we prefer a relatively large $\gamma$ in practice whenever exploding gradients are not encountered.

5 Experiments

We validate the proposed learning method on three benchmark datasets against deep domain adaptation methods. Code and datasets will be made available online.

Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg
ResNet-50 (He et al., 2016) | 68.4±0.2 | 96.7±0.1 | 99.3±0.1 | 68.9±0.2 | 62.5±0.3 | 60.7±0.3 | 76.1
DAN (Long et al., 2015) | 80.5±0.4 | 97.1±0.2 | 99.6±0.1 | 78.6±0.2 | 63.6±0.3 | 62.8±0.2 | 80.4
DANN (Ganin et al., 2016) | 82.0±0.4 | 96.9±0.2 | 99.1±0.1 | 79.7±0.4 | 68.2±0.4 | 67.4±0.5 | 82.2
ADDA (Tzeng et al., 2017) | 86.2±0.5 | 96.2±0.3 | 98.4±0.3 | 77.8±0.3 | 69.5±0.4 | 68.9±0.5 | 82.9
JAN (Long et al., 2017) | 85.4±0.3 | 97.4±0.2 | 99.8±0.2 | 84.7±0.3 | 68.6±0.3 | 70.0±0.4 | 84.3
GTA (Sankaranarayanan et al., 2018) | 89.5±0.5 | 97.9±0.3 | 99.8±0.4 | 87.7±0.5 | 72.8±0.3 | 71.4±0.4 | 86.5
MCD (Saito et al., 2018) | 89.6±0.2 | 98.5±0.1 | 100.0±0.0 | 91.3±0.2 | 69.6±0.1 | 70.8±0.3 | 86.6
CDAN (Long et al., 2018) | 94.1±0.1 | 98.6±0.1 | 100.0±0.0 | 92.9±0.2 | 71.0±0.3 | 69.3±0.3 | 87.7
MDD | 94.5±0.3 | 98.4±0.1 | 100.0±0.0 | 93.5±0.2 | 74.6±0.3 | 72.2±0.1 | 88.9
Table 1: Accuracy (%) on Office-31 for unsupervised domain adaptation (ResNet-50)
Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg
ResNet-50 (He et al., 2016) | 34.9 | 50.0 | 58.0 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1
DAN (Long et al., 2015) | 43.6 | 57.0 | 67.9 | 45.8 | 56.5 | 60.4 | 44.0 | 43.6 | 67.7 | 63.1 | 51.5 | 74.3 | 56.3
DANN (Ganin et al., 2016) | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6
JAN (Long et al., 2017) | 45.9 | 61.2 | 68.9 | 50.4 | 59.7 | 61.0 | 45.8 | 43.4 | 70.3 | 63.9 | 52.4 | 76.8 | 58.3
CDAN (Long et al., 2018) | 50.7 | 70.6 | 76.0 | 57.6 | 70.0 | 70.0 | 57.4 | 50.9 | 77.3 | 70.9 | 56.7 | 81.6 | 65.8
MDD | 54.9 | 73.7 | 77.8 | 60.0 | 71.4 | 71.8 | 61.2 | 53.6 | 78.1 | 72.5 | 60.2 | 82.3 | 68.1
Table 2: Accuracy (%) on Office-Home for unsupervised domain adaptation (ResNet-50)
Method | Synthetic→Real
JAN (Long et al., 2017) | 61.6
GTA (Sankaranarayanan et al., 2018) | 69.5
MCD (Saito et al., 2018) | 69.8
CDAN (Long et al., 2018) | 70.0
MDD | 74.6
Table 3: Accuracy (%) on VisDA-2017 (ResNet-50)

5.1 Setup

Office-31 (Saenko et al., 2010) is a standard domain adaptation dataset of three diverse domains: Amazon, from the Amazon website; Webcam, captured by web camera; and DSLR, captured by digital SLR camera. It contains 31 unbalanced classes and 4,652 images.

Office-Home (Venkateswara et al., 2017) is a more complex dataset containing 15,500 images from four extremely distinct domains: Artistic images, Clip Art, Product images, and Real-World images. We conduct experiments with our method on all twelve transfer tasks among these four domains.

VisDA-2017 (Peng et al., 2017) is a simulation-to-real dataset with two domains: Synthetic, consisting of renderings of 3D models generated from different angles and under different lighting conditions, and Real, collected from photo-realistic or real-image datasets. Since the 3D models were generated in a clean environment, the Synthetic domain is very different from the Real domain. With 280K images across 12 classes, the scale of VisDA-2017 also brings challenges to domain adaptation.

We compare our model based on Margin Disparity Discrepancy (MDD) with state-of-the-art domain adaptation methods: Deep Adaptation Network (DAN) (Long et al., 2015), Domain Adversarial Neural Network (DANN) (Ganin et al., 2016), Joint Adaptation Network (JAN) (Long et al., 2017), Adversarial Discriminative Domain Adaptation (ADDA) (Tzeng et al., 2017), Generate to Adapt (GTA) (Sankaranarayanan et al., 2018), Maximum Classifier Discrepancy (MCD) (Saito et al., 2018), and Conditional Domain Adversarial Network (CDAN) (Long et al., 2018).

We follow the commonly used experimental protocol for unsupervised domain adaptation from Ganin & Lempitsky (2015) and Long et al. (2018). We report the average accuracies over five independent runs. Importance-weighted cross-validation (IWCV) is employed in all experiments for hyper-parameter selection. The asymptotic value of the trade-off coefficient $\eta$ is fixed, and the margin factor $\gamma$ is chosen from a small candidate set and kept the same for all tasks on the same dataset.

We implement our algorithm in PyTorch. For the deep learning experiments, ResNet-50 (He et al., 2016) is adopted as the feature extractor, with parameters fine-tuned from an ImageNet pre-trained model (Russakovsky et al., 2014). The main classifier and the auxiliary classifier are both 2-layer neural networks of width 1024. For optimization, we use mini-batch SGD with Nesterov momentum 0.9. The minimax problem is tackled by introducing a gradient reversal layer (Ganin et al., 2016). The learning rate of the classifiers is set to 10 times that of the feature extractor, and its value is adjusted according to Ganin et al. (2016).
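For reference, a gradient reversal layer of the kind introduced by Ganin et al. (2016) can be written in a few lines of PyTorch. The sketch below is our own, with hypothetical names, and only illustrates the mechanism by which the feature extractor and the auxiliary classifier receive opposite gradients from a single loss.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -alpha in the
    backward pass, so the layers before it are trained to *maximize* whatever
    loss the layers after it minimize."""

    @staticmethod
    def forward(ctx, x, alpha=1.0):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

def grad_reverse(x, alpha=1.0):
    return GradientReversal.apply(x, alpha)

# Usage sketch: features -> main classifier f, and features -> GRL -> auxiliary f'.
# feats = extractor(images)
# logits_f = f(feats)                      # trained with the source cross-entropy
# logits_adv = f_aux(grad_reverse(feats))  # f' maximizes the MDD surrogate, while
#                                          # the extractor is pushed to minimize it
```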

5.2 Results

The results on Office-31 are reported in Table 1. MDD achieves state-of-the-art accuracies on five out of six transfer tasks. We notice that in previous works, feature alignment methods (JAN, CDAN) generally perform better on large-to-small tasks (A→W, A→D), while pixel-level adaptation methods (GTA) tend to obtain higher accuracy on small-to-large ones (W→A, D→A). Nevertheless, our algorithm outperforms both types of methods on almost all tasks, demonstrating its effectiveness and universality.

Tables 2 and 3 present the accuracies of our algorithm on Office-Home and VisDA-2017. We achieve a remarkable performance boost on all tasks. Note that some of the methods listed in the tables use additional techniques, such as entropy minimization, to enhance their performance. Our method possesses both simplicity and performance strength.

Margin γ | A→W | D→A | Avg on Office-31
1 | 92.5 | 70.9 | 87.5
2 | 93.7 | 73.0 | 88.1
3 | 94.0 | 73.7 | 88.5
4 | 94.5 | 74.6 | 88.9
Table 4: Accuracy (%) on Office-31 with different margins
Figure 1: Test accuracy and the values of $\sigma_{h_f(\psi(x))}(f'(\psi(x)))$ on transfer task D→A. Panels: (a) Test Accuracy; (b) Equilibrium on Source; (c) Equilibrium on Target.
Figure 2: Empirical values of the margin disparity discrepancy (MDD) computed by the auxiliary classifier $f'$. Panels: (a) MDD w/o Minimization; (b) DD; (c)–(d) MDD under the two chosen margin factors.

5.3 Verification and Analysis

In our adversarial learning algorithm, we use the combined cross-entropy loss in place of the margin loss and the disparity discrepancy of our proposed theory. We need to show that despite this technical modification, our algorithm can still effectively reduce the empirical MDD computed according to the auxiliary classifier $f'$:

$$\mathrm{disp}^{(\rho)}_{\widehat{Q}^{\psi}}(f', f) - \mathrm{disp}^{(\rho)}_{\widehat{P}^{\psi}}(f', f). \qquad (31)$$

We choose several values of $\gamma$ for comparison. The expected margin should be attained in the last two cases, while there is no such guarantee for the 0-1 case. Correspondingly, we examine DD (based on the 0-1 loss) and MDD under two margin factors for task D→A and show the results in Figures 1 and 2.

First, we verify that, without the minimization part of the adversarial training, the auxiliary classifier trained with Eq. (30) is close to the one that maximizes MDD over $\mathcal{F}$. We examine this by training the auxiliary classifier alone, with the feature extractor fixed, and show the results in Figure 2(a), where MDD rises to a plateau shortly after training begins, implying that the loss function we use can well substitute for MDD.

Next, we consider the equilibrium of the minimax optimization. The average values of $\sigma_{h_f(\psi(x))}(f'(\psi(x)))$ over the source and target domains are presented in Figures 1(b) and 1(c). We can see that at the final training stage, this value is close to the predicted value $\gamma/(1+\gamma)$, which gives rise to a large margin.

Last, by visualizing the values of DD, the two MDD variants, and the test accuracy computed over the whole dataset every 100 steps, we can see that a larger $\gamma$ leads to smaller MDD and higher test accuracy. This conforms to our theory.

6 Related Work

Domain Adaptation Theory

The earliest theoretical work in this field was conducted by Ben-David et al. (2007). They proposed the $\mathcal{H}$-divergence as a substitute for traditional distribution discrepancies (e.g. total variation, KL-divergence), which overcame the difficulty of estimation from finite samples. Analysis via the $\mathcal{H}$-divergence yields a generalization bound on the target expected risk measured by the 0-1 loss in the setting of binary classification. Advancing this work, Mansour et al. (2009c) considered a general class of loss functions satisfying symmetry and subadditivity and developed a generalization theory with respect to the newly proposed discrepancy distance. Mohri & Medina (2012) and Zhang et al. (2012) proposed discrepancy notions for domain adaptation with partially labeled target data. Cortes & Mohri (2014); Cortes et al. (2015) further proposed a theory for regression tasks under domain adaptation via the discrepancy distance and the generalized discrepancy, together with discrepancy minimization and importance reweighting algorithms. In addition, Germain et al. (2013) proposed a PAC-Bayesian theory for domain adaptation using the domain disagreement pseudometric. Previous works mainly developed theories based on symmetric losses such as the 0-1 loss. We extend them to margin loss and scoring functions for multiclass classification.

Domain Adaptation Algorithm

With the increasing prevalence of deep learning, domain adaptation methods based on deep neural networks have achieved great success in recent years (Long et al., 2015; Ganin & Lempitsky, 2015). These works aim to learn domain-invariant representations by minimizing a certain discrepancy between the distributions of source and target features extracted by a shared representation learner. With insights both from the theory of Ben-David et al. (2010) and from the practice of adversarial learning (Goodfellow et al., 2014), Ganin & Lempitsky (2015) put forward the domain adversarial neural network (DANN). Similar to the two-player game in Goodfellow et al. (2014), an auxiliary network named the discriminator is trained to distinguish source from target domain features, while a feature extractor is trained to confuse the discriminator, both by stochastic gradient descent. They also pointed out in that work that the success of DANN possibly results from the minimization of the $\mathcal{H}$-divergence between feature distributions. Since then, a series of works have appeared and achieved significantly better performance (Tzeng et al., 2017; Long et al., 2018; Sankaranarayanan et al., 2018; Hoffman et al., 2018b). Tzeng et al. (2017) proposed an architecture that employs asymmetric encodings for target and source data. Long et al. (2018) presented a principled framework that conditions the adversarial adaptation models on semantic information. Hoffman et al. (2018b) and Sankaranarayanan et al. (2018) unified pixel-level and feature-level adversarial learning for domain adaptation. Saito et al. (2018) considered the classifiers instead of the features and designed an original adversarial learning method that maximizes classifier discrepancy as an approximation of the $\mathcal{H}\Delta\mathcal{H}$-divergence. However, they trained two classifiers on the source domain and used the $L_1$ loss to approximate the 0-1 loss, which may not be rigorously guaranteed by the theory.

7 Conclusion

We present a theoretical and algorithmic analysis of domain adaptation, which bridges the gap between theory and algorithm. We derive new generalization bounds in terms of Rademacher complexity. Our analysis is more suitable for analyzing real-world domain adaptation problems, and the theory-induced algorithm achieves state-of-the-art results.

References

  • Anthony & Bartlett (2009) Anthony, M. and Bartlett, P. L. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
  • Ben-David et al. (2007) Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems (NIPS), 2007.
  • Ben-David et al. (2010) Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
  • Blitzer et al. (2008) Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Wortman, J. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems 20, pp. 129–136. 2008.
  • Cortes & Mohri (2014) Cortes, C. and Mohri, M. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
  • Cortes et al. (2015) Cortes, C., Mohri, M., and Muñoz Medina, A. Adaptation algorithm and theory based on generalized discrepancy. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. ACM, 2015.
  • Courty et al. (2017) Courty, N., Flamary, R., Habrard, A., and Rakotomamonjy, A. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems (NIPS), pp. 3730–3739. 2017.
  • Crammer et al. (2008) Crammer, K., Kearns, M., and Wortman, J. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.
  • Galbis & Maestre (2012) Galbis, A. and Maestre, M. Vector analysis versus vector calculus. Springer Science & Business Media, 2012.
  • Ganin & Lempitsky (2015) Ganin, Y. and Lempitsky, V. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015.
  • Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research (JMLR), 17(1):2096–2030, 2016.
  • Germain et al. (2013) Germain, P., Habrard, A., Laviolette, F., and Morvant, E. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In Proceedings of the 30th International Conference on Machine Learning, pp. 738–746. PMLR, 2013.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • Gretton et al. (2012) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research (JMLR), 13:723–773, 2012.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Hoffman et al. (2018a) Hoffman, J., Mohri, M., and Zhang, N. Algorithms and theory for multiple-source adaptation. In Advances in Neural Information Processing Systems 31, pp. 8256–8266. 2018a.
  • Hoffman et al. (2018b) Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A., and Darrell, T. CyCADA: Cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, pp. 1989–1998, 2018b.
  • Koltchinskii et al. (2002) Koltchinskii, V., Panchenko, D., et al. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.
  • Long et al. (2015) Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), 2015.
  • Long et al. (2017) Long, M., Wang, J., and Jordan, M. I. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning (ICML), 2017.
  • Long et al. (2018) Long, M., Cao, Z., Wang, J., and Jordan, M. I. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems 31, pp. 1647–1657. 2018.
  • Mansour et al. (2009a) Mansour, Y., Mohri, M., and Rostamizadeh, A. Multiple source adaptation and the Rényi divergence. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 367–374. AUAI Press, 2009a.
  • Mansour et al. (2009b) Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems 21, pp. 1041–1048. 2009b.
  • Mansour et al. (2009c) Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In 22nd Conference on Learning Theory, COLT 2009, 2009c.
  • Mendelson & Vershynin (2003) Mendelson, S. and Vershynin, R. Entropy and the combinatorial dimension. Inventiones mathematicae, 152(1):37–55, 2003.
  • Mohri & Medina (2012) Mohri, M. and Medina, A. M. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pp. 124–138. Springer, 2012.
  • Mohri et al. (2012) Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2012.
  • Pan & Yang (2010) Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(10):1345–1359, 2010.
  • Pan et al. (2011) Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks (TNN), 22(2):199–210, 2011.
  • Peng et al. (2017) Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., and Saenko, K. Visda: The visual domain adaptation challenge. CoRR, abs/1710.06924, 2017.
  • Quionero-Candela et al. (2009) Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset shift in machine learning. The MIT Press, 2009.
  • Rakhlin & Sridharan (2014) Rakhlin, A. and Sridharan, K. Statistical learning and sequential prediction. Book Draft, 2014.
  • Russakovsky et al. (2014) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. 2014.
  • Saenko et al. (2010) Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. In European Conference on Computer Vision (ECCV), 2010.
  • Saito et al. (2018) Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3723–3732, 2018.
  • Sankaranarayanan et al. (2018) Sankaranarayanan, S., Balaji, Y., Castillo, C. D., and Chellappa, R. Generate to adapt: Aligning domains using generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • Talagrand (2014) Talagrand, M. Upper and lower bounds for stochastic processes: modern methods and classical problems, volume 60. Springer Science & Business Media, 2014.
  • Tzeng et al. (2014) Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.
  • Tzeng et al. (2017) Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Venkateswara et al. (2017) Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Zhang et al. (2012) Zhang, C., Zhang, L., and Ye, J. Generalization bounds for domain adaptation. In Advances in Neural Information Processing Systems 25, pp. 3320–3328. 2012.
  • Zhou (2002) Zhou, D.-X. The covering number in learning theory. Journal of Complexity, 18(3):739–767, 2002.