Unsupervised Domain Adaptation Based on Source-guided Discrepancy

09/11/2018 ∙ by Seiichi Kuroki, et al.

Unsupervised domain adaptation is the problem setting where the data-generating distributions in the source and target domains are different, and labels in the target domain are unavailable. One important question in unsupervised domain adaptation is how to measure the difference between the source and target domains. A previously proposed discrepancy that does not use the source domain labels requires a high computational cost to estimate and may lead to a loose generalization error bound in the target domain. To mitigate these problems, we propose a novel discrepancy called source-guided discrepancy (S-disc), which exploits labels in the source domain. As a consequence, S-disc can be computed efficiently with a finite-sample convergence guarantee. In addition, we show that S-disc can provide a tighter generalization error bound than the one based on an existing discrepancy. Finally, we report experimental results that demonstrate the advantages of S-disc over the existing discrepancies.


Introduction

In the conventional supervised learning framework, we often assume that the training and test distributions are the same. However, this assumption may not hold in many practical applications such as spam filtering, sentiment analysis (Glorot, Bordes, and Bengio, 2011), natural language processing (Jiang and Zhai, 2007), speech recognition (Sun et al., 2017), and computer vision (Saito, Ushiku, and Harada, 2017). For instance, a personalized spam filter can be trained from the emails of all available users, but the training data may not represent the emails of the target user. Such scenarios can be formulated in the framework of domain adaptation, which has been studied extensively (Ben-David et al., 2007; Mansour, Mohri, and Rostamizadeh, 2009a; Zhang, Zhang, and Ye, 2012). One important goal in domain adaptation is to find a classifier in the label-scarce target domain by exploiting the label-rich source domains. In particular, there are cases where we only have access to labeled data from the source domain and unlabeled data from the target domain due to privacy concerns and expensive annotation costs. This problem setting is called unsupervised domain adaptation, which is our interest in this paper.

Since domain adaptation cannot be done effectively if the source and target domains are too different, one important question is how to measure the difference between the two domains. Many discrepancy measures have been used, such as the Wasserstein distance (Courty et al., 2017), the Rényi divergence (Mansour, Mohri, and Rostamizadeh, 2009b), the maximum mean discrepancy (Huang et al., 2007), and the KL divergence (Sugiyama et al., 2008).

Apart from the previously mentioned discrepancies, Ben-David et al. (2007) proposed a discrepancy for binary classification that takes a hypothesis class into account. They showed that this discrepancy gives a tighter generalization error bound than the $L^1$ distance, which does not use information of the hypothesis class. Following this line of research, Mansour, Mohri, and Rostamizadeh (2009a) generalized the discrepancy of Ben-David et al. (2007) to arbitrary loss functions. This line of research provided a theoretical foundation for various applications of domain adaptation, including source selection (Bhatt, Rajkumar, and Roy, 2016), sentiment analysis (Glorot, Bordes, and Bengio, 2011), and computer vision (Saito et al., 2018). However, their discrepancy considers the worst pair of hypotheses to bound the maximum gap of the loss between the domains, which may lead to a loose generalization error bound and an expensive computational cost. Zhang, Zhang, and Ye (2012) and Mohri and Medina (2012) analyzed another discrepancy with a generalization error bound. Nevertheless, that discrepancy cannot be computed without target labels and is therefore not suitable for unsupervised domain adaptation.

To alleviate the limitations of the existing discrepancies for domain adaptation, we propose a novel discrepancy called source-guided discrepancy (S-disc). By incorporating the source labels, S-disc provides a tighter generalization error bound than the discrepancy previously proposed by Mansour, Mohri, and Rostamizadeh (2009a). Furthermore, we provide an estimator of S-disc with an efficient algorithm for binary classification. We also derive the consistency and a convergence rate of the estimator of S-disc. Our main contributions are as follows.

  • We propose a novel discrepancy called source-guided discrepancy (S-disc), which uses the source labels to measure the difference between the two domains for unsupervised domain adaptation (Definition 1).

  • We propose an efficient algorithm for the estimation of S-disc for the 0-1 loss (Algorithm 1).

  • We show the consistency and derive a convergence rate of the estimator of S-disc (Theorem 4).

  • We derive a generalization error bound in the target domain based on S-disc (Theorem 7), which is tighter than the existing bound provided by Mansour, Mohri, and Rostamizadeh (2009a). In addition, we derive a generalization error bound for the finite sample case (Theorem 8).

  • We demonstrate the effectiveness of S-disc for unsupervised domain adaptation through experiments.

Problem Setting and Notation

In this section, we formulate the unsupervised domain adaptation problem. Let $\mathcal{X}$ be the input space and $\mathcal{Y}$ be the output space, which is $\{-1, +1\}$ in binary classification. We define a domain as a pair $(P, f)$, where $P$ is an input distribution on $\mathcal{X}$ and $f: \mathcal{X} \to \mathcal{Y}$ is a labeling function. In domain adaptation, we denote the source domain and the target domain by $(P_S, f_S)$ and $(P_T, f_T)$, respectively. Unlike conventional supervised learning, we focus on the case where $(P_S, f_S)$ differs from $(P_T, f_T)$. In unsupervised domain adaptation, we are given the following data:

  • Target unlabeled data $X_T = \{x^T_j\}_{j=1}^{n_T}$ drawn from $P_T$.

  • Source labeled data $\{(x^S_i, y^S_i)\}_{i=1}^{n_S}$, where $x^S_i$ is drawn from $P_S$ and $y^S_i = f_S(x^S_i)$.

For simplicity, we denote the input features from the source domain as $X_S = \{x^S_i\}_{i=1}^{n_S}$, and the empirical distributions corresponding to $X_S$ (resp. $X_T$) as $\hat{P}_S$ (resp. $\hat{P}_T$).

We denote a loss function by $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$. For example, the 0-1 loss is given as $\ell_{01}(y, y') = \mathbb{1}[y \ne y']$. The expected loss for functions $h, h'$ and a distribution $P$ over $\mathcal{X}$ is denoted by $L_P(h, h') = \mathbb{E}_{x \sim P}[\ell(h(x), h'(x))]$. We also define the corresponding empirical risk as $L_{\hat{P}}(h, h') = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), h'(x_i))$. In addition, we define the true risk minimizer and the empirical risk minimizer of a domain $(P, f)$ in a hypothesis class $\mathcal{H}$ as $h^*_P \in \arg\min_{h \in \mathcal{H}} L_P(h, f)$ and $\hat{h}_P \in \arg\min_{h \in \mathcal{H}} L_{\hat{P}}(h, f)$, respectively. Note that the risk minimizer is not necessarily equal to the labeling function $f$, as we consider a restricted hypothesis class.

The goal in domain adaptation is to find a hypothesis $h$ out of a hypothesis class $\mathcal{H}$ that gives as small an expected loss as possible in the target domain, given by $L_{P_T}(h, f_T)$.

Related Work

In unsupervised domain adaptation, it is essential to measure the difference between the source and target domains, because using a source domain that is far from the target domain may degrade performance (Pan and Yang, 2010). Since we cannot access the target labels, it is impossible to measure the difference between the two domains on the basis of the output space $\mathcal{Y}$. One way is instead to measure the difference between the probability distributions in terms of the input space $\mathcal{X}$.

Existing Discrepancies in Domain Adaptation

First, we review the discrepancy proposed by Mansour, Mohri, and Rostamizadeh (2009a), which is defined as

$$\mathrm{disc}(P, Q) = \max_{h, h' \in \mathcal{H}} \bigl| L_P(h, h') - L_Q(h, h') \bigr|, \qquad (1)$$

which we call disc. This discrepancy does not require source labels but only the input features in $\mathcal{X}$. Note that disc takes the hypothesis class $\mathcal{H}$ into account.

To illustrate the advantage of considering $\mathcal{H}$, we compare disc to the $L^1$ distance. Mansour, Mohri, and Rostamizadeh (2009a) showed that the following inequality holds for any $h, h' \in \mathcal{H}$:

$$\bigl| L_P(h, h') - L_Q(h, h') \bigr| \le \mathrm{disc}(P, Q) \le M\, d_{L^1}(P, Q),$$

where $M$ is a constant. Therefore, a tighter bound on the difference of the expected losses between the two domains can be achieved by considering a hypothesis class $\mathcal{H}$. One drawback of disc is that it considers the worst pair of hypotheses, as can be seen from (1). This may lead to a loose generalization error bound, as we will show later, and also to an intractable computational cost for empirical estimation. Ben-David et al. (2007) provided a computationally efficient proxy of disc for the 0-1 loss, often referred to as the proxy $\mathcal{A}$-distance. Although this proxy can be computed efficiently, it comes with no learning guarantee.

There is another variant of disc called the generalized discrepancy (Cortes, Mohri, and Muñoz Medina, 2015). However, the theoretical analysis therein is only applicable to the regression task and is not suitable for binary classification, unlike the analysis for disc.

Another discrepancy that also takes the hypothesis class $\mathcal{H}$ into account, analyzed by Zhang, Zhang, and Ye (2012), is the $\mathcal{Y}$-discrepancy (Y-disc), which compares the expected losses measured against the true labeling functions of the two domains. While Mohri and Medina (2012) demonstrated that Y-disc provides a tighter generalization error bound than disc, it requires the labeling function of the target domain, which cannot be estimated in unsupervised domain adaptation in general.

Source-guided Discrepancy (S-disc)

To mitigate the limitations of the existing discrepancies, we propose a novel discrepancy called source-guided discrepancy (S-disc). In this paper, we show that S-disc can provide a tighter generalization bound and can be computed more efficiently than the existing discrepancies.

[Table 1 compares S-disc with the existing discrepancies in terms of convergence rate, target generalization error bound, and computational complexity.]
Table 1: Comparison of S-disc with the existing discrepancies that take the hypothesis class $\mathcal{H}$ into account for unsupervised domain adaptation in binary classification. We assume the hypothesis class satisfies the assumption in (4). Here, we consider the hinge loss for S-disc, disc, and the proxy $\mathcal{A}$-distance. The computational complexities of S-disc and the proxy $\mathcal{A}$-distance are based on empirical hinge loss minimization, which is solved with the kernel SVM by the SMO algorithm (Platt, 1999). The computational complexity of disc is based on the SDP relaxation solved by the ellipsoid method (Bubeck, 2015) given in the appendix. Y-disc is not computable with unlabeled target data.
Definition 1 (source-guided discrepancy).

Let $\mathcal{H}$ be a hypothesis class and let $\ell$ be a loss function. S-disc between two distributions $P$ and $Q$ is defined as

$$\mathrm{S\text{-}disc}(P, Q) = \max_{h \in \mathcal{H}} \bigl| L_P(h, h^*_Q) - L_Q(h, h^*_Q) \bigr|, \qquad (2)$$

where $h^*_Q$ is the risk minimizer in $\mathcal{H}$ with respect to the second distribution (here $Q$). For example, $h^*_{P_S}$ is used in $\mathrm{S\text{-}disc}(P_T, P_S)$, and $h^*_{P_T}$ in $\mathrm{S\text{-}disc}(P_S, P_T)$.

S-disc satisfies a triangle inequality. However, in general, S-disc is not a distance: it is not symmetric, i.e., $\mathrm{S\text{-}disc}(P, Q) \ne \mathrm{S\text{-}disc}(Q, P)$ in general, and we may have $\mathrm{S\text{-}disc}(P, Q) = 0$ for some $P \ne Q$.

This discrepancy has the following three advantages over the existing discrepancies. First, the computational cost of S-disc is low since, unlike disc, we do not need to consider a pair of hypotheses, as can be seen from (1) and (2). Second, our proposed S-disc yields a tighter bound, as discussed below. Mansour, Mohri, and Rostamizadeh (2009a) derived a generalization error bound by bounding the difference of the expected losses between the domains with disc. On the other hand, for any two distributions $P$ and $Q$, it is easy to see that

$$\mathrm{S\text{-}disc}(P, Q) \le \mathrm{disc}(P, Q) \qquad (3)$$

by the definition of S-disc, which implies that S-disc can give a tighter generalization error bound than disc. Third, since the source risk minimizer can be estimated from the source labels, we can compute S-disc without the target labels, unlike Y-disc. For these reasons, S-disc is more suitable for unsupervised domain adaptation than the existing discrepancies. A comparison of S-disc with the existing discrepancies is given in Table 1.
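As a quick check, inequality (3) follows directly from the two definitions; here is a minimal derivation, assuming the generic forms of disc and S-disc reconstructed in (1) and (2) above:

```latex
% Fixing h' = h^*_Q (one particular element of H) can only decrease the
% maximum over all pairs (h, h') appearing in the definition of disc:
\mathrm{S\text{-}disc}(P, Q)
  = \max_{h \in \mathcal{H}} \bigl| L_P(h, h^*_Q) - L_Q(h, h^*_Q) \bigr|
  \le \max_{h, h' \in \mathcal{H}} \bigl| L_P(h, h') - L_Q(h, h') \bigr|
  = \mathrm{disc}(P, Q).
```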

S-disc Estimation for the 0-1 Loss

Here we consider the task of binary classification with the 0-1 loss, where the output space is $\{-1, +1\}$. The following theorem states that S-disc estimation can be reduced to a cost-sensitive classification problem. We consider a symmetric hypothesis class $\mathcal{H}$, i.e., one that is closed under negation: for any $h \in \mathcal{H}$, $-h$ is also contained in $\mathcal{H}$.

Theorem 2.

For the 0-1 loss and a symmetric hypothesis class $\mathcal{H}$, S-disc can be written as the optimal value of a cost-sensitive classification problem over $\mathcal{H}$.

The proof of Theorem 2 is in the appendix. Theorem 2 suggests the three-step procedure illustrated in Algorithm 1. The minimization of the 0-1 loss is computationally hard (Ben-David, Eiron, and Long, 2003; Feldman et al., 2012). Hence, we use surrogate losses such as the hinge loss (Bartlett, Jordan, and McAuliffe, 2006). Note that the 0-1 loss is used for calculating S-disc in the final step, while a surrogate loss is used to learn the classifiers.

Input: labeled source data $\{(x^S_i, y^S_i)\}_{i=1}^{n_S}$, unlabeled target data $\{x^T_j\}_{j=1}^{n_T}$, surrogate loss $\phi$, hypothesis class $\mathcal{H}$.
Source learning:
Learn a classifier $\hat{h}_S$ using the labeled source data.
Pseudo labeling:
  • Assign the pseudo label $\hat{h}_S(x^S_i)$ to each source input $x^S_i$,
  • Assign the pseudo label $-\hat{h}_S(x^T_j)$ to each target input $x^T_j$.
Cost-sensitive learning from the pseudo-labeled data:
Learn another classifier $h'$ from the pseudo-labeled source and target data by minimizing the surrogate cost-sensitive risk.
return the empirical S-disc computed from $h'$ and $\hat{h}_S$ with the 0-1 loss.
Algorithm 1: S-disc Estimation for the 0-1 Loss

The hinge loss minimization with the SMO algorithm dominates the computational cost of the entire procedure (Platt, 1999). This low computational cost is a big advantage over disc (see Table 1). Note also that the estimate is clearly zero when the empirical source and target distributions coincide: in that case the pseudo labeling step assigns opposite labels on the same support of $\hat{P}_S$ and $\hat{P}_T$, which makes the two samples impossible to distinguish, so for every $h$ the two empirical losses in the definition of S-disc are equal and the estimate becomes zero.
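For concreteness, below is a minimal Python sketch of Algorithm 1, assuming a linear hypothesis class trained with the hinge loss via scikit-learn's LinearSVC and labels in {-1, +1}; the function name, the 1/n cost-sensitive weights, and the sign convention in the pseudo-labeling step are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC  # hinge-loss surrogate for the 0-1 loss

def estimate_s_disc(X_s, y_s, X_t, C=1.0):
    """Sketch of Algorithm 1: empirical S-disc estimate for the 0-1 loss.

    X_s, y_s: labeled source data (labels in {-1, +1}); X_t: unlabeled target data.
    """
    n_s, n_t = len(X_s), len(X_t)

    # Step 1 (source learning): learn h_S from the labeled source data.
    h_s = LinearSVC(C=C).fit(X_s, y_s)

    # Step 2 (pseudo labeling): opposite pseudo labels on the two domains.
    y_pseudo = np.concatenate([h_s.predict(X_s),    # +h_S(x) on source inputs
                               -h_s.predict(X_t)])  # -h_S(x) on target inputs
    X_all = np.vstack([X_s, X_t])
    # Cost-sensitive weights: each domain contributes total weight one.
    w_all = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, 1.0 / n_t)])

    # Step 3 (cost-sensitive learning): learn h' with the surrogate loss.
    h_prime = LinearSVC(C=C).fit(X_all, y_pseudo, sample_weight=w_all)

    # Final step: plug h' and h_S into the definition of S-disc with the 0-1 loss.
    loss_t = np.mean(h_prime.predict(X_t) != h_s.predict(X_t))
    loss_s = np.mean(h_prime.predict(X_s) != h_s.predict(X_s))
    return abs(loss_t - loss_s)
```

The returned value evaluates the maximum in (2) at the single cost-sensitively trained hypothesis $h'$, so it is a plug-in approximation (from below) of the empirical S-disc over the chosen linear class.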

Theoretical Analysis

In this section, we show that S-disc can be estimated from finite data with a consistency guarantee. After that, we show that our proposed S-disc is useful for deriving a tighter generalization error bound than disc. We also provide the corresponding finite-sample generalization error bounds. To derive the theoretical results, we use the Rademacher complexity, which captures the complexity of a set of functions by measuring its capability to correlate with random noise.

Definition 3 (Rademacher complexity).

Let $\mathcal{F}$ be a set of real-valued functions defined over a set $\mathcal{X}$. Given a sample $(x_1, \ldots, x_n)$ independently and identically drawn from a distribution $P$, the Rademacher complexity of $\mathcal{F}$ is defined as

$$\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}\,\mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\right],$$

where the inner expectation is taken over $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_n)$, which are mutually independent uniform random variables taking values in $\{-1, +1\}$.

Hereafter, we use the following notation:

  • $\mathfrak{R}_{n_S}(\mathcal{H})$: the Rademacher complexity of $\mathcal{H}$ with respect to a sample of size $n_S$ drawn from $P_S$,

  • $\mathfrak{R}_{n_T}(\mathcal{H})$: the Rademacher complexity of $\mathcal{H}$ with respect to a sample of size $n_T$ drawn from $P_T$.
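As an illustration of Definition 3, the empirical (sample-conditional) Rademacher complexity of a finite set of hypotheses can be approximated by Monte Carlo sampling of the sign vector; the finite class, array layout, and number of draws below are illustrative choices, and averaging over fresh samples would recover the expected version used in the definition.

```python
import numpy as np

def empirical_rademacher(preds, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    preds: array of shape (num_hypotheses, n), where preds[k, i] = f_k(x_i)
    for a finite class {f_1, ..., f_K} evaluated on the sample (x_1, ..., x_n).
    """
    rng = np.random.default_rng(seed)
    _, n = preds.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # uniform Rademacher signs
        total += np.max(preds @ sigma) / n       # sup over the finite class
    return total / n_draws
```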

Consistency of S-disc

In this section, we show that the estimator converges to the true S-disc as the sample sizes $n_S$ and $n_T$ increase. The following theorem gives the deviation of the empirical S-disc estimator; it is a general result that does not depend on the specific choice of the loss $\ell$ or the hypothesis class $\mathcal{H}$.

Theorem 4.

Assume the loss function $\ell$ is bounded from above by a constant $M > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

The proof of this theorem is given in the appendix.

Theorem 4 guarantees the consistency of the empirical S-disc estimator under the condition that the Rademacher complexities $\mathfrak{R}_{n_S}(\mathcal{H})$ and $\mathfrak{R}_{n_T}(\mathcal{H})$ are well-controlled.

To derive a specific convergence rate, we consider the assumption

$$\mathfrak{R}_n(\mathcal{H}) \le \frac{C_{\mathcal{H}}}{\sqrt{n}} \qquad (4)$$

for some constant $C_{\mathcal{H}} > 0$ depending only on the hypothesis class $\mathcal{H}$. Lemma 5 below shows that this assumption is naturally satisfied by the linear-in-parameter model.

Lemma 5.

Let $\mathcal{H}$ be the linear-in-parameter model class, i.e., $\mathcal{H} = \{x \mapsto \boldsymbol{w}^{\top}\boldsymbol{\phi}(x) : \|\boldsymbol{w}\| \le \Lambda\}$ for fixed basis functions $\boldsymbol{\phi}$ satisfying $\sup_{x \in \mathcal{X}} \|\boldsymbol{\phi}(x)\| \le R$. Then, the assumption in (4) holds with $C_{\mathcal{H}} = \Lambda R$, i.e., $\mathfrak{R}_n(\mathcal{H}) \le \Lambda R / \sqrt{n}$.
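For intuition, a bound of this form follows from a standard textbook argument for norm-bounded linear classes; the sketch below uses the constants $\Lambda$ and $R$ assumed in the statement above and is a generic derivation rather than the appendix proof.

```latex
% sup over the ball: Cauchy-Schwarz; the inequality: Jensen;
% the last equality: E[sigma_i sigma_j] = 0 for i != j.
\mathfrak{R}_n(\mathcal{H})
  = \mathbb{E}\!\left[\sup_{\|\boldsymbol{w}\|\le\Lambda}
      \frac{1}{n}\sum_{i=1}^{n}\sigma_i\,\boldsymbol{w}^{\top}\boldsymbol{\phi}(x_i)\right]
  = \frac{\Lambda}{n}\,\mathbb{E}\!\left[\Bigl\|\sum_{i=1}^{n}\sigma_i\boldsymbol{\phi}(x_i)\Bigr\|\right]
  \le \frac{\Lambda}{n}\sqrt{\mathbb{E}\Bigl\|\sum_{i=1}^{n}\sigma_i\boldsymbol{\phi}(x_i)\Bigr\|^{2}}
  = \frac{\Lambda}{n}\sqrt{\sum_{i=1}^{n}\mathbb{E}\bigl\|\boldsymbol{\phi}(x_i)\bigr\|^{2}}
  \le \frac{\Lambda R}{\sqrt{n}}.
```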

The proof of Lemma 5 is given in the appendix; the bound holds for both the source and the target samples under the same assumptions. Subsequently, we derive the convergence rate bound for the 0-1 loss.

Corollary 6.

When we consider the 0-1 loss, it holds for any $\delta > 0$ that, with probability at least $1 - \delta$, the deviation of the empirical S-disc estimator from the true S-disc is of order $O(1/\sqrt{n_S} + 1/\sqrt{n_T})$, under the assumption in (4).

Proof.

It simply follows from Theorem 4, the assumption in (4), and the fact that the Rademacher complexity of the 0-1 loss class is half that of $\mathcal{H}$ for any distribution and sample size (Mohri, Rostamizadeh, and Talwalkar, 2012, Lemma 3.1). ∎

From this corollary, we see that the empirical S-disc estimator is consistent with convergence rate $O(1/\sqrt{n_S} + 1/\sqrt{n_T})$ under a mild condition.

Generalization Error Bound

In the above section, we showed that S-disc can be consistently estimated from finite samples. In this section, we give two bounds on the generalization error in the target domain in terms of S-disc or its empirical estimate.

The first bound shows the relationship between the target risk and the source risk.

Theorem 7.

Assume that $\ell$ obeys the triangle inequality, i.e., $\ell(y_1, y_2) \le \ell(y_1, y_3) + \ell(y_3, y_2)$ for all $y_1, y_2, y_3 \in \mathcal{Y}$, as is the case for the 0-1 loss. Then, for any hypothesis $h \in \mathcal{H}$,

(5)
Proof.

Applying the triangle inequality of the loss and then bounding the resulting cross-domain term by the definition of S-disc yields the claim. ∎

The LHS of (5) represents the regret arising from the use of hypothesis $h$ instead of $h^*_{P_T}$ in the target domain. Theorem 7 shows that the regret is bounded by three terms: (a) the expected loss of $h$ with respect to the source risk minimizer in the source domain, (b) the difference between the source and target risk minimizers measured in the target domain, and (c) S-disc between the target and source distributions. Note that if the source and target domains are sufficiently close, we can expect the second and third terms to be small. This indicates that, for an appropriate source domain, minimizing the estimation error in the source domain leads to better generalization in the target domain.
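In the notation introduced in the problem setting ($h^*_{P_S}$ and $h^*_{P_T}$ for the source and target risk minimizers, $f_T$ for the target labeling function), the three-term structure described above can be written schematically as follows; this is a sketch of the shape of the bound rather than a verbatim restatement of (5).

```latex
\underbrace{L_{P_T}(h, f_T) - L_{P_T}(h^*_{P_T}, f_T)}_{\text{regret in the target domain}}
  \;\le\;
  \underbrace{L_{P_S}(h, h^*_{P_S})}_{\text{(a) source term}}
  + \underbrace{L_{P_T}(h^*_{P_S}, h^*_{P_T})}_{\text{(b) target term}}
  + \underbrace{\mathrm{S\text{-}disc}(P_T, P_S)}_{\text{(c) discrepancy term}}.
```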

We can see an advantage of the generalization error bound based on S-disc through a comparison with the bound based on disc (Mansour, Mohri, and Rostamizadeh, 2009a, Theorem 8), given by

(6)

The upper bound (6) using disc has the same form as the upper bound (5) except for the discrepancy term. Since S-disc is never larger than disc (see the inequality (3)), S-disc gives a tighter bound than disc.

The following theorem shows the generalization error bound for the finite sample case.

Theorem 8.

When we consider the 0-1 loss, for any $h \in \mathcal{H}$ and $\delta > 0$, with probability at least $1 - \delta$,

The proof of this theorem is given in the appendix.

Theorem 8 tells us that the bound on the regret in the target domain is dominated by three terms: (a) the empirical loss with respect to the empirical source risk minimizer in the source domain, (b) the difference between the source and target risk minimizers in the target domain, and (c) S-disc between the two empirical distributions. Therefore, if the source domain is sufficiently close to the target domain, picking a good source in terms of the S-disc estimator allows us to achieve good generalization in the target domain.

Comparison with Existing Discrepancies

In the previous sections, we showed the consistency of the estimator of S-disc and derived a generalization error bound based on S-disc that is tighter than the one based on disc. In this section, we first compare these theoretical guarantees of S-disc with those of the existing discrepancies in more detail. We then discuss their computational cost. This is also an important aspect when applying these discrepancies to sentiment analysis (Bhatt, Rajkumar, and Roy, 2016), adversarial learning (Zhao et al., 2017), and computer vision (Saito et al., 2018) for source selection or reweighting of the source data. In fact, in these applications the proxy $\mathcal{A}$-distance is used instead of disc for ease of computation, even though it has no theoretical guarantee on the generalization error. The results of this section are summarized in Table 1.

Convergence Rates of Discrepancy Estimators

Here we discuss the consistency and convergence rates of the estimators of discrepancies.

The empirical estimator of the proxy $\mathcal{A}$-distance is consistent, and its convergence rate depends on the VC dimension of the hypothesis class (Ben-David et al., 2010, Lemma 1). This rate is slower than the rate for S-disc when the Rademacher complexities $\mathfrak{R}_{n_S}(\mathcal{H})$ and $\mathfrak{R}_{n_T}(\mathcal{H})$ of the hypothesis class are appropriately controlled, as in the linear-in-parameter model (Mohri, Rostamizadeh, and Talwalkar, 2012, Theorem 4.3). Recall that no generalization error bound is known in terms of the proxy $\mathcal{A}$-distance, even though the proxy itself can be consistently estimated.

On the other hand, the empirical estimator of disc is shown to be consistent (Mansour, Mohri, and Rostamizadeh, 2009a, Corollary 7), and the same convergence rate is obtained in the case that the loss function is the $L_q$ loss, i.e., $\ell(y, y') = |y - y'|^q$. Thus, the derived rate is the same as that of S-disc, whereas the requirement on the loss function is slightly stronger than the one for S-disc in Theorem 4.

Note that the above difference in the theoretical guarantees does not come from an inherent difference between these estimators. This is because we adopted an analysis based on the Rademacher complexity, which has not been well studied in the context of unsupervised domain adaptation. The Rademacher complexity is a distribution-dependent complexity measure and is less pessimistic than the VC dimension used in previous work (Ben-David et al., 2010). In fact, the known guarantees on the proxy $\mathcal{A}$-distance and disc can be improved to Propositions 9 and 10 given below.

Proposition 9.

For any $\delta > 0$, with probability at least $1 - \delta$,

Proposition 10.

Assume the loss function $\ell$ is upper bounded by a constant $M > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

The proofs of these propositions are quite similar to that of Theorem 4 and are thus omitted.

In summary, under mild assumptions, the empirical estimators of the proxy $\mathcal{A}$-distance and disc converge at the same rate as that of S-disc.

Computational Complexity

Computation of the proxy $\mathcal{A}$-distance can be done by empirical risk minimization (Ben-David et al., 2010, Lemma 2). The original form is given with the 0-1 loss, which can be efficiently minimized with a surrogate loss. When the hinge loss is applied, the minimization can be carried out by the SMO algorithm (Platt, 1999).

On the other hand, no efficient algorithm is known for the computation of disc in the classification setting (a computation algorithm of disc for the 0-1 loss is given only in the one-dimensional case; Mansour, Mohri, and Rostamizadeh, 2009a, Section 5.2). For a fair comparison, we give a relatively efficient algorithm to compute disc (1) in the classification setting with the hinge loss, based on a semidefinite relaxation. Unfortunately, the computational complexity of the relaxed algorithm is still prohibitive compared with the computation of S-disc and the proxy $\mathcal{A}$-distance.

Experiments

Illustration

To clarify the advantage of the S-disc estimator in binary classification, we compared S-disc with the proxy $\mathcal{A}$-distance in a toy experiment.

We generated data points per class for each of two candidate source domains and a target domain from two-dimensional distributions (see Figure 1). In this experiment, we used the SVM with a linear kernel, implemented with scikit-learn (Pedregosa et al., 2011). For these data, we obtained the following results:

These values indicate that the two measures disagree: one source domain is judged better in terms of the proxy $\mathcal{A}$-distance, while the other is judged better in terms of S-disc for the given target domain. In this example, the target-domain loss of the classifier trained on the source preferred by S-disc is 0.0, while that of the classifier trained on the other source is 0.49. This implies that S-disc is the better discrepancy for measuring the quality of source domains for generalization in a given target domain.

Figure 1: 2D plots of three domains.

Comparison of Computational Time

Next, we compared the computational time required to estimate S-disc, the proxy $\mathcal{A}$-distance, and disc. We used 30 two-dimensional examples for both the source and target domains. For the computation of disc, we used the relaxed algorithm shown in (11) in the appendix (we used picos, https://picos-api.gitlab.io/picos/, as the SDP solver with the cvxopt, https://cvxopt.org/, backend). The computations were carried out on a 2.5GHz Intel Core i5. The results shown in Figure 2 demonstrate that the computational time of disc is prohibitive, as suggested in Table 1. Meanwhile, S-disc estimation is as efficient as the proxy $\mathcal{A}$-distance.

Figure 2: Comparison of computational time.

Empirical Convergence

In this experiment, we compared the empirical convergence of S-disc and the proxy $\mathcal{A}$-distance based on logistic regression implemented with scikit-learn with default parameters (Pedregosa et al., 2011). Note that disc is not used due to its computational intractability. Here, we used the MNIST dataset (LeCun, Cortes, and Burges, 2010), and the binary classification task is to separate odd and even digits. We defined the source and target domains as follows:

  • Source domain 1: MNIST (all digits)

  • Source domain 2: MNIST restricted to the digits from zero to seven

  • Target domain: MNIST (all digits)

The number of examples ranged over {1000, 2000, …, 20000} for each domain. Note that while the examples from source domain 1 are drawn from the same distribution as the target domain, the sample of source domain 2 is affected by sample selection bias (Cortes et al., 2008). Figure 3 shows the empirical convergence of the estimators of both discrepancies. Both discrepancies indicate that source domain 1 is a better source than source domain 2 for the target domain. However, since the true value of the discrepancy between the target domain and source domain 1 is supposed to be zero, we can observe that the proxy $\mathcal{A}$-distance converges much more slowly than S-disc.
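A hypothetical sketch of how such domains can be constructed in Python is given below; the dataset loader, subset sizes, and variable names are illustrative choices, not the paper's code.

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Load MNIST and turn the ten digits into an odd-vs-even binary task.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
digits = y.astype(int)
labels = np.where(digits % 2 == 1, 1, -1)        # odd -> +1, even -> -1

rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
X, digits, labels = X[idx], digits[idx], labels[idx]

n = 1000                                          # examples per domain (one grid point)
X_target = X[:n]                                  # target: all digits, labels unused
X_src1, y_src1 = X[n:2 * n], labels[n:2 * n]      # source 1: all digits
mask = digits[2 * n:] <= 7                        # source 2: digits 0-7 only
X_src2, y_src2 = X[2 * n:][mask][:n], labels[2 * n:][mask][:n]  # selection bias
```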

Figure 3: Empirical convergence of the proxy $\mathcal{A}$-distance and S-disc.
Figure 4: Source selection performance with varying noise rates. The proxy $\mathcal{A}$-distance cannot distinguish between clean and noisy sources for any noise rate. The blue dotted line denotes the maximum score (five).

Source Selection

Here, we compared the performance of S-disc and the proxy $\mathcal{A}$-distance on a source selection task. We defined the source domains and the target domain as follows:

  • Clean source domains: Five grayscale MNIST-M

  • Noisy source domains: Five grayscale MNIST-M corrupted by Gaussian random noise.

  • Target domain: MNIST

The MNIST-M dataset is known to be useful for domain adaptation when the target domain is MNIST (Ganin et al., 2016). The task in each domain is to classify even versus odd digits, and logistic regression with default parameters was used for computing S-disc and the proxy $\mathcal{A}$-distance. The objective of this source selection task is to correctly rank the five clean source domains above the noisy ones. We ranked the sources by computing the discrepancy between the target domain and each source and sorting them in ascending order. The score was calculated by counting how many clean sources are ranked among the first five. We varied the number of examples over {200, 400, …, 4000} for each domain. Gaussian noise was added at several noise rates, and the corrupted values were clipped back to the valid pixel range. For each number of examples per class, the experiment was repeated and the average score over the repetitions was reported. Figure 4 shows the performance of each discrepancy at the different noise rates. As the number of examples increased, S-disc achieved better performance. In contrast, the proxy $\mathcal{A}$-distance cannot distinguish between noisy and clean sources; in fact, it always returned one, which indicates that MNIST-M is judged to be unrelated to MNIST. Unlike the previous experiments, the difference between the domains is harder to identify here since the source domains and the target domain are not drawn from the same dataset (MNIST-M vs. MNIST). As a result, this experiment demonstrates the failure of the proxy $\mathcal{A}$-distance in the source selection task and suggests the practical advantage of S-disc.
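The selection rule itself is simple to state in code; a minimal sketch, assuming a generic discrepancy(target, source) callable (for instance, a wrapper around the estimate_s_disc sketch given earlier) and hypothetical container names:

```python
def source_selection_score(discrepancy, target, clean_sources, noisy_sources):
    """Rank all candidate sources by their discrepancy to the target (ascending)
    and count how many clean sources land among the first five."""
    candidates = ([("clean", s) for s in clean_sources]
                  + [("noisy", s) for s in noisy_sources])
    ranked = sorted(candidates, key=lambda c: discrepancy(target, c[1]))
    return sum(1 for kind, _ in ranked[:5] if kind == "clean")
```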

Conclusion

We proposed a novel discrepancy for unsupervised domain adaptation called source-guided discrepancy (S-disc). We provided a computationally efficient algorithm for the estimation of S-disc with respect to the 0-1 loss, and also derived the consistency and convergence rate of its estimator. Moreover, we derived a generalization error bound based on S-disc, which is tighter than that based on disc. Finally, we demonstrated the advantages of S-disc over the other discrepancies through experiments.

Acknowledgements

We thank Ikko Yamane, Futoshi Futami and Kento Nozawa for the useful discussion. NC was supported by MEXT scholarship. MS was supported by KAKENHI 17H01760.

References

  • Bartlett, Jordan, and McAuliffe (2006) Bartlett, P. L.; Jordan, M. I.; and McAuliffe, J. D. 2006. Convexity, classification, and risk bounds. Journal of the American Statistical Association 101(473):138–156.
  • Ben-David et al. (2007) Ben-David, S.; Blitzer, J.; Crammer, K.; and Pereira, F. 2007. Analysis of representations for domain adaptation. In NIPS, 137–144.
  • Ben-David et al. (2010) Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning 79(1-2):151–175.
  • Ben-David, Eiron, and Long (2003) Ben-David, S.; Eiron, N.; and Long, P. M. 2003. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences 66(3):496–514.
  • Bhatt, Rajkumar, and Roy (2016) Bhatt, H. S.; Rajkumar, A.; and Roy, S. 2016. Multi-source iterative adaptation for cross-domain classification. In IJCAI, 3691–3697.
  • Bubeck (2015) Bubeck, S. 2015. Convex Optimization: Algorithms and Complexity, volume 8. Now Publishers, Inc.
  • Cortes et al. (2008) Cortes, C.; Mohri, M.; Riley, M.; and Rostamizadeh, A. 2008. Sample selection bias correction theory. In ALT, 38–53.
  • Cortes, Mohri, and Muñoz Medina (2015) Cortes, C.; Mohri, M.; and Muñoz Medina, A. 2015. Adaptation algorithm and theory based on generalized discrepancy. In SIGKDD, 169–178.
  • Courty et al. (2017) Courty, N.; Flamary, R.; Habrard, A.; and Rakotomamonjy, A. 2017. Joint distribution optimal transportation for domain adaptation. In NIPS, 3730–3739.
  • Feldman et al. (2012) Feldman, V.; Guruswami, V.; Raghavendra, P.; and Wu, Y. 2012. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing 41(6):1558–1590.
  • Fujie and Kojima (1997) Fujie, T., and Kojima, M. 1997. Semidefinite programming relaxation for nonconvex quadratic programs. Journal of Global Optimization 10(4):367–380.
  • Ganin et al. (2016) Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1):2096–2030.
  • Glorot, Bordes, and Bengio (2011) Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 513–520.
  • Huang et al. (2007) Huang, J.; Gretton, A.; Borgwardt, K. M.; Schölkopf, B.; and Smola, A. J. 2007. Correcting sample selection bias by unlabeled data. In NIPS, 601–608.
  • Jiang and Zhai (2007) Jiang, J., and Zhai, C. 2007. Instance weighting for domain adaptation in NLP. In ACL, 264–271.
  • Kim and Kojima (2001) Kim, S., and Kojima, M. 2001. Second order cone programming relaxation of nonconvex quadratic optimization problems. Optimization methods and software 15(3-4):201–224.
  • LeCun, Cortes, and Burges (2010) LeCun, Y.; Cortes, C.; and Burges, C. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
  • Mansour, Mohri, and Rostamizadeh (2009a) Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009a. Domain adaptation: Learning bounds and algorithms. In COLT.
  • Mansour, Mohri, and Rostamizadeh (2009b) Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009b. Multiple source adaptation and the Rényi divergence. In UAI, 367–374.
  • McDiarmid (1989) McDiarmid, C. 1989. On the method of bounded differences. In Surveys in Combinatorics, 148–188.
  • Mohri and Medina (2012) Mohri, M., and Medina, A. M. 2012. New analysis and algorithm for learning with drifting distributions. In ALT, 124–138.
  • Mohri, Rostamizadeh, and Talwalkar (2012) Mohri, M.; Rostamizadeh, A.; and Talwalkar, A. 2012. Foundations of Machine Learning. MIT Press.
  • Pan and Yang (2010) Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359.
  • Pedregosa et al. (2011) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
  • Platt (1999) Platt, J. C. 1999. Sequential minimal optimization: A fast algorithm for training support vector machines. In Advances in Kernel Methods: Support Vector Learning. MIT Press. 185–208.
  • Saito et al. (2018) Saito, K.; Watanabe, K.; Ushiku, Y.; and Harada, T. 2018. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR.
  • Saito, Ushiku, and Harada (2017) Saito, K.; Ushiku, Y.; and Harada, T. 2017. Asymmetric tri-training for unsupervised domain adaptation. In ICML, 2988–2997.
  • Sugiyama et al. (2008) Sugiyama, M.; Suzuki, T.; Nakajima, S.; Kashima, H.; von Bünau, P.; and Kawanabe, M. 2008. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics 60(4):699–746.
  • Sun et al. (2017) Sun, S.; Zhang, B.; Xie, L.; and Zhang, Y. 2017. An unsupervised deep domain adaptation approach for robust speech recognition. Neurocomputing 257:79–87.
  • Zhang, Zhang, and Ye (2012) Zhang, C.; Zhang, L.; and Ye, J. 2012. Generalization bounds for domain adaptation. In NIPS, 3320–3328.
  • Zhao et al. (2017) Zhao, H.; Zhang, S.; Wu, G.; Costeira, J. P.; Moura, J. M. F.; and Gordon, G. J. 2017. Multiple source domain adaptation with adversarial training of neural networks. CoRR abs/1705.09684.

Appendix A Proof of Theorem 2

Proof.

First, the following equality holds:

And,

Then, for a symmetric hypothesis class , the following equality holds:

Next, by the definition of , we have