
PAC-Bayes and Domain Adaptation

07/17/2017
by Pascal Germain, et al.

We provide two main contributions in PAC-Bayesian theory for domain adaptation, where the objective is to learn, from a source distribution, a well-performing majority vote on a different, but related, target distribution. First, we propose an improvement of the approach we introduced in Germain et al. (2013), which relies on a novel distribution pseudodistance based on a disagreement averaging, allowing us to derive a new, tighter domain adaptation bound for the target risk. While this bound stands in the spirit of common domain adaptation works, we derive a second bound (recently introduced in Germain et al., 2016) that brings a new perspective on domain adaptation: it upper-bounds the target risk in terms of a divergence between the distributions, expressed as a ratio, that controls the trade-off between a source error measure and the target voters' disagreement. We discuss and compare both results, from which we obtain PAC-Bayesian generalization bounds. Furthermore, from the PAC-Bayesian specialization to linear classifiers, we infer two learning algorithms, and we evaluate them on real data.


1 Introduction

As human beings, we learn from what we have seen before. Think about our education process: when a student attends a new course, the knowledge acquired from previous courses helps them to understand the current one. However, traditional machine learning approaches assume that the learning and test data are drawn from the same probability distribution. This assumption may be too strong for many real-world tasks, in particular those where we want to reuse a model from one task on another one. For instance, a spam filtering system suitable for one user can be poorly adapted to another who receives significantly different emails. In other words, the learning data associated with one or several users could be unrepresentative of the test data coming from another one. This highlights the need to design methods for adapting a classifier from learning (source) data to test (target) data. One solution to tackle this issue is to consider the domain adaptation framework (the reader can refer to the surveys of JiangSurvey08; Quionero-Candela:2009; Margolis2011; wang2018deep; domain adaptation is often associated with transfer learning, Pan-TL-TKDE09), which arises when the distribution generating the target data (the target domain) differs from the one generating the source data (the source domain). Note that it is well known that domain adaptation is a hard and challenging task, even under strong assumptions (David-AISTAT10; BenDavid12; Ben-DavidU14).

Many approaches exist in the literature to address domain adaptation, often with the same underlying idea: if we are able to apply a transformation that "moves the distributions closer" to each other, then we can learn a model with the available labels. This process can be performed by reweighting the importance of labeled data (HuangSGBS-nips06; Sugiyama-NIPS07; CortesMM10; cortes-15). This is one of the most popular methods for dealing with the covariate-shift issue (e.g., HuangSGBS-nips06; sugiyama2008direct), where source and target domains diverge only in their marginals, i.e., when they share the same labeling function. Another technique is to exploit self-labeling procedures, where the objective is to transfer the source labels to the target unlabeled points (e.g., BruzzoneM10S; habrard2013iterative; PVMinCq). A third solution is to learn a new common representation space from the unlabeled parts of the source and target data; a standard supervised learning algorithm can then be run on the source labeled instances (e.g., glorot2011domain; Chen12; courty2016optimal; ganin-16; CourtyFTR17). A slightly different approach, known as hypothesis transfer learning, aims at directly transferring the learned source model to the target domain (kuzborskij2013stability; ThesisKuzborskij2018theory).

The work presented in this paper stands in a fifth popular class of approaches, which has been especially explored to derive generalization bounds for domain adaptation. This kind of approach relies on the control of a measure of divergence/distance between the source distribution and the target distribution (e.g., BenDavid-NIPS06; BenDavid-MLJ2010; li2007bayesian; Zhang12; morvant12; CortesM14; redko2017theoretical). Such a distance usually depends on the set of hypotheses considered by the learning algorithm. The intuition is that one must look for a set that minimizes the distance between the distributions while preserving good performance on the source data; if the distributions are close under this measure, then generalization ability may be "easier" to quantify. In fact, defining such a measure to quantify how much the domains are related is a major issue in domain adaptation. For example, for binary classification with the 0-1 loss function, BenDavid-NIPS06; BenDavid-MLJ2010 have considered the HΔH-divergence between the source and target marginal distributions. This quantity depends on the maximal disagreement between two classifiers, and allowed them to deduce a domain adaptation generalization bound based on the VC-dimension theory. The discrepancy distance proposed by Mansour-COLT09 generalizes this divergence to real-valued functions and more general losses, and is used to obtain a generalization bound based on the Rademacher complexity. In this context, CortesM11; CortesM14 have specialized the minimization of the discrepancy to regression with kernels. In these situations, domain adaptation can be viewed as a multiple trade-off between the complexity of the hypothesis class H, the adaptation ability of H according to the divergence between the marginals, and the empirical source risk. Moreover, other measures have been exploited under different assumptions, such as the Rényi divergence suitable for importance weighting (MansourMR09), the measure proposed by Zhang12 which takes into account the source and target true labeling, the Bayesian divergence prior (li2007bayesian) which favors classifiers closer to the best source model, or the Wasserstein distance (redko2017theoretical) that justifies the usefulness of optimal transport strategies in domain adaptation (courty2016optimal; CourtyFTR17). However, a majority of methods prefer to perform a two-step approach: (i) first construct a suitable representation by minimizing the divergence, then (ii) learn a model on the source domain in the new representation space.

Given the multitude of concurrent approaches for domain adaptation, and the nonexistence of a predominant one, we believe that the problem still needs to be studied from different perspectives for a global comprehension to emerge. We aim to contribute to this study from a PAC-Bayesian standpoint. One particularity of the PAC-Bayesian theory (first set out by Mcallester99a) is that it focuses on algorithms that output a posterior distribution ρ over a classifier set H (i.e., a ρ-average over H) rather than just a single predictor (as in BenDavid-NIPS06, and other works cited above). More specifically, we tackle the unsupervised domain adaptation setting for binary classification, where no target labels are provided to the learner. We propose two domain adaptation analyses, both introduced separately in previous conference papers (pbda; dalc). We refine these results, and provide an in-depth comparison, full proofs, and technical details. Our analyses highlight different angles that one can adopt when studying domain adaptation.

Our first approach follows the philosophy of the seminal works of BenDavid-NIPS06; BenDavid-MLJ2010 and MansourMR09: the risk of the target model is upper-bounded jointly by the model's risk on the source distribution, a divergence between the marginal distributions, and a non-estimable term related to the ability to adapt in the current space (more precisely, this term can only be estimated in the presence of labeled data from both the source and the target domains). To obtain such a result, we define a pseudometric that is ideal for the PAC-Bayesian setting, as it evaluates the domains' divergence according to the ρ-average disagreement of the classifiers over the domains. Additionally, we prove that this domains' divergence is always lower than the popular HΔH-divergence, and is easily estimable from samples. Note that, based on this disagreement measure, we derived in a previous work (pbda) a first PAC-Bayesian domain adaptation bound expressed as a ρ-averaging. We provide here a new version of this result that does not change the underlying philosophy supported by the previous bound, but clearly improves the theoretical result: the domain adaptation bound is now tighter and easier to interpret.

Our second analysis (introduced in dalc) consists of a target risk bound that brings an original way to think about domain adaptation problems. Concretely, the risk of the target model is still upper-bounded by three terms, but they differ in the information they capture. The first term is estimable from unlabeled data and relies on the disagreement of the classifiers on the target domain only. The second term depends on the expected accuracy of the classifiers on the source domain. Interestingly, the latter is weighted by a divergence between the source and the target domains that enables controlling the relationship between the domains. The third term estimates the "volume" of the target domain living apart from the source one, which has to be small for ensuring adaptation (here we do not focus on learning a new representation to help the adaptation: we directly aim at adapting in the current representation space).

Thanks to these results, we derive PAC-Bayesian generalization bounds for our two domain adaptation bounds. Then, in contrast to the majority of methods that perform a two-step procedure, we design two algorithms tailored to linear classifiers, called pbda and dalc, which jointly minimize the multiple trade-offs implied by the bounds. On the one hand, pbda is inspired by our first analysis, in which the first two quantities are, as usual in the PAC-Bayesian approach, the complexity of the ρ-weighted majority vote measured by a Kullback-Leibler divergence and the empirical risk measured by the ρ-average errors on the source sample. The third quantity corresponds to our domains' divergence and assesses the capacity of the posterior distribution to distinguish some structural difference between the source and target samples. On the other hand, dalc is inspired by our second analysis, from which we deduce that a good adaptation strategy consists in finding a ρ-weighted majority vote leading to a suitable trade-off, controlled by the domains' divergence, between the first two terms (and the usual Kullback-Leibler divergence): minimizing the first one amounts to looking for classifiers that disagree on the target domain, and minimizing the second one to seeking accurate classifiers on the source domain.

The rest of the paper is structured as follows. Section 2 deals with two seminal works on domain adaptation. The PAC-Bayesian framework is then recalled in Section 3. Note that, for the sake of completeness, we provide for the first time the explicit derivation of the algorithm PBGD3 (germain2009pac) tailored to linear classifiers in supervised learning. Our main contribution, which consists of two domain adaptation bounds suitable for PAC-Bayesian learning, is presented in Section 4; the associated generalization bounds are derived in Section 5. Then, we design our new algorithms for PAC-Bayesian domain adaptation in Section 6, which we evaluate experimentally in Section 7. We conclude in Section 8.

2 Domain Adaptation Related Works

In this section, we review the two seminal works in domain adaptation that are based on a divergence measure between the domains BenDavid-MLJ2010; BenDavid-NIPS06; Mansour-COLT09.

2.1 Notations and Setting

We consider domain adaptation for binary classification tasks where X ⊆ ℝ^d is the input space of dimension d, and Y = {−1, +1} is the output/label set. The source domain S and the target domain T are two different distributions (unknown and fixed) over X × Y, with S_X and T_X being the respective marginal distributions over X. We tackle the challenging task where we have no target labels, known as unsupervised domain adaptation. A learning algorithm is then provided with a labeled source sample S = {(x_i, y_i)}_{i=1}^{m} consisting of m examples drawn i.i.d. (independent and identically distributed) from S, and an unlabeled target sample T = {x′_j}_{j=1}^{m′} consisting of m′ examples drawn i.i.d. from T_X. We denote the distribution of an m-sample drawn from a domain D by (D)^m. We suppose that H is a set of hypothesis functions from X to Y. The expected source error and the expected target error of h ∈ H over S, respectively T, are the probability that h errs on the entire distribution S, respectively T:

R_S(h) = E_{(x,y)∼S} L_{01}(h(x), y),  and  R_T(h) = E_{(x,y)∼T} L_{01}(h(x), y),

where L_{01} is the 0-1 loss function, which returns 1 if h(x) ≠ y and 0 otherwise. The empirical source error of h on the learning source sample S is

R̂_S(h) = (1/m) Σ_{i=1}^{m} L_{01}(h(x_i), y_i).

The main objective in domain adaptation is then to learn, without target labels, a classifier h ∈ H leading to the lowest expected target error R_T(h).

Given two classifiers (h, h′) ∈ H², we also introduce the notion of expected source disagreement d_{S_X}(h, h′) and expected target disagreement d_{T_X}(h, h′), which measure the probability that h and h′ do not agree on the respective marginal distributions, and are defined by

d_{S_X}(h, h′) = E_{x∼S_X} L_{01}(h(x), h′(x)),  and  d_{T_X}(h, h′) = E_{x∼T_X} L_{01}(h(x), h′(x)).

The empirical source disagreement on S and the empirical target disagreement on T are

d̂_S(h, h′) = (1/m) Σ_{i=1}^{m} L_{01}(h(x_i), h′(x_i)),  and  d̂_T(h, h′) = (1/m′) Σ_{j=1}^{m′} L_{01}(h(x′_j), h′(x′_j)).

Note that, depending on the context, S denotes either the labeled source sample or its unlabeled part {x_i}_{i=1}^{m}. We can remark that the expected error on a distribution D can be viewed as a shortcut notation for the expected disagreement between a hypothesis h and a labeling function f_D : X → Y that assigns the true label to an example description with respect to D. We have

R_D(h) = d_{D_X}(h, f_D) = E_{x∼D_X} L_{01}(h(x), f_D(x)).
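For concreteness, these empirical quantities are one-liners; a minimal sketch (ours, not from the paper), assuming each hypothesis is a vectorized function mapping an array of examples to labels in {−1, +1}:

```python
import numpy as np

def empirical_error(h, X, y):
    """Empirical 0-1 error: fraction of sample points on which h errs."""
    return np.mean(h(X) != y)

def empirical_disagreement(h1, h2, X):
    """Empirical disagreement: fraction of (unlabeled) points where h1
    and h2 predict different labels."""
    return np.mean(h1(X) != h2(X))
```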

2.2 Necessity of a Domains’ Divergence

The domain adaptation objective is to find a low-error target hypothesis, even if the target labels are not available. Even under strong assumptions, this task can be impossible to solve (David-AISTAT10; BenDavid12; Ben-DavidU14). However, for deriving generalization guarantees in a domain adaptation situation (with the help of a domain adaptation bound), it is critical to make use of a divergence between the source and the target domains: the more similar the domains, the easier the adaptation appears. Some previous works have proposed different quantities to estimate how close a domain is to another one (BenDavid-NIPS06; li2007bayesian; Mansour-COLT09; MansourMR09; BenDavid-MLJ2010; Zhang12). Concretely, two domains S and T differ if their marginals S_X and T_X are different, if the source labeling function differs from the target one, or if both happen. This suggests taking into account two divergences: one between S_X and T_X, and one between the labeling functions. If we have some target labels, we can combine the two distances as done by Zhang12. Otherwise, we preferably consider two separate measures, since it is impossible to estimate the best target hypothesis in such a situation. Usually, we suppose that the source labeling function is somehow related to the target one, and we then look for a representation where the marginals S_X and T_X appear closer without losing performance on the source domain.

2.3 Domain Adaptation Bounds for Binary Classification

We now review the two seminal works which propose domain adaptation bounds based on a divergence between the two domains.

First, under the assumption that there exists a hypothesis in H that performs well on both the source and the target domain, BenDavid-NIPS06; BenDavid-MLJ2010 have provided the following domain adaptation bound.

Theorem 1 (BenDavid-MLJ2010; BenDavid-NIPS06)

Let H be a symmetric hypothesis class (in a symmetric hypothesis space H, for every h ∈ H, its inverse −h is also in H). We have

∀h ∈ H,  R_T(h) ≤ R_S(h) + (1/2) d_{HΔH}(S_X, T_X) + λ,   (1)

where

d_{HΔH}(S_X, T_X) = 2 sup_{(h,h′)∈H²} | d_{T_X}(h, h′) − d_{S_X}(h, h′) |

is the HΔH-distance between the marginals S_X and T_X, and

λ = min_{h∈H} ( R_S(h) + R_T(h) )

is the error of the best hypothesis overall.

This bound relies on three terms. The first term is the classical source domain expected error. The second term depends on H and corresponds to the maximum deviation between the source and target disagreements over pairs of hypotheses of H. In other words, it quantifies how hypotheses from H can "detect" differences between these marginals: the lower this measure is for a given H, the better the generalization guarantees are. The last term λ is related to the best hypothesis over the domains and acts as a quality measure of H in terms of labeling information. If the best hypothesis does not perform well on both the source and the target domain, then there is no way one can adapt from this source to this target. Hence, as pointed out by the authors, Equation (1) expresses a multiple trade-off between the accuracy of some particular hypothesis h, the complexity of H (quantified in BenDavid-MLJ2010 with the usual VC-bound theory), and the "incapacity" of hypotheses of H to detect differences between the source and the target domain.
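As an illustration of the second term, here is a sketch (ours) of the empirical HΔH-distance restricted to a finite pool of hypotheses, each represented by its vote vector on the source and target samples:

```python
import numpy as np

def pairwise_disagreements(votes):
    """Empirical disagreement matrix P_x[h_i(x) != h_j(x)], given the
    matrix votes[i, j] = h_i(x_j) with entries in {-1, +1}."""
    return (1.0 - votes @ votes.T / votes.shape[1]) / 2.0

def h_delta_h_distance(votes_src, votes_tgt):
    """Empirical H-Delta-H distance over a finite hypothesis pool: twice
    the maximal source/target deviation of pairwise disagreements."""
    return 2.0 * np.abs(pairwise_disagreements(votes_tgt)
                        - pairwise_disagreements(votes_src)).max()
```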

Second, Mansour-COLT09 have extended the HΔH-distance to the discrepancy divergence for regression and any symmetric loss ℓ fulfilling the triangle inequality. Given such a loss, the discrepancy between S_X and T_X is

disc_ℓ(S_X, T_X) = sup_{(h,h′)∈H²} | E_{x∼T_X} ℓ(h(x), h′(x)) − E_{x∼S_X} ℓ(h(x), h′(x)) |.

Note that, with the 0-1 loss in binary classification, we have

d_{HΔH}(S_X, T_X) = 2 disc_{L01}(S_X, T_X).

Even if these two divergences may coincide, the following domain adaptation bound of Mansour-COLT09 differs from Theorem 1.

Theorem 2 (Mansour-COLT09)

Let H be a (symmetric) hypothesis class. We have

∀h ∈ H,  R_T(h) − R_T(h*_T) ≤ disc_{L01}(S_X, T_X) + d_{S_X}(h, h*_S) + ν,   (2)

with ν = d_{T_X}(h*_T, h*_S) the disagreement between the ideal hypotheses on the target and source domains:

h*_T = argmin_{h∈H} R_T(h)  and  h*_S = argmin_{h∈H} R_S(h).

Equation (2) can be tighter than Equation (1), since it bounds the difference between the target error of a classifier and the one of the optimal h*_T (Equation (1) can lead to an error term three times higher than Equation (2) in some cases; more details in Mansour-COLT09). Based on Theorem 2 and a Rademacher complexity analysis, Mansour-COLT09 provide a generalization bound on the target risk that expresses a trade-off between the disagreement between h and the best source hypothesis h*_S, the complexity of H, and, again, the "incapacity" of hypotheses to detect differences between the domains.

To conclude, the domain adaptation bounds of Theorems 1 and 2 suggest that, if the divergence between the domains is low, a low-error classifier over the source domain might perform well on the target one. These divergences compute the worst case of the disagreement between a pair of hypotheses. We propose in Section 4 two average-case approaches by making use of the essence of the PAC-Bayesian theory, which is known to offer tight generalization bounds (Mcallester99a; germain2009pac; Parrado-Hernandez12). Our first approach (see Section 4.1) stands in the philosophy of these seminal works, and the second one (see Section 4.2) brings a different and novel point of view by taking advantage of the PAC-Bayesian framework, which we recall in the next section.

3 PAC-Bayesian Theory in Supervised Learning

Let us now review the classical PAC-Bayesian theory for supervised binary classification, first introduced by Mcallester99a. This theory succeeds in providing tight generalization guarantees, without relying on any validation set, on weighted majority votes, i.e., for ensemble methods (dietterich2000ensemble; re2012ensemble) where several classifiers (or voters) are assigned a specific weight. Throughout this section, we adopt an algorithm design perspective. Indeed, the PAC-Bayesian analysis of domain adaptation provided in the forthcoming sections is oriented by the motivation of creating new adaptive algorithms.

3.1 Notations and Setting

Traditionally, PAC-Bayesian theory considers weighted majority votes over a set H of binary hypotheses, often called voters. Let D be a fixed yet unknown distribution over X × Y, and let S = {(x_i, y_i)}_{i=1}^{m} be a learning set where each example is drawn i.i.d. from D. Then, given a prior distribution π over H (independent from the learning set S), the "PAC-Bayesian" learner aims at finding a posterior distribution ρ over H leading to a ρ-weighted majority vote B_ρ (also called the Bayes classifier) with good generalization guarantees, defined by

B_ρ(x) = sign[ E_{h∼ρ} h(x) ].

However, minimizing the risk of B_ρ, defined as

R_D(B_ρ) = E_{(x,y)∼D} L_{01}(B_ρ(x), y),

is known to be NP-hard. To tackle this issue, the PAC-Bayesian approach deals with the risk of the stochastic Gibbs classifier G_ρ, associated with ρ and closely related to B_ρ. In order to predict the label of an example x, the Gibbs classifier first draws a hypothesis h from H according to ρ, then returns h(x) as the label. Then, the error of the Gibbs classifier on a domain D corresponds to the ρ-expectation of the errors over D:

R_D(G_ρ) = E_{h∼ρ} R_D(h).   (3)

In this setting, if B_ρ misclassifies x, then at least half of the classifiers (under ρ) err on x. Hence, we have

R_D(B_ρ) ≤ 2 R_D(G_ρ).

Another result on the relation between R_D(B_ρ) and R_D(G_ρ) is the C-bound of Lacasse07, expressed as

R_D(B_ρ) ≤ 1 − ( 1 − 2 R_D(G_ρ) )² / ( 1 − 2 d_{D_X}(ρ) ),   (4)

where d_{D_X}(ρ) corresponds to the expected disagreement of the classifiers over the marginal D_X:

d_{D_X}(ρ) = E_{(h,h′)∼ρ²} E_{x∼D_X} L_{01}(h(x), h′(x)).   (5)

Equation (4) suggests that, for a fixed numerator, i.e., a fixed risk of the Gibbs classifier, the best ρ-weighted majority vote is the one associated with the lowest denominator, i.e., with the greatest disagreement between its voters (for further analysis, see graal-neverending).

We now introduce the notion of expected joint error e_D(ρ) of a pair of classifiers drawn according to the distribution ρ, defined as

e_D(ρ) = E_{(h,h′)∼ρ²} E_{(x,y)∼D} L_{01}(h(x), y) × L_{01}(h′(x), y).   (6)

From the definitions of the expected disagreement and the joint error, Lacasse07; graal-neverending observed that, given a domain D on X × Y and a distribution ρ on H, we can decompose the Gibbs risk as

R_D(G_ρ) = (1/2) d_{D_X}(ρ) + e_D(ρ).   (7)

Indeed, since the 0-1 loss takes values in {0, 1}, we have, for any y ∈ Y,

L_{01}(h(x), h′(x)) = L_{01}(h(x), y) + L_{01}(h′(x), y) − 2 L_{01}(h(x), y) L_{01}(h′(x), y),

and therefore

(1/2) d_{D_X}(ρ) + e_D(ρ) = E_{(h,h′)∼ρ²} E_{(x,y)∼D} (1/2) [ L_{01}(h(x), y) + L_{01}(h′(x), y) ] = R_D(G_ρ).
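The decomposition of Equation (7) can be checked numerically; a quick sketch (ours), with a uniform posterior over a few random voters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5                                  # n examples, k voters
votes = rng.choice([-1, 1], size=(k, n))        # votes[i, j] = h_i(x_j)
y = rng.choice([-1, 1], size=n)                 # labels
rho = np.full(k, 1.0 / k)                       # uniform posterior

gibbs = sum(rho[i] * np.mean(votes[i] != y) for i in range(k))
dis = sum(rho[i] * rho[j] * np.mean(votes[i] != votes[j])
          for i in range(k) for j in range(k))
joint = sum(rho[i] * rho[j] * np.mean((votes[i] != y) & (votes[j] != y))
            for i in range(k) for j in range(k))

assert np.isclose(gibbs, dis / 2.0 + joint)     # Equation (7)
```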

Lastly, PAC-Bayesian theory allows one to bound the expected error R_D(G_ρ) in terms of two major quantities: the empirical error

R̂_S(G_ρ) = E_{h∼ρ} (1/m) Σ_{i=1}^{m} L_{01}(h(x_i), y_i),

estimated on a sample S ∼ (D)^m, and the Kullback-Leibler divergence

KL(ρ‖π) = E_{h∼ρ} ln( ρ(h) / π(h) ).

We present in the next section the PAC-Bayesian theorem proposed by catoni2007pac. (Two other common forms of the PAC-Bayesian theorem are the one of Mcallester99a and the one of Seeger02; Langford05. We refer the reader to our research report (pbda_long) for a larger variety of PAC-Bayesian theorems in a domain adaptation context.)

3.2 A Usual PAC-Bayesian Theorem

Usual PAC-Bayesian theorems suggest that, in order to minimize the expected risk, a learning algorithm should perform a trade-off between the empirical risk minimization and the KL-divergence minimization (roughly speaking, the complexity term). The nature of this trade-off can be explicitly controlled in Theorem 3 below. This PAC-Bayesian result, first proposed by catoni2007pac, is defined with a hyperparameter (here named c). It appears to be a natural tool to design PAC-Bayesian algorithms. We present this result in the simplified form suggested by germain09b.

Theorem 3 (catoni2007pac)

For any domain D over X × Y, for any set of hypotheses H, any prior distribution π over H, any δ ∈ (0, 1], and any real number c > 0, with a probability at least 1 − δ over the random choice of S ∼ (D)^m, for every posterior distribution ρ on H, we have

R_D(G_ρ) ≤ ( c / (1 − e^{−c}) ) [ R̂_S(G_ρ) + ( KL(ρ‖π) + ln(1/δ) ) / (m c) ].   (8)

Similarly to mcallester-keshet-11, we could choose to restrict c to (0, 2) to obtain a slightly looser but simpler bound. Using 1 − e^{−c} ≥ c − c²/2 to upper-bound the factor c/(1 − e^{−c}) on the right-hand side of Equation (8), we obtain

R_D(G_ρ) ≤ ( 1 / (1 − c/2) ) [ R̂_S(G_ρ) + ( KL(ρ‖π) + ln(1/δ) ) / (m c) ].   (9)

The bound of Theorem 3, in both forms of Equations (8) and (9), has two appealing characteristics. First, choosing c of order 1/√m, the bound becomes consistent: it converges to the empirical Gibbs risk R̂_S(G_ρ) as m grows. Second, as described in Section 3.3, its minimization is closely related to the minimization problem associated with the Support Vector Machine (svm) algorithm when ρ is an isotropic Gaussian over the space of linear classifiers H (germain2009pac). Hence, the value c allows us to control the trade-off between the empirical risk R̂_S(G_ρ) and the "complexity term" KL(ρ‖π).
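For concreteness, the value of the bound is direct to compute; a sketch (ours), assuming the c/(1 − e^{−c}) form of Equation (8) reconstructed above:

```python
from math import exp, log

def catoni_bound(gibbs_emp_risk, kl, m, c, delta=0.05):
    """Right-hand side of Equation (8): converges to the empirical Gibbs
    risk when c is of order 1/sqrt(m) and m grows."""
    return (c / (1.0 - exp(-c))) * (
        gibbs_emp_risk + (kl + log(1.0 / delta)) / (m * c))
```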

3.3 Supervised PAC-Bayesian Learning of Linear Classifiers

Let us consider H as a set of linear classifiers in a d-dimensional space. Each h_w ∈ H is defined by a weight vector w ∈ ℝ^d:

h_w(x) = sign( w · x ),

where w · x denotes the dot product between w and x.

By restricting the prior and the posterior distributions over H to be Gaussian distributions, Langford02 have specialized the PAC-Bayesian theory in order to bound the expected risk of any linear classifier h_w. More precisely, given a prior π_0 and a posterior ρ_w defined as spherical Gaussians with identity covariance matrix, respectively centered on vectors 0 and w, for any h_{w′} ∈ H, we have

π_0(h_{w′}) = (2π)^{−d/2} exp( −(1/2) ‖w′‖² ),  and  ρ_w(h_{w′}) = (2π)^{−d/2} exp( −(1/2) ‖w′ − w‖² ).

An interesting property of these distributions (also seen as multivariate normal distributions, π_0 = N(0, I) and ρ_w = N(w, I)) is that the prediction of the ρ_w-weighted majority vote B_{ρ_w} coincides with the one of the linear classifier h_w. Indeed, we have

B_{ρ_w}(x) = sign[ E_{h_{w′}∼ρ_w} h_{w′}(x) ] = sign( w · x ) = h_w(x).

Moreover, the expected risk of the Gibbs classifier G_{ρ_w} on a domain D is then given by (the calculations leading to Equation (10) can be found in Langford05; for the sake of completeness, we provide a slightly different derivation in Appendix B)

R_D(G_{ρ_w}) = E_{(x,y)∼D} Φ( y (w · x) / ‖x‖ ),   (10)

where

Φ(a) = (1/2) [ 1 − Erf( a / √2 ) ],   (11)

with Erf the Gauss error function, defined as

Erf(a) = (2/√π) ∫_0^a exp(−t²) dt.   (12)

Here, Φ can be seen as a smooth surrogate of the 0-1 loss function relying on the normalized margin y (w · x)/‖x‖. This function is sometimes called the probit loss (e.g., mcallester-keshet-11). It is worth noting that the norm ‖w‖ plays an important role on the value of R_D(G_{ρ_w}), but not on R_D(B_{ρ_w}) = R_D(h_w). Indeed, R_D(G_{ρ_w}) tends to R_D(h_w) as ‖w‖ grows, which can provide very tight bounds (see the empirical analyses of AmbroladzePS06; germain2009pac). Finally, the KL-divergence between ρ_w and π_0 becomes simply

KL(ρ_w ‖ π_0) = (1/2) ‖w‖²,

and turns out to be a measure of complexity of the learned classifier.
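Equations (10)-(12) translate directly into code; a minimal sketch (ours), assuming NumPy and SciPy:

```python
import numpy as np
from scipy.special import erf

def probit_loss(a):
    """Phi(a) of Equation (11): probability that a classifier drawn from
    N(w, I) errs, given the normalized margin a = y <w, x> / ||x||."""
    return 0.5 * (1.0 - erf(a / np.sqrt(2.0)))

def gibbs_risk(w, X, y):
    """Empirical counterpart of Equation (10) on a labeled sample (X, y)."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)
    return float(np.mean(probit_loss(margins)))
```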

3.3.1 Objective Function and Gradient

Based on the specialization of the PAC-Bayesian theory to linear classifiers, germain2009pac suggested minimizing a PAC-Bayesian bound on R_D(G_{ρ_w}). For the sake of completeness, we provide here more mathematical details than in the original conference paper (germain2009pac). In the forthcoming Section 6, we will extend this supervised learning algorithm to the domain adaptation setting.

Given a sample S and a hyperparameter C > 0, the learning algorithm performs a gradient descent in order to find an optimal weight vector w that minimizes

f(w) = C Σ_{i=1}^{m} Φ( y_i (w · x_i) / ‖x_i‖ ) + (1/2) ‖w‖².   (13)

It turns out that the optimal vector w corresponds to the distribution ρ_w minimizing the value of the bound on R_D(G_{ρ_w}) given by Theorem 3, with the parameter c of the theorem being the hyperparameter C of the learning algorithm. It is important to point out that PAC-Bayesian theorems bound R_D(G_ρ) simultaneously for every posterior ρ on H. Therefore, one can "freely" explore the domain of the objective function f to choose a posterior distribution ρ_w that gives, thanks to Theorem 3, a bound valid with probability 1 − δ.

The minimization of Equation (13) by gradient descent corresponds to the learning algorithm called PBGD3 of germain2009pac. The gradient of f is given by the vector

∇f(w) = C Σ_{i=1}^{m} Φ′( y_i (w · x_i) / ‖x_i‖ ) ( y_i x_i / ‖x_i‖ ) + w,

where Φ′(a) = −(1/√(2π)) exp(−a²/2) is the derivative of Φ at point a.

Similarly to SVM, the learning algorithm PBGD3 realizes a trade-off between the empirical risk, expressed by the loss Φ, and the complexity of the learned linear classifier, expressed by the regularizer ‖w‖². This similarity increases when we use a kernel function, as described next.
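A minimal gradient-descent sketch of this procedure (ours; the learning rate, iteration count, and single random start are arbitrary simplifications, as PBGD3 uses many random restarts to cope with the non-convexity discussed below):

```python
import numpy as np

SQRT_2PI = np.sqrt(2.0 * np.pi)

def phi_prime(a):
    """Derivative of the probit loss: Phi'(a) = -exp(-a^2/2) / sqrt(2*pi)."""
    return -np.exp(-a * a / 2.0) / SQRT_2PI

def pbgd3(X, y, C=1.0, lr=0.01, n_iter=1000, seed=0):
    """Gradient descent on Equation (13):
    f(w) = C * sum_i Phi(y_i <w, x_i> / ||x_i||) + ||w||^2 / 2."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])                     # one random start
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalized inputs
    for _ in range(n_iter):
        margins = y * (Xn @ w)
        grad = C * ((phi_prime(margins) * y) @ Xn) + w  # gradient of f
        w -= lr * grad
    return w
```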

3.3.2 Using a Kernel Function

The kernel trick allows us to substitute inner products by a kernel function k : X × X → ℝ in Equation (13). If k is a Mercer kernel, it implicitly represents a function φ that maps an example of X into an arbitrary N-dimensional space, such that

k(x, x′) = φ(x) · φ(x′).

Then, a dual weight vector α = (α_1, …, α_m) encodes the linear classifier w as a linear combination of the examples of S:

w = Σ_{i=1}^{m} α_i φ(x_i).

By the representer theorem (scholkopf-01), the vector w minimizing Equation (13) can be recovered by finding the dual vector α that minimizes

f(α) = C Σ_{i=1}^{m} Φ( y_i Σ_{j=1}^{m} α_j K_{ij} / √(K_{ii}) ) + (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j K_{ij},   (14)

where K is the kernel matrix of size m × m, with K_{ij} = k(x_i, x_j). (It is non-trivial to show that the kernel trick holds when π and ρ are Gaussian over an infinite-dimensional feature space. As mentioned by mcallester-keshet-11, it is, however, the case provided we consider Gaussian processes as measures of the distributions π and ρ over the (infinite) set H. The same analysis holds for the kernelized versions of the two forthcoming domain adaptation algorithms of Section 6.3.3.) The gradient of f is simply given by the vector of partial derivatives

∂f/∂α_i = C Σ_{j=1}^{m} Φ′( y_j Σ_{l=1}^{m} α_l K_{jl} / √(K_{jj}) ) ( y_j K_{ji} / √(K_{jj}) ) + Σ_{j=1}^{m} α_j K_{ij},

for i ∈ {1, …, m}.
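The kernelized objective of Equation (14) in the same style (a self-contained sketch under the reconstruction above):

```python
import numpy as np
from scipy.special import erf

def kernel_objective(alpha, K, y, C):
    """f(alpha) of Equation (14): probit loss on the kernel margins
    y_i (K alpha)_i / sqrt(K_ii), plus regularizer alpha^T K alpha / 2."""
    margins = y * (K @ alpha) / np.sqrt(np.diag(K))
    probit = 0.5 * (1.0 - erf(margins / np.sqrt(2.0)))  # Phi, Equation (11)
    return C * probit.sum() + 0.5 * alpha @ K @ alpha
```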

3.3.3 Improving the Algorithm Using a Convex Objective

An annoying drawback of PBGD3 is that the objective function is non-convex, and the gradient descent implementation needs many random restarts. In fact, we conducted extensive empirical experiments after the ones described by germain2009pac and saw that PBGD3 achieves an equivalent accuracy (and at a fraction of the running time) by replacing the loss function Φ of Equations (13) and (14) by its convex relaxation

Φ_cvx(a) = max( Φ(a), 1/2 − a/√(2π) ).   (15)

The derivative of Φ_cvx at point a is then Φ′_cvx(a) = Φ′(a) if a > 0, and −1/√(2π) otherwise. Figure 1(a) illustrates the functions Φ and Φ_cvx. Note that the latter can be interpreted as a smooth version of the svm's hinge loss, max(0, 1 − a). The toy experiment of Figure 1(d) (described in the next subsection) provides another empirical evidence that the minima of Φ and Φ_cvx tend to coincide.
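A sketch of the convex relaxation and its derivative (ours, following Equation (15)):

```python
import numpy as np
from scipy.special import erf

SQRT_2PI = np.sqrt(2.0 * np.pi)

def probit_loss_cvx(a):
    """Phi_cvx(a) of Equation (15): the probit loss for a > 0, its tangent
    line at a = 0 (value 1/2, slope -1/sqrt(2*pi)) otherwise."""
    probit = 0.5 * (1.0 - erf(a / np.sqrt(2.0)))
    return np.maximum(probit, 0.5 - a / SQRT_2PI)

def probit_loss_cvx_prime(a):
    """Derivative: Phi'(a) for a > 0, the constant -1/sqrt(2*pi) otherwise."""
    return np.where(a > 0, -np.exp(-a * a / 2.0) / SQRT_2PI, -1.0 / SQRT_2PI)
```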

[Figure 1: Understanding the PBGD3 supervised learning algorithm in terms of loss functions. Panels (a-b) show the loss functions for linear classifiers, their definitions, and their derivatives; panels (c-d) illustrate the behavior on a toy dataset: (c) the toy dataset and a decision boundary (matching the vertical line of panel (d)); (d) risk values while rotating the decision boundary, where each dashed line shows the convex counterpart of the continuous line of the same color.]

3.3.4 Illustration on a Toy Dataset

To illustrate the trade-off coming into play in the PBGD3 algorithm (and its convexified version), we conduct a small experiment on a two-dimensional toy dataset. That is, we generate positive examples according to a Gaussian centered on one mean, and negative examples according to a Gaussian centered on another mean (both Gaussians have unit variance), as shown by Figure 1(c). We then compute the risks associated with linear classifiers h_w of varying norm ‖w‖. Figure 1(d) shows the risks of three different classifiers, one per value of ‖w‖, while rotating the decision boundary around the origin. The 0-1 loss associated with the majority vote classifier does not rely on the norm ‖w‖. However, we clearly see that the probit loss of the Gibbs classifier converges to the majority vote 0-1 loss as ‖w‖ increases (the dashed lines correspond to the convex surrogate of the probit loss given by Equation (15)). Thus, thanks to the specialization of the Gibbs classifier to the linear classifier, the smoothness of the surrogate loss is regularized by the norm ‖w‖.

4 Two New Domain Adaptation Bounds

The originality of our contribution is to theoretically design two domain adaptation frameworks suitable for the PAC-Bayesian approach. In Section 4.1, we first follow the spirit of the seminal works recalled in Section 2 by proving a similar trade-off for the Gibbs classifier. Then in Section 4.2, we propose a novel trade-off based on the specificities of the Gibbs classifier that come from Equation (7).

4.1 In the Spirit of the Seminal Works

While the domain adaptation bounds presented in Section 2 focus on a single classifier, we first define a ρ-average divergence measure to compare the marginals. This leads us to derive our first domain adaptation bound.

4.1.1 A Domains’ Divergence for PAC-Bayesian Analysis

As discussed in Section 2.2, the derivation of generalization guarantees in domain adaptation critically needs a divergence measure between the source and target marginals. For the PAC-Bayesian setting, we propose a domain disagreement pseudometric (a pseudometric d is a metric for which the property d(x, y) = 0 ⟺ x = y is relaxed to d(x, x) = 0) to measure the structural difference between domain marginals in terms of a posterior distribution ρ over H. Since we are interested in learning a ρ-weighted majority vote leading to good generalization guarantees, we propose to follow the idea spurred by the C-bound of Equation (4): given a source domain S, a target domain T, and a posterior distribution ρ, if R_S(G_ρ) and R_T(G_ρ) are similar, then R_S(B_ρ) and R_T(B_ρ) are similar when d_{S_X}(ρ) and d_{T_X}(ρ) are also similar. Thus, the domains S and T are close according to ρ if the expected disagreements over the two domains tend to be close. We then define our pseudometric as follows.

Definition 1

Let H be a hypothesis class. For any marginal distributions S_X and T_X over X, and any distribution ρ on H, the domain disagreement dis_ρ(S_X, T_X) between S_X and T_X is defined by

dis_ρ(S_X, T_X) = | d_{T_X}(ρ) − d_{S_X}(ρ) |.

Note that dis_ρ is symmetric and fulfills the triangle inequality.

4.1.2 Comparison of the HΔH-divergence and our domain disagreement

While the HΔH-divergence of Theorem 1 is difficult to jointly optimize with the empirical source error, our empirical disagreement measure is easier to manipulate: we simply need to compute the ρ-average of the classifiers' disagreement instead of finding the pair of classifiers that maximizes the disagreement. Indeed, dis_ρ(S_X, T_X) depends on the majority vote, which suggests that we can directly minimize it via its empirical counterpart. This can be done without reweighting instances, changing the representation space, or modifying the family of classifiers. On the contrary, d_{HΔH}(S_X, T_X) is a supremum over all (h, h′) ∈ H² and hence does not depend on the classifier on which the risk is considered. Moreover, dis_ρ(S_X, T_X) (the ρ-average) is lower than d_{HΔH}(S_X, T_X) (the worst case). Indeed, for every ρ over H and all marginals S_X and T_X over X, we have

(1/2) d_{HΔH}(S_X, T_X) = sup_{(h,h′)∈H²} | d_{T_X}(h, h′) − d_{S_X}(h, h′) |
≥ | E_{(h,h′)∼ρ²} [ d_{T_X}(h, h′) − d_{S_X}(h, h′) ] | = dis_ρ(S_X, T_X).
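Since dis_ρ only involves ρ-averages of pairwise disagreements, its empirical counterpart is cheap to compute; a sketch (ours), where the ρ-average replaces the supremum of the HΔH-distance:

```python
import numpy as np

def pairwise_disagreements(votes):
    """Empirical matrix P_x[h_i(x) != h_j(x)], given votes[i, j] = h_i(x_j)
    with entries in {-1, +1}."""
    return (1.0 - votes @ votes.T / votes.shape[1]) / 2.0

def domain_disagreement(votes_src, votes_tgt, rho):
    """Empirical counterpart of Definition 1: |d_T(rho) - d_S(rho)|."""
    d_s = rho @ pairwise_disagreements(votes_src) @ rho
    d_t = rho @ pairwise_disagreements(votes_tgt) @ rho
    return abs(d_t - d_s)
```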

4.1.3 A Domain Adaptation Bound for the Stochastic Gibbs Classifier

We now derive our first main result in the following theorem: a domain adaptation bound relevant in a PAC-Bayesian setting, which relies on the domain disagreement of Definition 1.

Theorem 4

Let H be a hypothesis class. We have

∀ρ on H,  R_T(G_ρ) ≤ R_S(G_ρ) + (1/2) dis_ρ(S_X, T_X) + λ_ρ,

where λ_ρ is the deviation between the expected joint errors (Equation 6) of G_ρ on the target and source domains:

λ_ρ = | e_T(ρ) − e_S(ρ) |.   (16)

Proof.  First, from Equation (7), we recall that, given a domain D on X × Y and a distribution ρ over H, we have

R_D(G_ρ) = (1/2) d_{D_X}(ρ) + e_D(ρ).

Therefore,

R_T(G_ρ) = R_S(G_ρ) + (1/2) ( d_{T_X}(ρ) − d_{S_X}(ρ) ) + ( e_T(ρ) − e_S(ρ) )
≤ R_S(G_ρ) + (1/2) | d_{T_X}(ρ) − d_{S_X}(ρ) | + | e_T(ρ) − e_S(ρ) |
= R_S(G_ρ) + (1/2) dis_ρ(S_X, T_X) + λ_ρ.  □

4.1.4 Meaningful Quantities

Similar to the bounds of Theorems 1 and 2, our bound can be seen as a trade-off between different quantities. Concretely, the terms R_S(G_ρ) and dis_ρ(S_X, T_X) are akin to the first two terms of the domain adaptation bound of Theorem 1: R_S(G_ρ) is the ρ-average risk over H on the source domain, and dis_ρ(S_X, T_X) measures the ρ-average disagreement between the marginals, but is specific to the current model depending on ρ. The other term λ_ρ measures the deviation between the expected joint target and source errors of G_ρ. According to this theory, a good domain adaptation is possible if this deviation is low. However, since we suppose that we do not have any label in the target sample, we cannot control or estimate it. In practice, we suppose that λ_ρ is low and we neglect it. In other words, we assume that the labeling information between the two domains is related, and that considering only the marginal agreement and the source labels is sufficient to find a good majority vote. Another important point is that the above theorem improves the one we proposed in pbda on two points (more details are given in our research report, pbda_long). On the one hand, this bound is not degenerated when the source and target distributions are the same or close. On the other hand, our result contains only half of dis_ρ(S_X, T_X), contrary to our first bound proposed in pbda. Finally, due to the dependence of dis_ρ(S_X, T_X) and λ_ρ on the learned posterior, our bound is, in general, incomparable with the ones of Theorems 1 and 2. However, it conveys the same underlying idea: supposing that the two domains are sufficiently related, one must look for a model that minimizes a trade-off between its source risk and a distance between the domains' marginals.

4.2 A Novel Perspective on Domain Adaptation

In this section, we introduce an original approach to upper-bound the non-estimable risk of a ρ-weighted majority vote on a target domain T thanks to a term depending on its marginal distribution T_X, another one on a related source domain S, and a term capturing the "volume" of the source distribution uninformative for the target task. We base our bound on Equation (7) (recalled below), which decomposes the risk of the Gibbs classifier into a trade-off between half the expected disagreement of Equation (5) and the expected joint error of Equation (6):

R_D(G_ρ) = (1/2) d_{D_X}(ρ) + e_D(ρ).   (7)

A key observation is that the voters' disagreement does not rely on labels: we can compute d_{T_X}(ρ) using the marginal distribution T_X. Thus, in the present domain adaptation context, we have access to the target disagreement even if the target labels are unknown. However, the expected joint error can only be computed on the labeled source domain; this is what we kept in mind to define our new domain divergence.

4.2.1 Another Domain Divergence for the PAC-Bayesian Approach

We design a domains' divergence that allows us to link the target joint error with the source one by reweighting the latter. This new divergence is called the β_q-divergence and is parametrized by a real value q > 1:

β_q(T‖S) = [ E_{(x,y)∼S} ( T(x,y) / S(x,y) )^q ]^{1/q}.   (17)

It is worth noting that considering some values of q allows us to recover well-known divergences. For instance, choosing q = 2 relates our result to the χ²-distance between the domains, as β_2(T‖S) = √( χ²(T‖S) + 1 ). Moreover, we can link β_q to the Rényi divergence (for q > 1, one can show β_q(T‖S) = 2^{((q−1)/q) D_q(T‖S)}, where D_q(T‖S) is the Rényi divergence between T and S), which has led to generalization bounds in the specific context of importance weighting (CortesMM10). We denote the limit case q → ∞ by

β_∞(T‖S) = sup_{(x,y)∈SUPP(S)} ( T(x,y) / S(x,y) ),   (18)

with SUPP(S) the support of the domain S. The β_q-divergence handles the input space areas where the source domain support and the target domain support intersect. It seems reasonable to assume that, when adaptation is achievable, such areas are fairly large. However, it is likely that SUPP(T) is not entirely included in SUPP(S). We denote by T∖S the distribution of T conditional to the areas lying outside SUPP(S).
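For discrete distributions, the β_q-divergence and its limit case are straightforward to compute; a sketch (ours), under that discrete assumption:

```python
import numpy as np

def beta_q(p_tgt, p_src, q):
    """beta_q(T || S) of Equation (17) for discrete distributions, with the
    expectation restricted to the support of the source."""
    mask = p_src > 0
    return float((p_src[mask] @ (p_tgt[mask] / p_src[mask]) ** q) ** (1.0 / q))

def beta_inf(p_tgt, p_src):
    """Limit case of Equation (18): sup of the density ratio on SUPP(S)."""
    mask = p_src > 0
    return float(np.max(p_tgt[mask] / p_src[mask]))
```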