1 Introduction
As human beings, we learn from what we have seen before. Think about our education process: when a student attends a new course, the knowledge acquired from previous courses helps in understanding the current one. However, traditional machine learning approaches assume that the learning and test data are drawn from the same probability distribution. This assumption may be too strong for a lot of real-world tasks, in particular those where we want to reuse a model from one task on another. For instance, a spam filtering system suitable for one user can be poorly adapted to another who receives significantly different emails. In other words, the learning data associated with one or several users could be unrepresentative of the test data coming from another one. This raises the need to design methods for adapting a classifier from learning (source) data to test (target) data. One solution to tackle this issue is to consider the
domain adaptation framework,^1 which arises when the distribution generating the target data (the target domain) differs from the one generating the source data (the source domain). (^1 The reader can refer to the surveys proposed in JiangSurvey08; QuioneroCandela:2009; Margolis2011; wang2018deep; domain adaptation is often associated with transfer learning (PanTLTKDE09).) Note that domain adaptation is well known to be a hard and challenging task, even under strong assumptions (DavidAISTAT10; BenDavid12; BenDavidU14). Many approaches exist in the literature to address domain adaptation, often with the same underlying idea: if we are able to apply a transformation that "moves the distributions closer", then we can learn a model with the available labels. This process can be performed by reweighting the importance of labeled data (HuangSGBSnips06; SugiyamaNIPS07; CortesMM10; cortes15). This is one of the most popular methods for dealing with the covariate-shift issue (e.g., HuangSGBSnips06; sugiyama2008direct), where source and target domains diverge only in their marginals, i.e., when they share the same labeling function. Another technique is to exploit self-labeling procedures, where the objective is to transfer the source labels to the unlabeled target points (e.g., BruzzoneM10S; habrard2013iterative; PVMinCq)
. A third solution is to learn a new common representation space from the unlabeled parts of the source and target data, and then run a standard supervised learning algorithm on the source labeled instances (e.g., glorot2011domain; Chen12; courty2016optimal; ganin16; CourtyFTR17). A slightly different approach, known as hypothesis transfer learning, aims at directly transferring the learned source model to the target domain (kuzborskij2013stability; ThesisKuzborskij2018theory). The work presented in this paper falls into a fifth popular class of approaches, which has been especially explored to derive generalization bounds for domain adaptation. This kind of approach relies on the control of a measure of divergence/distance between the source and target distributions (e.g., BenDavidNIPS06; BenDavidMLJ2010; li2007bayesian; Zhang12; morvant12; CortesM14; redko2017theoretical). Such a distance usually depends on the set of hypotheses considered by the learning algorithm. The intuition is that one must look for a set that minimizes the distance between the distributions while preserving good performance on the source data; if the distributions are close under this measure, then generalization ability may be "easier" to quantify. In fact, defining such a measure to quantify how much the domains are related is a major issue in domain adaptation. For example, for binary classification with the
0-1 loss, BenDavidMLJ2010; BenDavidNIPS06 have considered the $H\Delta H$-divergence between the source and target marginal distributions. This quantity depends on the maximal disagreement between two classifiers, and allowed them to deduce a domain adaptation generalization bound based on the VC-dimension theory. The discrepancy distance proposed by MansourCOLT09 generalizes this divergence to real-valued functions and more general losses, and is used to obtain a generalization bound based on the Rademacher complexity. In this context, CortesM11; CortesM14 have specialized the minimization of the discrepancy to regression with kernels. In these situations, domain adaptation can be viewed as a multiple trade-off between the complexity of the hypothesis class $H$, the adaptation ability of $H$ according to the divergence between the marginals, and the empirical source risk. Moreover, other measures have been exploited under different assumptions, such as the Rényi divergence suitable for importance weighting (MansourMR09), the measure proposed by Zhang12 which takes into account the source and target true labeling, the Bayesian divergence prior (li2007bayesian) which favors classifiers closer to the best source model, or the Wasserstein distance (redko2017theoretical) that justifies the usefulness of optimal transport strategies in domain adaptation (courty2016optimal; CourtyFTR17). However, a majority of methods prefer to perform a two-step approach: (i) first construct a suitable representation by minimizing the divergence, then (ii) learn a model on the source domain in the new representation space. Given the multitude of concurrent approaches for domain adaptation, and the non-existence of a predominant one, we believe that the problem still needs to be studied from different perspectives for a global comprehension to emerge. We aim to contribute to this study from a PAC-Bayesian standpoint.
One particularity of the PAC-Bayesian theory (first set out by Mcallester99a) is that it focuses on algorithms that output a posterior distribution $\rho$ over a classifier set $H$ (i.e., a $\rho$-weighted average over $H$) rather than just a single predictor (as in BenDavidNIPS06, and other works cited above). More specifically, we tackle the unsupervised domain adaptation setting for binary classification, where no target labels are provided to the learner. We propose two domain adaptation analyses, both introduced separately in previous conference papers (pbda; dalc). We refine these results, and provide an in-depth comparison, full proofs, and technical details. Our analyses highlight different angles that one can adopt when studying domain adaptation.
Our first approach follows the philosophy of the seminal works of BenDavidMLJ2010; BenDavidNIPS06 and MansourMR09: the risk of the target model is upper-bounded jointly by the model's risk on the source distribution, a divergence between the marginal distributions, and a non-estimable term^2 related to the ability to adapt in the current space. (^2 More precisely, this term can only be estimated in the presence of labeled data from both the source and the target domains.) To obtain such a result, we define a pseudometric which is ideal for the PAC-Bayesian setting: it evaluates the domains' divergence according to the $\rho$-average disagreement of the classifiers over the domains. Additionally, we prove that this domains' divergence is always lower than the popular $H\Delta H$-divergence, and is easily estimable from samples. Note that, based on this disagreement measure, we derived in a previous work (pbda) a first PAC-Bayesian domain adaptation bound expressed as a $\rho$-averaging. We provide here a new version of this result, which does not change the underlying philosophy supported by the previous bound but clearly improves the theoretical result: the domain adaptation bound is now tighter and easier to interpret. Our second analysis (introduced in dalc) consists in a target risk bound that brings an original way to think about domain adaptation problems. Concretely, the risk of the target model is still upper-bounded by three terms, but they differ in the information they capture. The first term is estimable from unlabeled data and relies on the disagreement of the classifiers only on the target domain. The second term depends on the expected accuracy of the classifiers on the source domain. Interestingly, the latter is weighted by a divergence between the source and the target domains that enables controlling the relationship between them. The third term estimates the "volume" of the target domain living apart from the source one,^3 which has to be small for ensuring adaptation. (^3 Here we do not focus on learning a new representation to help the adaptation: we directly aim at adapting in the current representation space.)
Thanks to these results, we derive PAC-Bayesian generalization bounds for our two domain adaptation bounds. Then, in contrast to the majority of methods that perform a two-step procedure, we design two algorithms tailored to linear classifiers, called pbda and dalc, which jointly minimize the multiple trade-offs implied by the bounds. On the one hand, pbda is inspired by our first analysis, in which the first two quantities are, as usual in the PAC-Bayesian approach, the complexity of the $\rho$-weighted majority vote measured by a Kullback-Leibler divergence, and the empirical risk measured by the $\rho$-average error on the source sample. The third quantity corresponds to our domains' divergence and assesses the capacity of the posterior distribution $\rho$ to distinguish some structural difference between the source and target samples. On the other hand, dalc is inspired by our second analysis, from which we deduce that a good adaptation strategy consists in finding a $\rho$-weighted majority vote leading to a suitable trade-off—controlled by the domains' divergence—between the first two terms (and the usual Kullback-Leibler divergence): minimizing the first one corresponds to looking for classifiers that disagree on the target domain, and minimizing the second one to seeking accurate classifiers on the source one. The rest of the paper is structured as follows. Section 2 deals with two seminal works on domain adaptation. The PAC-Bayesian framework is then recalled in Section 3. Note that, for the sake of completeness, we provide for the first time the explicit derivation of the algorithm PBGD3 (germain2009pac) tailored to linear classifiers in supervised learning. Our main contribution, which consists in two domain adaptation bounds suitable for PAC-Bayesian learning, is presented in Section 4; the associated generalization bounds are derived in Section 5. We then design our new algorithms for PAC-Bayesian domain adaptation in Section 6, and experiment with them in Section 7. We conclude in Section 8.
2 Domain Adaptation Related Works
In this section, we review the two seminal works in domain adaptation that are based on a divergence measure between the domains (BenDavidMLJ2010; BenDavidNIPS06; MansourCOLT09).
2.1 Notations and Setting
We consider domain adaptation for binary classification tasks where $X \subseteq \mathbb{R}^d$ is the input space of dimension $d$, and $Y = \{-1, +1\}$ is the output/label set. The source domain $P_S$ and the target domain $P_T$ are two different distributions (unknown and fixed) over $X \times Y$, with $D_S$ and $D_T$ being the respective marginal distributions over $X$. We tackle the challenging task where we have no target labels, known as unsupervised domain adaptation. A learning algorithm is then provided with a labeled source sample $S = \{(x_i, y_i)\}_{i=1}^{m}$ consisting of $m$ examples drawn i.i.d.^4 from $P_S$, and an unlabeled target sample $T = \{x_j\}_{j=1}^{m'}$ consisting of $m'$ examples drawn i.i.d. from $D_T$. (^4 i.i.d. stands for independent and identically distributed.) We denote the distribution of such a $m$-sample by $(P_S)^m$. We suppose that $H$ is a set of hypothesis functions from $X$ to $Y$. The expected source error and the expected target error of $h \in H$ over $P_S$, respectively $P_T$, are the probabilities that $h$ errs on the entire distribution $P_S$, respectively $P_T$:
$$R_{P_S}(h) = \mathbb{E}_{(x,y) \sim P_S}\, L_{0\text{-}1}\big(h(x), y\big), \qquad R_{P_T}(h) = \mathbb{E}_{(x,y) \sim P_T}\, L_{0\text{-}1}\big(h(x), y\big),$$
where $L_{0\text{-}1}(a, b)$ is the 0-1 loss function, which returns $1$ if $a \neq b$ and $0$ otherwise. The empirical source error of $h$ on the learning source sample $S$ is
$$R_S(h) = \frac{1}{m} \sum_{i=1}^{m} L_{0\text{-}1}\big(h(x_i), y_i\big).$$
The main objective in domain adaptation is then to learn—without target labels—a classifier $h \in H$ leading to the lowest expected target error $R_{P_T}(h)$.
Given two classifiers $h, h' \in H$, we also introduce the notion of expected source disagreement $R_{D_S}(h, h')$ and expected target disagreement $R_{D_T}(h, h')$, which measure the probability that $h$ and $h'$ do not agree on the respective marginal distributions, and are defined by
$$R_{D_S}(h, h') = \mathbb{E}_{x \sim D_S}\, L_{0\text{-}1}\big(h(x), h'(x)\big), \qquad R_{D_T}(h, h') = \mathbb{E}_{x \sim D_T}\, L_{0\text{-}1}\big(h(x), h'(x)\big).$$
The empirical source disagreement on $S$ and the empirical target disagreement on $T$ are
$$R_S(h, h') = \frac{1}{m} \sum_{i=1}^{m} L_{0\text{-}1}\big(h(x_i), h'(x_i)\big), \qquad R_T(h, h') = \frac{1}{m'} \sum_{j=1}^{m'} L_{0\text{-}1}\big(h(x_j), h'(x_j)\big).$$
Note that, depending on the context, $S$ denotes either the labeled source sample or its unlabeled part. We can remark that the expected error on a distribution can be viewed as a shortcut notation for the expected disagreement between a hypothesis $h$ and a labeling function $f$ that assigns the true label to an example description with respect to that distribution.
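As a concrete illustration, the empirical quantities above can be computed directly from samples. The sketch below (all helper names are ours) estimates the empirical error of a classifier and the empirical disagreement of a pair of classifiers; note that the disagreement requires no labels, so it can be evaluated on the unlabeled target sample:

```python
import numpy as np

def empirical_error(h, X, y):
    """Empirical error R_S(h): average 0-1 loss on a labeled sample."""
    return np.mean(h(X) != y)

def empirical_disagreement(h1, h2, X):
    """Empirical disagreement R_S(h1, h2): fraction of points where the
    two classifiers differ. No labels needed."""
    return np.mean(h1(X) != h2(X))

# Two hypothetical linear classifiers on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.sign(X[:, 0])  # true labeling: sign of the first coordinate

h1 = lambda X: np.sign(X @ np.array([1.0, 0.0]))  # perfect on this task
h2 = lambda X: np.sign(X @ np.array([0.0, 1.0]))  # orthogonal classifier

print(empirical_error(h1, X, y))          # 0.0
print(empirical_disagreement(h1, h2, X))  # ~0.5 on isotropic Gaussian data
```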
2.2 Necessity of a Domains’ Divergence
The domain adaptation objective is to find a low-error target hypothesis, even if the target labels are not available. Even under strong assumptions, this task can be impossible to solve (DavidAISTAT10; BenDavid12; BenDavidU14). However, for deriving generalization guarantees in a domain adaptation situation (with the help of a domain adaptation bound), it is critical to make use of a divergence between the source and the target domains: the more similar the domains, the easier the adaptation appears. Several previous works have proposed different quantities to estimate how close a domain is to another one (BenDavidNIPS06; li2007bayesian; MansourCOLT09; MansourMR09; BenDavidMLJ2010; Zhang12). Concretely, two domains $P_S$ and $P_T$ differ if their marginals $D_S$ and $D_T$ are different, or if the source labeling function differs from the target one, or if both happen. This suggests taking into account two divergences: one between $D_S$ and $D_T$, and one between the labeling functions. If we have some target labels, we can combine the two distances as done by Zhang12. Otherwise, we preferably consider two separate measures, since it is impossible to estimate the best target hypothesis in such a situation. Usually, we suppose that the source labeling function is somehow related to the target one, and we then look for a representation where the marginals $D_S$ and $D_T$ appear closer without losing performance on the source domain.
2.3 Domain Adaptation Bounds for Binary Classification
We now review the first two seminal works which propose domain adaptation bounds based on a divergence between the two domains.
First, under the assumption that there exists a hypothesis in $H$ that performs well on both the source and the target domains, BenDavidNIPS06; BenDavidMLJ2010 have provided the following domain adaptation bound.
Theorem 1 (BenDavidMLJ2010; BenDavidNIPS06)
Let $H$ be a (symmetric^5) hypothesis class. (^5 In a symmetric hypothesis class $H$, for every $h \in H$, its inverse $-h$ is also in $H$.) For all $h \in H$, we have
$$R_{P_T}(h) \;\leq\; R_{P_S}(h) + \frac{1}{2}\, d_{H\Delta H}(D_S, D_T) + \lambda, \qquad (1)$$
where
$$d_{H\Delta H}(D_S, D_T) = 2 \sup_{(h, h') \in H^2} \Big| R_{D_T}(h, h') - R_{D_S}(h, h') \Big|$$
is the $H\Delta H$-distance between the marginals $D_S$ and $D_T$, and
$$\lambda = \min_{h \in H} \Big( R_{P_S}(h) + R_{P_T}(h) \Big)$$
is the error of the best hypothesis overall.
This bound relies on three terms. The first term is the classical expected error on the source domain. The second term depends on $H$ and corresponds to the maximal deviation between the source and target disagreements of two hypotheses of $H$. In other words, it quantifies how hypotheses from $H$ can "detect" differences between the marginals: the lower this measure is for a given $H$, the better the generalization guarantees are. The last term $\lambda$ is related to the best hypothesis over the domains and acts as a quality measure of $H$ in terms of labeling information. If this best hypothesis does not perform well on both the source and the target domains, then there is no way one can adapt from this source to this target. Hence, as pointed out by the authors, Equation (1) expresses a multiple trade-off between the accuracy of some particular hypothesis $h$, the complexity of $H$ (quantified in BenDavidMLJ2010 with the usual VC-bound theory), and the "incapacity" of the hypotheses of $H$ to detect differences between the source and the target domains.
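To make the worst-case nature of this divergence concrete, here is a minimal sketch that estimates the empirical $H\Delta H$-divergence restricted to a finite pool of hypotheses (a simplification we introduce for illustration: the actual supremum ranges over the whole class $H$):

```python
import numpy as np
from itertools import product

def empirical_hdh_divergence(hypotheses, XS, XT):
    """Empirical H-Delta-H divergence between two unlabeled samples,
    restricted to a finite pool of hypotheses:
    2 * max over pairs (h, h') of | dis_T(h, h') - dis_S(h, h') |."""
    best = 0.0
    for h, h2 in product(hypotheses, repeat=2):
        dis_S = np.mean(h(XS) != h2(XS))
        dis_T = np.mean(h(XT) != h2(XT))
        best = max(best, abs(dis_T - dis_S))
    return 2.0 * best

rng = np.random.default_rng(0)
XS = rng.normal(size=(500, 2))      # source marginal sample
XT = XS + np.array([3.0, 0.0])      # target sample with a shifted marginal

# Hypothetical finite pool: threshold classifiers on the first coordinate.
hyps = [(lambda X, c=c: np.sign(X[:, 0] - c)) for c in (-1.0, 0.0, 1.0, 2.0)]

print(empirical_hdh_divergence(hyps, XS, XS))  # identical marginals: 0.0
print(empirical_hdh_divergence(hyps, XS, XT))  # shifted marginals: large
```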
Second, MansourCOLT09 have extended the $H\Delta H$-distance to the discrepancy divergence for regression and any symmetric loss $L$ fulfilling the triangle inequality. Given such a loss, the discrepancy between $D_S$ and $D_T$ is
$$disc_L(D_S, D_T) = \sup_{(h, h') \in H^2} \Big| \mathbb{E}_{x \sim D_T}\, L\big(h(x), h'(x)\big) - \mathbb{E}_{x \sim D_S}\, L\big(h(x), h'(x)\big) \Big|.$$
Note that, with the 0-1 loss in binary classification, we have
$$d_{H\Delta H}(D_S, D_T) = 2\, disc_{L_{0\text{-}1}}(D_S, D_T).$$
Even if these two divergences may coincide, the following domain adaptation bound of MansourCOLT09 differs from Theorem 1.
Theorem 2 (MansourCOLT09)
Let $H$ be a (symmetric) hypothesis class. For all $h \in H$, we have
$$R_{P_T}(h) - R_{P_T}(h_T^*) \;\leq\; R_{D_S}(h, h_S^*) + disc_{L_{0\text{-}1}}(D_S, D_T) + \eta, \qquad (2)$$
with $\eta = R_{D_S}(h_T^*, h_S^*)$ the disagreement between the ideal hypotheses on the target and source domains: $h_T^* = \operatorname{argmin}_{h \in H} R_{P_T}(h)$ and $h_S^* = \operatorname{argmin}_{h \in H} R_{P_S}(h)$.
Equation (2) can be tighter than Equation (1)^6 since it bounds the difference between the target error of a classifier and the one of the optimal $h_T^*$. (^6 Equation (1) can lead to an error term three times higher than Equation (2) in some cases; more details in MansourCOLT09.) Based on Theorem 2 and a Rademacher complexity analysis, MansourCOLT09 provide a generalization bound on the target risk that expresses a trade-off between the disagreement between $h$ and the best source hypothesis $h_S^*$, the complexity of $H$, and—again—the "incapacity" of hypotheses to detect differences between the domains.
To conclude, the domain adaptation bounds of Theorems 1 and 2 suggest that if the divergence between the domains is low, a low-error classifier on the source domain might perform well on the target one. These divergences compute the worst case of the disagreement between a pair of hypotheses. In Section 4, we propose two average-case approaches by making use of the essence of the PAC-Bayesian theory, which is known to offer tight generalization bounds (Mcallester99a; germain2009pac; ParradoHernandez12). Our first approach (see Section 4.1) follows the philosophy of these seminal works, and the second one (see Section 4.2) brings a different and novel point of view by taking advantage of the PAC-Bayesian framework that we recall in the next section.
3 PACBayesian Theory in Supervised Learning
Let us now review the PAC-Bayesian theory in the classical supervised binary classification framework, as first introduced by Mcallester99a. This theory succeeds in providing tight generalization guarantees—without relying on any validation set—on weighted majority votes, i.e., for ensemble methods (dietterich2000ensemble; re2012ensemble) where several classifiers (or voters) are assigned specific weights. Throughout this section, we adopt an algorithm design perspective. Indeed, the PAC-Bayesian analysis of domain adaptation provided in the forthcoming sections is driven by the motivation of creating new adaptive algorithms.
3.1 Notations and Setting
Traditionally, PAC-Bayesian theory considers weighted majority votes over a set $H$ of binary hypotheses, often called voters. Let $P$ be a fixed yet unknown distribution over $X \times Y$, and $S = \{(x_i, y_i)\}_{i=1}^{m}$ be a learning set where each example is drawn i.i.d. from $P$. Then, given a prior distribution $\pi$ over $H$ (independent from the learning set $S$), the "PAC-Bayesian" learner aims at finding a posterior distribution $\rho$ over $H$ leading to a weighted majority vote $B_\rho$ (also called the Bayes classifier) with good generalization guarantees, defined by
$$B_\rho(x) = \operatorname{sign}\Big( \mathbb{E}_{h \sim \rho}\; h(x) \Big).$$
However, minimizing the risk of $B_\rho$, defined as
$$R_P(B_\rho) = \mathbb{E}_{(x,y) \sim P}\, L_{0\text{-}1}\big(B_\rho(x), y\big),$$
is known to be NP-hard. To tackle this issue, the PAC-Bayesian approach deals with the risk of the stochastic Gibbs classifier $G_\rho$ associated with $\rho$ and closely related to $B_\rho$. In order to predict the label of an example $x$, the Gibbs classifier first draws a hypothesis $h$ from $H$ according to $\rho$, then returns $h(x)$ as the label. The error of the Gibbs classifier on a domain $P$ then corresponds to the expectation of the errors over $\rho$:
$$R_P(G_\rho) = \mathbb{E}_{h \sim \rho}\; R_P(h). \qquad (3)$$
In this setting, if $B_\rho$ misclassifies $x$, then at least half of the classifiers (under measure $\rho$) err on $x$. Hence, we have
$$R_P(B_\rho) \;\leq\; 2\, R_P(G_\rho).$$
Another result on the relation between $R_P(B_\rho)$ and $R_P(G_\rho)$ is the bound of Lacasse07, expressed as
$$R_P(B_\rho) \;\leq\; 1 - \frac{\big(1 - 2\, R_P(G_\rho)\big)^2}{1 - 2\, d_{D}(\rho)}, \qquad (4)$$
where $d_{D}(\rho)$ corresponds to the expected disagreement of the classifiers over the marginal distribution $D$ of $P$:
$$d_{D}(\rho) = \mathbb{E}_{(h, h') \sim \rho^2}\; R_{D}(h, h'). \qquad (5)$$
Equation (4) suggests that for a fixed numerator, i.e., a fixed risk of the Gibbs classifier, the best weighted majority vote is the one associated with the lowest denominator, i.e., with the greatest disagreement between its voters (for further analysis, see graalneverending).
We now introduce the notion of expected joint error of a pair of classifiers $(h, h')$ drawn according to the distribution $\rho^2$, defined as
$$e_P(\rho) = \mathbb{E}_{(h, h') \sim \rho^2}\; \mathbb{E}_{(x,y) \sim P}\; L_{0\text{-}1}\big(h(x), y\big)\, L_{0\text{-}1}\big(h'(x), y\big). \qquad (6)$$
From the definitions of the expected disagreement and the expected joint error, Lacasse07; graalneverending observed that, given a domain $P$ on $X \times Y$ (with marginal $D$ over $X$) and a distribution $\rho$ on $H$, we can decompose the Gibbs risk as
$$R_P(G_\rho) = \frac{1}{2}\, d_{D}(\rho) + e_P(\rho). \qquad (7)$$
Indeed, for binary outputs, every example $(x, y)$ and pair of voters $(h, h')$ satisfy $L_{0\text{-}1}\big(h(x), h'(x)\big) = L_{0\text{-}1}\big(h(x), y\big) + L_{0\text{-}1}\big(h'(x), y\big) - 2\, L_{0\text{-}1}\big(h(x), y\big)\, L_{0\text{-}1}\big(h'(x), y\big)$; taking the expectation over $\rho^2$ and $P$ gives $d_D(\rho) = 2\, R_P(G_\rho) - 2\, e_P(\rho)$, which is Equation (7).
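The decomposition of Equation (7) is an exact identity and can be checked numerically. The sketch below does so for a uniform posterior over a small set of random linear voters (all names are ours, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.3 * X[:, 1])

# Finite hypothesis set with uniform posterior rho: 10 random linear voters.
W = rng.normal(size=(10, 2))                 # weight vectors
P = np.sign(X @ W.T)                         # predictions, shape (200, 10)
errors = (P != y[:, None])                   # per-example 0-1 losses

gibbs_risk = errors.mean()                   # E_{h~rho} R(h), uniform rho

# Expected disagreement and expected joint error over pairs (h, h') ~ rho^2.
n = W.shape[0]
disagreement = np.mean([(P[:, i] != P[:, j]).mean()
                        for i in range(n) for j in range(n)])
joint_error = np.mean([(errors[:, i] & errors[:, j]).mean()
                       for i in range(n) for j in range(n)])

# The identity R(G_rho) = d/2 + e holds exactly on any sample.
assert np.isclose(gibbs_risk, disagreement / 2 + joint_error)
```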
Lastly, PAC-Bayesian theory allows one to bound the expected error $R_P(G_\rho)$ in terms of two major quantities: the empirical error
$$R_S(G_\rho) = \mathbb{E}_{h \sim \rho}\; \frac{1}{m} \sum_{i=1}^{m} L_{0\text{-}1}\big(h(x_i), y_i\big)$$
estimated on a sample $S \sim (P)^m$, and the Kullback-Leibler divergence between the posterior $\rho$ and the prior $\pi$:
$$\mathrm{KL}(\rho \,\|\, \pi) = \mathbb{E}_{h \sim \rho}\; \ln \frac{\rho(h)}{\pi(h)}.$$
We present in the next section the PAC-Bayesian theorem proposed by catoni2007pac.^7 (^7 Two other common forms of the PAC-Bayesian theorem are the one of Mcallester99a and the one of Seeger02; Langford05. We refer the reader to our research report (pbda_long) for a larger variety of PAC-Bayesian theorems in a domain adaptation context.)
3.2 A Usual PACBayesian Theorem
Usual PAC-Bayesian theorems suggest that, in order to minimize the expected risk, a learning algorithm should perform a trade-off between empirical risk minimization and KL-divergence minimization (roughly speaking, the complexity term). The nature of this trade-off can be explicitly controlled in Theorem 3 below. This PAC-Bayesian result, first proposed by catoni2007pac, is defined with a hyperparameter (here named $C$). It appears to be a natural tool for designing PAC-Bayesian algorithms. We present this result in the simplified form suggested by germain09b.
Theorem 3 (catoni2007pac)
For any domain $P$ over $X \times Y$, for any set of hypotheses $H$, any prior distribution $\pi$ over $H$, any $\delta \in (0, 1]$, and any real number $C > 0$, with a probability at least $1 - \delta$ over the random choice of $S \sim (P)^m$, for every posterior distribution $\rho$ on $H$, we have
$$R_P(G_\rho) \;\leq\; \frac{C}{1 - e^{-C}} \left[ R_S(G_\rho) + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln \frac{1}{\delta}}{m \times C} \right]. \qquad (8)$$
Similarly to mcallesterkeshet11, we could choose to restrict $C \in (0, 2)$ to obtain a slightly looser but simpler bound. Using $\frac{C}{1 - e^{-C}} \leq \frac{1}{1 - C/2}$ to upper-bound the multiplicative factor on the right-hand side of Equation (8), we obtain
$$R_P(G_\rho) \;\leq\; \frac{1}{1 - \frac{C}{2}} \left[ R_S(G_\rho) + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln \frac{1}{\delta}}{m \times C} \right]. \qquad (9)$$
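As an illustration, the value of the bound of Equation (8) is straightforward to compute. The sketch below evaluates it for a hypothetical empirical Gibbs risk and KL value, letting $C$ shrink as $1/\sqrt{m}$ so that the bound approaches the empirical risk as the sample grows:

```python
import math

def catoni_bound(emp_gibbs_risk, kl, m, C, delta=0.05):
    """Value of the right-hand side of Equation (8):
    C/(1 - e^-C) * [ R_S(G_rho) + (KL(rho||pi) + ln(1/delta)) / (m*C) ]."""
    return (C / (1.0 - math.exp(-C))) * (
        emp_gibbs_risk + (kl + math.log(1.0 / delta)) / (m * C)
    )

# Hypothetical values: empirical Gibbs risk 0.10 and KL divergence 5.0.
bounds = [catoni_bound(0.10, kl=5.0, m=m, C=1.0 / math.sqrt(m))
          for m in (100, 10_000, 1_000_000)]
print([round(b, 3) for b in bounds])  # decreasing toward the empirical risk
```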
The bound of Theorem 3—in both forms of Equations (8) and (9)—has two appealing characteristics. First, choosing $C$ of order $1/\sqrt{m}$, the bound becomes consistent: it converges to the empirical risk $R_S(G_\rho)$ as $m$ grows. Second, as described in Section 3.3, its minimization is closely related to the minimization problem associated with the Support Vector Machine (svm) algorithm when $\rho$ is an isotropic Gaussian over the space of linear classifiers (germain2009pac). Hence, the value of $C$ allows us to control the trade-off between the empirical risk $R_S(G_\rho)$ and the "complexity term" $\mathrm{KL}(\rho \,\|\, \pi)$.
3.3 Supervised PAC-Bayesian Learning of Linear Classifiers
Let us consider $H$ as a set of linear classifiers in a $d$-dimensional space. Each $h_w \in H$ is defined by a weight vector $w \in \mathbb{R}^d$:
$$h_w(x) = \operatorname{sign}(w \cdot x),$$
where $w \cdot x$ denotes the dot product between $w$ and $x$.
By restricting the prior and the posterior distributions over $H$ to be Gaussian distributions, Langford02 have specialized the PAC-Bayesian theory in order to bound the expected risk of any linear classifier $h_w$. More precisely, the prior $\pi_0$ and the posterior $\rho_w$ are defined as spherical Gaussians with identity covariance matrix, respectively centered on the vectors $0$ and $w$. An interesting property of these distributions—also seen as multivariate normal distributions $\mathcal{N}(0, I)$ and $\mathcal{N}(w, I)$—is that the prediction of the $\rho_w$-weighted majority vote $B_{\rho_w}$ coincides with the one of the linear classifier $h_w$. Indeed, we have
$$B_{\rho_w}(x) = \operatorname{sign}\Big( \mathbb{E}_{h_{w'} \sim \rho_w} \operatorname{sign}(w' \cdot x) \Big) = \operatorname{sign}(w \cdot x) = h_w(x).$$
Moreover, the expected risk of the Gibbs classifier $G_{\rho_w}$ on a domain $P$ is then given by^8 (^8 The calculations leading to Equation (10) can be found in Langford05. For the sake of completeness, we provide a slightly different derivation in Appendix B.)
$$R_P(G_{\rho_w}) = \mathbb{E}_{(x,y) \sim P}\; \Phi\!\left( y\, \frac{w \cdot x}{\|x\|} \right), \qquad (10)$$
where
$$\Phi(a) = \frac{1}{2} \left[ 1 - \mathrm{Erf}\!\left( \frac{a}{\sqrt{2}} \right) \right], \qquad (11)$$
with Erf the Gauss error function, defined as
$$\mathrm{Erf}(a) = \frac{2}{\sqrt{\pi}} \int_0^a e^{-t^2}\, dt. \qquad (12)$$
Here, $\Phi\big( y \frac{w \cdot x}{\|x\|} \big)$ can be seen as a smooth surrogate of the 0-1 loss function relying on the margin $y \frac{w \cdot x}{\|x\|}$. This function is sometimes called the probit loss (e.g., mcallesterkeshet11). It is worth noting that the norm $\|w\|$ plays an important role in the value of $R_P(G_{\rho_w})$, but not in $R_P(B_{\rho_w}) = R_P(h_w)$. Indeed, $R_P(G_{\rho_w})$ tends to $R_P(h_w)$ as $\|w\|$ grows, which can provide very tight bounds (see the empirical analyses of AmbroladzePS06; germain2009pac). Finally, the KL-divergence between $\rho_w$ and $\pi_0$ simply becomes
$$\mathrm{KL}(\rho_w \,\|\, \pi_0) = \frac{\|w\|^2}{2},$$
and turns out to be a measure of complexity of the learned classifier.
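The closed form of Equation (10) can be verified by Monte Carlo sampling from the posterior $\mathcal{N}(w, I)$. A sketch on a single (hypothetical) example:

```python
import numpy as np
from math import erf, sqrt

def probit_loss(a):
    """Phi(a) = (1 - Erf(a / sqrt(2))) / 2, Eq. (11)."""
    return 0.5 * (1.0 - erf(a / sqrt(2.0)))

rng = np.random.default_rng(2)
w = np.array([2.0, -1.0])
x, y = np.array([0.5, 1.5]), -1.0

# Closed form: Pr_{w' ~ N(w, I)} [ sign(w'.x) != y ] = Phi(y * w.x / ||x||).
closed_form = probit_loss(y * (w @ x) / np.linalg.norm(x))

# Monte Carlo estimate of the Gibbs error on this single example.
W = rng.normal(size=(200_000, 2)) + w   # draws from the posterior N(w, I)
monte_carlo = np.mean(np.sign(W @ x) != y)

assert abs(closed_form - monte_carlo) < 0.01
```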
3.3.1 Objective Function and Gradient
Based on the specialization of the PAC-Bayesian theory to linear classifiers, germain2009pac suggested minimizing a PAC-Bayesian bound on $R_P(G_{\rho_w})$. For the sake of completeness, we provide here more mathematical details than in the original conference paper (germain2009pac). In the forthcoming Section 6, we extend this supervised learning algorithm to the domain adaptation setting.
Given a sample $S = \{(x_i, y_i)\}_{i=1}^{m}$ and a hyperparameter $C > 0$, the learning algorithm performs a gradient descent in order to find an optimal weight vector $w$ that minimizes
$$f(w) = C \sum_{i=1}^{m} \Phi\!\left( y_i\, \frac{w \cdot x_i}{\|x_i\|} \right) + \frac{\|w\|^2}{2}. \qquad (13)$$
It turns out that the optimal vector $w$ corresponds to the distribution $\rho_w$ minimizing the value of the bound on $R_P(G_{\rho_w})$ given by Theorem 3, with the parameter $C$ of the theorem being the hyperparameter of the learning algorithm. It is important to point out that PAC-Bayesian theorems bound $R_P(G_\rho)$ simultaneously for every $\rho$ on $H$. Therefore, one can "freely" explore the domain of the objective function to choose a posterior distribution $\rho_w$ that gives, thanks to Theorem 3, a bound valid with probability $1 - \delta$.
The minimization of Equation (13) by gradient descent corresponds to the learning algorithm called PBGD3 of germain2009pac. The gradient of $f$ is given by the vector
$$\nabla f(w) = C \sum_{i=1}^{m} \Phi'\!\left( y_i\, \frac{w \cdot x_i}{\|x_i\|} \right) \frac{y_i\, x_i}{\|x_i\|} + w,$$
where $\Phi'(a) = -\frac{1}{\sqrt{2\pi}}\, e^{-a^2/2}$ is the derivative of $\Phi$ at point $a$.
Similarly to svm, the learning algorithm PBGD3 realizes a trade-off between the empirical risk—expressed by the probit loss $\Phi$—and the complexity of the learned linear classifier—expressed by the regularizer $\|w\|^2$. This similarity increases when we use a kernel function, as described next.
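A minimal sketch of this gradient descent, using the objective and gradient of Equation (13) (the original PBGD3 additionally uses random restarts to cope with the non-convexity; here we simply start from $w = 0$, and the toy data are our own):

```python
import numpy as np
from math import sqrt, pi

def probit_grad(a):
    """Derivative of the probit loss: Phi'(a) = -exp(-a^2/2) / sqrt(2*pi)."""
    return -np.exp(-np.square(a) / 2.0) / sqrt(2.0 * pi)

def pbgd3(X, y, C=1.0, lr=1e-3, n_iters=2000):
    """Plain gradient descent on the PBGD3 objective of Eq. (13):
    C * sum_i Phi(y_i w.x_i / ||x_i||) + ||w||^2 / 2."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalized examples
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        margins = y * (Xn @ w)
        # Gradient: C * sum_i Phi'(margin_i) * y_i x_i / ||x_i||  +  w
        grad = C * (Xn * (y * probit_grad(margins))[:, None]).sum(axis=0) + w
        w -= lr * grad
    return w

# Toy problem: two overlapping unit-variance Gaussian classes.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(300, 2)) + 1.0,
               rng.normal(size=(300, 2)) - 1.0])
y = np.concatenate([np.ones(300), -np.ones(300)])

w = pbgd3(X, y, C=5.0)
train_error = np.mean(np.sign(X @ w) != y)
```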
3.3.2 Using a Kernel Function
The kernel trick allows us to substitute the inner products by a kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ in Equation (13). If $k$ is a Mercer kernel, it implicitly represents a function $\phi$ that maps an example of $X$ into an arbitrary high-dimensional space, such that
$$k(x, x') = \phi(x) \cdot \phi(x').$$
Then, a dual weight vector $\alpha = (\alpha_1, \ldots, \alpha_m)$ encodes the linear classifier as a linear combination of the examples of $S$:
$$h_\alpha(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i\, k(x_i, x) \right).$$
By the representer theorem (scholkopf01), the vector $w$ minimizing Equation (13) can be recovered by finding the vector $\alpha$ that minimizes
$$f(\alpha) = C \sum_{i=1}^{m} \Phi\!\left( y_i\, \frac{\sum_{j=1}^{m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) + \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j K_{i,j}, \qquad (14)$$
where $K$ is the kernel matrix of size $m \times m$, with $K_{i,j} = k(x_i, x_j)$.^9 (^9 It is non-trivial to show that the kernel trick holds when $\pi$ and $\rho$ are Gaussian over an infinite-dimensional feature space. As mentioned by mcallesterkeshet11, it is, however, the case provided we consider Gaussian processes as measures of the distributions $\pi$ and $\rho$ over (infinite) $H$. The same analysis holds for the kernelized versions of the two forthcoming domain adaptation algorithms (Section 6.3.3).) The gradient of $f(\alpha)$ is then given by the vector $\big( \frac{\partial f}{\partial \alpha_1}, \ldots, \frac{\partial f}{\partial \alpha_m} \big)$, with
$$\frac{\partial}{\partial \alpha_l} f(\alpha) = C \sum_{i=1}^{m} \Phi'\!\left( y_i\, \frac{\sum_{j=1}^{m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) \frac{y_i\, K_{i,l}}{\sqrt{K_{i,i}}} + \sum_{i=1}^{m} \alpha_i K_{i,l},$$
for $l \in \{1, \ldots, m\}$.
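The kernelized objective of Equation (14) translates directly into code. A sketch with an RBF kernel (the kernel choice and helper names are ours):

```python
import numpy as np
from math import erf, sqrt

_erf = np.vectorize(erf)

def probit(a):
    """Probit loss Phi(a) of Eq. (11)."""
    return 0.5 * (1.0 - _erf(a / sqrt(2.0)))

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_objective(alpha, K, y, C):
    """Kernelized PBGD3 objective of Eq. (14): the margin of example i is
    y_i (K alpha)_i / sqrt(K_ii), the regularizer is alpha' K alpha / 2."""
    margins = y * (K @ alpha) / np.sqrt(np.diag(K))
    return C * probit(margins).sum() + 0.5 * alpha @ (K @ alpha)

rng = np.random.default_rng(6)
X = rng.normal(size=(5, 2))
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
K = rbf_kernel(X, X)

# At alpha = 0 every margin is 0 and Phi(0) = 1/2, so the objective is C*m/2.
alpha0 = np.zeros(5)
print(kernel_objective(alpha0, K, y, C=2.0))  # 5.0
```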
3.3.3 Improving the Algorithm Using a Convex Objective
An annoying drawback of PBGD3 is that the objective function is non-convex, and the gradient descent implementation needs many random restarts. In fact, we made extensive empirical experiments after the ones described by germain2009pac and saw that PBGD3 achieves an equivalent accuracy (at a fraction of the running time) when the loss function $\Phi$ of Equations (13) and (14) is replaced by its convex relaxation
$$\Phi_{\mathrm{cvx}}(a) = \max\left\{ \Phi(a),\; \frac{1}{2} - \frac{a}{\sqrt{2\pi}} \right\}. \qquad (15)$$
The derivative of $\Phi_{\mathrm{cvx}}$ at point $a$ is then $\Phi'(\max\{0, a\})$; in other words, it is $-\frac{1}{\sqrt{2\pi}}$ if $a < 0$, and $\Phi'(a)$ otherwise. Figure 1a illustrates the functions $\Phi$ and $\Phi_{\mathrm{cvx}}$. Note that the latter can be interpreted as a smooth version of the svm's hinge loss $\max\{0, 1 - a\}$. The toy experiment of Figure 1d (described in the next subsection) provides another empirical evidence that the minima of the objective function and of its convex relaxation tend to coincide.
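The convex relaxation of Equation (15) is simply the maximum between the probit loss and its tangent at the origin; a sketch:

```python
import numpy as np
from math import erf, sqrt, pi

_erf = np.vectorize(erf)

def probit(a):
    """Probit loss Phi(a) = (1 - Erf(a / sqrt(2))) / 2."""
    return 0.5 * (1.0 - _erf(a / sqrt(2.0)))

def probit_convex(a):
    """Convex relaxation of Eq. (15): the probit loss for a >= 0, its
    tangent at 0 (a line of slope -1/sqrt(2*pi)) for a < 0."""
    return np.maximum(probit(a), 0.5 - a / sqrt(2.0 * pi))

a = np.linspace(-3.0, 3.0, 7)
# The relaxation upper-bounds the probit loss everywhere, and coincides
# with it on a >= 0 (where the probit loss is convex).
assert np.all(probit_convex(a) >= probit(a))
assert np.allclose(probit_convex(a[a >= 0]), probit(a[a >= 0]))
```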

3.3.4 Illustration on a Toy Dataset
To illustrate the trade-off coming into play in the PBGD3 algorithm (and its convexified version), we conduct a small experiment on a two-dimensional toy dataset. That is, we generate positive and negative examples according to two unit-variance Gaussians with distinct means, as shown in Figure 1c. We then compute the risks associated with linear classifiers $h_w$. Figure 1d shows the risks for three different norms $\|w\|$, while rotating the decision boundary around the origin. The 0-1 loss associated with the majority vote classifier does not rely on the norm $\|w\|$. However, we clearly see that the probit loss of the Gibbs classifier converges to the majority vote risk as $\|w\|$ increases (the dashed lines correspond to the convex surrogate of the probit loss given by Equation (15)). Thus, thanks to the specialization of $G_{\rho_w}$ to the linear classifier, the smoothness of the surrogate loss is regularized by the norm $\|w\|$.

4 Two New Domain Adaptation Bounds
The originality of our contribution is to theoretically design two domain adaptation frameworks suitable for the PAC-Bayesian approach. In Section 4.1, we first follow the spirit of the seminal works recalled in Section 2 by proving a similar trade-off for the Gibbs classifier. Then, in Section 4.2, we propose a novel trade-off based on the specificities of the Gibbs classifier that come from Equation (7).
4.1 In the Spirit of the Seminal Works
In the following, while the domain adaptation bounds presented in Section 2 focus on a single classifier, we first define a $\rho$-average divergence measure to compare the marginals. This leads us to derive our first domain adaptation bound.
4.1.1 A Domains’ Divergence for PACBayesian Analysis
As discussed in Section 2.2, the derivation of generalization guarantees in domain adaptation critically needs a divergence measure between the source and target marginals. For the PAC-Bayesian setting, we propose a domain disagreement pseudometric^10 to measure the structural difference between the domain marginals in terms of the posterior distribution $\rho$ over $H$. (^10 A pseudometric $d$ is a metric for which the property $d(x, y) = 0 \Leftrightarrow x = y$ is relaxed to $d(x, x) = 0$.) Since we are interested in learning a $\rho$-weighted majority vote leading to good generalization guarantees, we propose to follow the idea spurred by the bound of Equation (4): given a source domain $P_S$, a target domain $P_T$, and a posterior distribution $\rho$, if $R_{P_S}(G_\rho)$ and $R_{P_T}(G_\rho)$ are similar, then $R_{P_S}(B_\rho)$ and $R_{P_T}(B_\rho)$ are similar when $d_{D_S}(\rho)$ and $d_{D_T}(\rho)$ are also similar. Thus, the domains $P_S$ and $P_T$ are close according to $\rho$ if the expected disagreements over the two domains tend to be close. We then define our pseudometric as follows.
Definition 1
Let $H$ be a hypothesis class. For any marginal distributions $D_S$ and $D_T$ over $X$, and any distribution $\rho$ on $H$, the domain disagreement $dis_\rho(D_S, D_T)$ between $D_S$ and $D_T$ is defined by
$$dis_\rho(D_S, D_T) = \Big|\, d_{D_T}(\rho) - d_{D_S}(\rho) \,\Big| = \Big|\, \mathbb{E}_{(h, h') \sim \rho^2} \big[ R_{D_T}(h, h') - R_{D_S}(h, h') \big] \,\Big|.$$
Note that $dis_\rho(\cdot, \cdot)$ is symmetric and fulfills the triangle inequality.
4.1.2 Comparison of the $H\Delta H$-divergence and our domain disagreement
While the $H\Delta H$-divergence of Theorem 1 is difficult to optimize jointly with the empirical source error, our empirical disagreement measure is easier to manipulate: we simply need to compute the $\rho$-average of the classifiers' disagreements instead of finding the pair of classifiers that maximizes the disagreement. Indeed, $dis_\rho(D_S, D_T)$ depends on the majority vote, which suggests that we can directly minimize it via its empirical counterpart. This can be done without reweighting instances, changing the representation space, or modifying the family of classifiers. On the contrary, $d_{H\Delta H}(D_S, D_T)$ is a supremum over all pairs of hypotheses and hence does not depend on the classifier on which the risk is considered. Moreover, $dis_\rho$ (the average case) is lower than $\frac{1}{2} d_{H\Delta H}$ (the worst case). Indeed, for every $H$ and every $\rho$ over $H$, we have
$$dis_\rho(D_S, D_T) = \Big|\, \mathbb{E}_{(h,h') \sim \rho^2} \big[ R_{D_T}(h, h') - R_{D_S}(h, h') \big] \,\Big| \;\leq\; \sup_{(h,h') \in H^2} \Big|\, R_{D_T}(h, h') - R_{D_S}(h, h') \,\Big| = \frac{1}{2}\, d_{H\Delta H}(D_S, D_T).$$
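The inequality between the $\rho$-average gap and the worst-case gap can be observed empirically for any finite hypothesis pool and posterior. A sketch with a uniform $\rho$ over a hypothetical pool of threshold classifiers:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
XS = rng.normal(size=(500, 2))                          # source marginal
XT = rng.normal(size=(500, 2)) + np.array([2.0, 0.0])   # shifted target marginal

# Hypothetical finite pool: axis-aligned threshold classifiers.
hyps = [(lambda X, c=c, k=k: np.sign(X[:, k] - c))
        for c in (-1.0, 0.0, 1.0) for k in (0, 1)]

def dis(h, h2, X):
    """Empirical disagreement of h and h2 on an unlabeled sample."""
    return np.mean(h(X) != h2(X))

gaps = [dis(h, h2, XT) - dis(h, h2, XS) for h, h2 in product(hyps, repeat=2)]

dis_rho = abs(np.mean(gaps))           # rho-average case (uniform rho), Definition 1
half_hdh = max(abs(g) for g in gaps)   # worst case, i.e., d_{HΔH} / 2 over the pool

assert dis_rho <= half_hdh
```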
4.1.3 A Domain Adaptation Bound for the Stochastic Gibbs Classifier
We now derive our first main result in the following theorem: A domain adaptation bound relevant in a PACBayesian setting, and that relies on the domain disagreement of Definition 1.
Theorem 4
Let $H$ be a hypothesis class. For every posterior distribution $\rho$ on $H$, we have
$$R_{P_T}(G_\rho) \;\leq\; R_{P_S}(G_\rho) + \frac{1}{2}\, dis_\rho(D_S, D_T) + \lambda_\rho,$$
where $\lambda_\rho$ is the deviation between the expected joint errors (Equation 6) of $G_\rho$ on the target and source domains:
$$\lambda_\rho = \Big|\, e_{P_T}(\rho) - e_{P_S}(\rho) \,\Big|. \qquad (16)$$
Proof. First, from Equation (7), we recall that, given a domain $P$ on $X \times Y$ and a distribution $\rho$ over $H$, we have $R_P(G_\rho) = \frac{1}{2} d_{D}(\rho) + e_P(\rho)$. Therefore,
$$\begin{aligned} R_{P_T}(G_\rho) - R_{P_S}(G_\rho) &= \tfrac{1}{2}\big( d_{D_T}(\rho) - d_{D_S}(\rho) \big) + \big( e_{P_T}(\rho) - e_{P_S}(\rho) \big) \\ &\leq \tfrac{1}{2} \Big|\, d_{D_T}(\rho) - d_{D_S}(\rho) \,\Big| + \Big|\, e_{P_T}(\rho) - e_{P_S}(\rho) \,\Big| \\ &= \tfrac{1}{2}\, dis_\rho(D_S, D_T) + \lambda_\rho. \end{aligned}$$
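Since the proof of Theorem 4 only uses the decomposition of Equation (7) and the triangle inequality, the bound also holds exactly on empirical distributions. The sketch below checks it on synthetic source and target samples (the target labels are used here only to verify the bound, never to learn):

```python
import numpy as np

rng = np.random.default_rng(5)
XS = rng.normal(size=(400, 2)); yS = np.sign(XS[:, 0])
XT = rng.normal(size=(400, 2)) + 1.0; yT = np.sign(XT[:, 0] - 0.5)

W = rng.normal(size=(15, 2))                  # uniform posterior over 15 linear voters
PS, PT = np.sign(XS @ W.T), np.sign(XT @ W.T)
ES, ET = PS != yS[:, None], PT != yT[:, None]

def pairwise_mean(M, op):
    """rho^2-expectation of op over all ordered pairs of columns (uniform rho)."""
    n = M.shape[1]
    return np.mean([op(M[:, i], M[:, j]).mean() for i in range(n) for j in range(n)])

gibbs_S, gibbs_T = ES.mean(), ET.mean()
d_S = pairwise_mean(PS, lambda a, b: a != b)  # expected disagreement, source
d_T = pairwise_mean(PT, lambda a, b: a != b)  # expected disagreement, target
e_S = pairwise_mean(ES, lambda a, b: a & b)   # expected joint error, source
e_T = pairwise_mean(ET, lambda a, b: a & b)   # expected joint error, target

dis_rho = abs(d_T - d_S)                      # domain disagreement, Definition 1
lam = abs(e_T - e_S)                          # lambda_rho of Eq. (16)
assert gibbs_T <= gibbs_S + 0.5 * dis_rho + lam + 1e-12
```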
4.1.4 Meaningful Quantities
Similar to the bounds of Theorems 1 and 2, our bound can be seen as a trade-off between different quantities. Concretely, the terms $R_{P_S}(G_\rho)$ and $dis_\rho(D_S, D_T)$ are akin to the first two terms of the domain adaptation bound of Theorem 1: $R_{P_S}(G_\rho)$ is the $\rho$-average risk over $H$ on the source domain, and $dis_\rho(D_S, D_T)$ measures the $\rho$-average disagreement between the marginals, but is specific to the current model depending on $\rho$. The other term, $\lambda_\rho$, measures the deviation between the expected joint target and source errors of $G_\rho$. According to this theory, a good domain adaptation is possible if this deviation is low. However, since we suppose that we do not have any label in the target sample, we cannot control or estimate it. In practice, we suppose that $\lambda_\rho$ is low and we neglect it. In other words, we assume that the labeling information between the two domains is related, and that considering only the marginal agreement and the source labels is sufficient to find a good majority vote. Another important point is that the above theorem improves the one we proposed in pbda in two respects.^11 (^11 More details are given in our research report (pbda_long).) On the one hand, this bound does not degenerate when the source and target distributions are the same or close. On the other hand, our result contains only half of $dis_\rho(D_S, D_T)$, contrary to our first bound proposed in pbda. Finally, due to the dependence of $dis_\rho$ and $\lambda_\rho$ on the learned posterior, our bound is, in general, incomparable with the ones of Theorems 1 and 2. However, it conveys the same underlying idea: supposing that the two domains are sufficiently related, one must look for a model that minimizes a trade-off between its source risk and a distance between the domains' marginals.
4.2 A Novel Perspective on Domain Adaptation
In this section, we introduce an original approach to upper-bound the non-estimable risk of a $\rho$-weighted majority vote on a target domain thanks to a term depending on its marginal distribution $D_T$, another one on a related source domain $P_S$, and a term capturing the "volume" of the source distribution uninformative for the target task. We base our bound on Equation (7) (recalled below), which decomposes the risk of the Gibbs classifier into the trade-off between the half of the expected disagreement of Equation (5) and the expected joint error of Equation (6):
$$R_P(G_\rho) = \frac{1}{2}\, d_{D}(\rho) + e_P(\rho). \qquad (7)$$
A key observation is that the voters' disagreement does not rely on labels: we can compute $d_{D_T}(\rho)$ using the marginal distribution $D_T$ alone. Thus, in the present domain adaptation context, we have access to $d_{D_T}(\rho)$ even if the target labels are unknown. However, the expected joint error can only be computed on the labeled source domain; this is what we kept in mind when defining our new domain divergence.
4.2.1 Another Domain Divergence for the PACBayesian Approach
We design a domains' divergence that allows us to link the target joint error with the source one by reweighting the latter. This new divergence is called the $\beta_q$-divergence, and is parametrized by a real value $q > 1$:
$$\beta_q(T \,\|\, S) = \left[ \mathbb{E}_{(x,y) \sim P_S} \left( \frac{P_T(x,y)}{P_S(x,y)} \right)^{\!q}\, \right]^{\frac{1}{q}}. \qquad (17)$$
It is worth noting that considering some particular values of $q$ allows us to recover well-known divergences. For instance, choosing $q = 2$ relates our result to the $\chi^2$-distance between the domains, as $\beta_2(T \,\|\, S) = \sqrt{\chi^2(P_T \,\|\, P_S) + 1}$. Moreover, we can link $\beta_q$ to the Rényi divergence,^12 which has led to generalization bounds in the specific context of importance weighting (CortesMM10). (^12 For $q > 1$, we can easily show $\beta_q(T \,\|\, S) = 2^{\frac{q-1}{q} D_q(P_T \| P_S)}$, where $D_q$ is the Rényi divergence between $P_T$ and $P_S$.) We denote the limit case $q \to \infty$ by
$$\beta_\infty(T \,\|\, S) = \sup_{(x,y) \in \mathrm{SUPP}(P_S)} \frac{P_T(x,y)}{P_S(x,y)}, \qquad (18)$$
with $\mathrm{SUPP}(P_S)$ the support of the domain $P_S$. The $\beta_q$-divergence handles the input space areas where the source domain support and the target domain support intersect. It seems reasonable to assume that, when adaptation is achievable, such areas are fairly large. However, it is likely that the support of $P_T$ is not entirely included in that of $P_S$. We denote the distribution of $P_T$ conditional to $\mathrm{SUPP}(P_S)$
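For discrete distributions, the $\beta_q$-divergence of Equations (17)-(18) is a simple weighted norm of the density ratio. A sketch (the arrays below stand in for the joint distributions, and we assume the support of $T$ is included in that of $S$ so the ratio is well defined):

```python
import numpy as np

def beta_q(T, S, q):
    """beta_q divergence of Eq. (17) for discrete distributions:
    beta_q(T || S) = ( E_{x~S} (T(x)/S(x))^q )^(1/q).
    Assumes SUPP(T) is included in SUPP(S)."""
    ratio = T / S
    return (S @ ratio ** q) ** (1.0 / q)

S = np.array([0.5, 0.3, 0.2])   # hypothetical source distribution
T = np.array([0.2, 0.3, 0.5])   # hypothetical target distribution

# beta_q equals 1 when the domains coincide, grows with q, and tends to
# the worst-case density ratio beta_infinity = max T(x)/S(x), Eq. (18).
assert np.isclose(beta_q(S, S, q=2), 1.0)
assert beta_q(T, S, q=2) <= beta_q(T, S, q=10) <= max(T / S)
```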