1 Introduction
The generalization bound measures the probability that a function, chosen from a function class by an algorithm, has a sufficiently small error, and it plays an important role in statistical learning theory [see 29, 12]. Generalization bounds have been widely used to study the consistency of the ERM-based learning process [29], the asymptotic convergence of the empirical process [28] and the learnability of learning models [10]. Generally, there are three essential aspects to obtaining the generalization bounds of a specific learning process: complexity measures of function classes, deviation (or concentration) inequalities, and symmetrization inequalities related to the learning process. For example, Van der Vaart and Wellner [28] presented generalization bounds based on the Rademacher complexity and the covering number, respectively. Vapnik [29] gave generalization bounds based on the Vapnik-Chervonenkis (VC) dimension. Bartlett et al. [1] proposed the local Rademacher complexity and obtained a sharp generalization bound for a particular function class. Hussain and Shawe-Taylor [16] showed improved loss bounds for multiple kernel learning. Zhang [31] analyzed the Bennett-type generalization bounds of the i.i.d. learning process.
It is noteworthy that the aforementioned results of statistical learning theory are all built under the assumption that training and test data are drawn from the same distribution (briefly, the same-distribution assumption). This assumption may not be valid in situations where training and test data have different distributions, which arise in many practical applications including speech recognition [17]
and natural language processing
[8]. Domain adaptation has recently been proposed to handle this situation: it aims to apply a learning model, trained by using the samples drawn from a certain domain (source domain), to the samples drawn from another domain (target domain) with a different distribution [see 6, 30, 9, 2, 5]. There have been some research works on the theoretical analysis of two types of domain adaptation. In the first type, the learner receives training data from several source domains, known as domain adaptation with multiple sources [see 2, 13, 14, 21, 22, 32]. In the second type, the learner minimizes a convex combination of the source and the target empirical risks, termed domain adaptation combining source and target data [see 7, 2, 32].
Without loss of generality, this paper is mainly concerned with a more representative (or general) type of domain adaptation, which combines data from multiple sources and one target (briefly, representative domain adaptation). Evidently, it covers both of the aforementioned types: domain adaptation with multiple sources and domain adaptation combining source and target data. Thus, the results of this paper are more general than the previous works, and some existing results can be regarded as special cases of this paper [see 32]. We briefly summarize the main contributions of this paper as follows.
1.1 Overview of Main Results
In this paper, we present a new framework to obtain the generalization bounds of the learning process for representative domain adaptation. Based on the resulting bounds, we then analyze the asymptotic properties of the learning process. There are four major aspects in the framework: (i) the quantity measuring the difference between two domains; (ii) the complexity measure of function classes; (iii) the deviation inequalities for multiple domains; (iv) the symmetrization inequality for representative domain adaptation.
As shown in some previous works [22, 20, 2], one of the major challenges in the theoretical analysis of domain adaptation is to measure the difference between two domains. Different from the previous works, we use the integral probability metric to measure the difference between the distributions of two domains. Moreover, we also give a comparison with the quantities proposed in the previous works.
Generally, in order to obtain the generalization bounds of a learning process, one needs to develop the related deviation (or concentration) inequalities of the learning process. Here, we use a martingale method to develop the related Hoeffding-type, Bennett-type and McDiarmid-type deviation inequalities for multiple domains, respectively. Moreover, in the situation of domain adaptation, since the source domain differs from the target domain, the desired symmetrization inequality for domain adaptation should incorporate some quantity to reflect the difference. From this point of view, we then obtain the related symmetrization inequality incorporating the integral probability metric that measures the difference between the distributions of the source and the target domains.
By applying the derived inequalities, we obtain two types of generalization bounds of the learning process for representative domain adaptation: Hoeffding-type and Bennett-type, both of which are based on the uniform entropy number. Moreover, we use the McDiarmid-type deviation inequality to obtain the generalization bounds based on the Rademacher complexity. It is noteworthy that, based on the relationship between the integral probability metric and the discrepancy distance (or the $\mathcal{H}\Delta\mathcal{H}$-divergence), the proposed framework can also lead to generalization bounds that incorporate the discrepancy distance (or the $\mathcal{H}\Delta\mathcal{H}$-divergence) [see Section 3 and Remark 5.1].
Based on the resulting generalization bounds, we study the asymptotic convergence and the rate of convergence of the learning process for representative domain adaptation. In particular, we analyze the factors that affect the asymptotic behavior of the learning process and discuss the choices of parameters in the situation of representative domain adaptation. The numerical experiments also support our theoretical findings. Meanwhile, we compare our results with the existing results of domain adaptation and the related results under the same-distribution assumption. Note that representative domain adaptation refers to a more general situation that covers both domain adaptation with multiple sources and domain adaptation combining source and target data. Thus, our results include many existing works as special cases. Additionally, our analysis can be applied to the key quantities studied in Mansour et al. [20] and Ben-David et al. [2] [see Section 3].
1.2 Organization of the Paper
The rest of this paper is organized as follows. Section 2 introduces the problem studied in this paper. Section 3 introduces the integral probability metric and then gives a comparison with other quantities. In Section 4, we introduce the uniform entropy number and the Rademacher complexity. Section 5 provides the generalization bounds for representative domain adaptation. In Section 6, we analyze the asymptotic behavior of the learning process for representative domain adaptation. Section 7 presents the numerical experiments supporting our theoretical findings. We summarize the related works in Section 8, and the last section concludes the paper. In Appendix A, we present the deviation inequalities and the symmetrization inequality, and all proofs are given in Appendix B.
2 Problem Setup
We denote $\mathcal{D}^{(k)}$ ($1\le k\le K$) and $\mathcal{D}^{(T)}$ as the $k$-th source domain and the target domain, respectively. Set $N := \sum_{k=1}^{K} N_k$. Let $\mathcal{P}^{(k)}$ and $\mathcal{P}^{(T)}$ stand for the distributions of the input spaces $\mathcal{X}^{(k)}$ and $\mathcal{X}^{(T)}$, respectively. Denote $g^{(k)}$ and $g^{(T)}$ as the labeling functions of $\mathcal{D}^{(k)}$ ($1\le k\le K$) and $\mathcal{D}^{(T)}$, respectively.
In representative domain adaptation, the input-space distributions $\mathcal{P}^{(k)}$ and $\mathcal{P}^{(T)}$ differ from each other, or the labeling functions $g^{(k)}$ and $g^{(T)}$ differ from each other, or both cases occur. There are some (but not enough) samples $\{\mathbf{z}^{(T)}_n\}_{n=1}^{N_T}$ drawn from the target domain in addition to a large amount of i.i.d. samples $\{\mathbf{z}^{(k)}_n\}_{n=1}^{N_k}$ drawn from each source domain $\mathcal{D}^{(k)}$ with $N_k \gg N_T$ for any $1\le k\le K$.
Given two parameters $\tau \in [0,1)$ and $\mathbf{w} = (w_1,\ldots,w_K) \in [0,1]^K$ with $\sum_{k=1}^{K} w_k = 1$, denote the convex combination of the weighted empirical risk of the multiple-source data and the empirical risk of the target data as:

(1) $\widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ g) := \tau\,\widehat{\mathrm{E}}^{(T)}(\ell\circ g) + (1-\tau)\sum_{k=1}^{K} w_k\,\widehat{\mathrm{E}}^{(k)}(\ell\circ g),$

where $\ell$ is the loss function,

(2) $\widehat{\mathrm{E}}^{(T)}(\ell\circ g) := \frac{1}{N_T}\sum_{n=1}^{N_T} \ell\big(g(\mathbf{x}^{(T)}_n), y^{(T)}_n\big),$

and

(3) $\widehat{\mathrm{E}}^{(k)}(\ell\circ g) := \frac{1}{N_k}\sum_{n=1}^{N_k} \ell\big(g(\mathbf{x}^{(k)}_n), y^{(k)}_n\big).$
Given a function class $\mathcal{G}$, we denote $\widehat{g}$ as the function that minimizes the empirical quantity $\widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ g)$ over $\mathcal{G}$, and it is expected that $\widehat{g}$ will perform well on the target expected risk:

(4) $\mathrm{E}^{(T)}(\ell\circ g) := \mathbb{E}^{(T)}\,\ell\big(g(\mathbf{x}^{(T)}), y^{(T)}\big),$

that is, $\widehat{g}$ approximates the labeling function $g^{(T)}$ as precisely as possible.
Note that setting $\tau = 0$ recovers domain adaptation with multiple sources [see 14, 22, 32]; setting $K = 1$ recovers domain adaptation combining source and target data [see 2, 7, 32]; and setting $\tau = 0$ and $K = 1$ recovers the basic domain adaptation with one single source [see 3].
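As a concrete numerical sketch, the convex combination in (1) can be computed as follows. The squared loss, the data, and all parameter values below are hypothetical illustrations; the framework itself does not prescribe a particular loss.

```python
import numpy as np

def combined_empirical_risk(g, source_sets, target_set, w, tau):
    """Convex combination (1) of the weighted source empirical risks (3)
    and the target empirical risk (2). `g` is a hypothesis; the squared
    loss below is an illustrative choice, not the paper's prescription."""
    loss = lambda pred, y: (pred - y) ** 2
    # weighted empirical risk over the K source sample sets, as in (3)
    source_risk = sum(
        w_k * np.mean([loss(g(x), y) for x, y in S_k])
        for w_k, S_k in zip(w, source_sets)
    )
    # empirical risk over the (small) target sample, as in (2)
    target_risk = np.mean([loss(g(x), y) for x, y in target_set])
    return tau * target_risk + (1 - tau) * source_risk

# Hypothetical toy data: two source domains, one small target sample
sources = [[(1.0, 0.0), (2.0, 1.0)], [(0.0, 0.0)]]
target = [(1.0, 2.0)]
print(combined_empirical_risk(lambda x: x, sources, target, w=[0.5, 0.5], tau=0.2))
```

Setting `tau=0` reduces the value to the weighted multiple-source risk alone, matching the special case noted above.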
In this learning process, we are mainly interested in the following two types of quantities:

• $\sup_{g\in\mathcal{G}}\big|\mathrm{E}^{(T)}(\ell\circ g) - \widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ g)\big|$, which corresponds to the estimation of the expected risk in the target domain $\mathcal{D}^{(T)}$ from the empirical quantity $\widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ g)$;

• $\mathrm{E}^{(T)}(\ell\circ \widehat{g}) - \mathrm{E}^{(T)}(\ell\circ g_*)$, which corresponds to the performance of the algorithm for domain adaptation, where $g_*$ is the function that minimizes the expected risk $\mathrm{E}^{(T)}(\ell\circ g)$ over $\mathcal{G}$.
Since $\widehat{g}$ minimizes $\widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ g)$ over $\mathcal{G}$, we have

$\mathrm{E}^{(T)}(\ell\circ\widehat{g}) - \mathrm{E}^{(T)}(\ell\circ g_*) = \big[\mathrm{E}^{(T)}(\ell\circ\widehat{g}) - \widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ\widehat{g})\big] + \big[\widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ\widehat{g}) - \widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ g_*)\big] + \big[\widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ g_*) - \mathrm{E}^{(T)}(\ell\circ g_*)\big],$

where the middle bracket is nonpositive, and thus

$\mathrm{E}^{(T)}(\ell\circ\widehat{g}) - \mathrm{E}^{(T)}(\ell\circ g_*) \le 2\sup_{g\in\mathcal{G}}\big|\mathrm{E}^{(T)}(\ell\circ g) - \widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ g)\big|.$
This shows that the asymptotic behaviors of the aforementioned two quantities, when the sample numbers $N_1,\ldots,N_K,N_T$ (or part of them) go to infinity, can both be described by the supremum

(5) $\sup_{g\in\mathcal{G}}\big|\mathrm{E}^{(T)}(\ell\circ g) - \widehat{\mathrm{E}}^{\tau}_{\mathbf{w}}(\ell\circ g)\big|,$

which is the so-called generalization bound of the learning process for representative domain adaptation.
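The supremum in (5) can be illustrated numerically. The sketch below brute-forces it over a small finite class of threshold classifiers on synthetic data; the distributions, sample sizes, and the value of $\tau$ are hypothetical choices, and the target expected risk is approximated from a large held-out target sample rather than computed exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one source domain and a shifted target domain,
# threshold classifiers g_t(x) = 1[x > t], absolute-value loss.
thresholds = np.linspace(-1.0, 2.0, 31)
loss = lambda pred, y: np.abs(pred - y)

def risk(xs, ys, t):
    """Empirical risk of the threshold classifier 1[x > t] on (xs, ys)."""
    return loss((xs > t).astype(float), ys).mean()

x_src = rng.normal(0.0, 1.0, 200);    y_src = (x_src > 0.0).astype(float)
x_tgt = rng.normal(0.5, 1.0, 20);     y_tgt = (x_tgt > 0.5).astype(float)
x_big = rng.normal(0.5, 1.0, 100000); y_big = (x_big > 0.5).astype(float)  # proxy for E^{(T)}

tau = 0.3  # weight on the (small) target sample
dev = max(
    abs(risk(x_big, y_big, t)
        - (tau * risk(x_tgt, y_tgt, t) + (1 - tau) * risk(x_src, y_src, t)))
    for t in thresholds
)
print(f"estimated sup deviation: {dev:.3f}")
```

Because the source distribution is shifted relative to the target, the deviation does not vanish even as the source sample grows, which is exactly the phenomenon the bounds in Section 5 quantify.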
3 Integral Probability Metric
In the theoretical analysis of domain adaptation, one of the main challenges is to find a quantity that measures the difference between the source domain $\mathcal{D}^{(S)}$ and the target domain $\mathcal{D}^{(T)}$; one can then use this quantity to derive generalization bounds for domain adaptation [see 21, 22, 2, 3]. Different from the existing works [e.g. 21, 22, 2, 3], we use the integral probability metric to measure the difference between $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$. We also discuss the relationship between the integral probability metric and other quantities proposed in existing works: the $\mathcal{H}\Delta\mathcal{H}$-divergence and the discrepancy distance [see 2, 20].
3.1 Integral Probability Metric
Ben-David et al. [2, 3] introduced the $\mathcal{H}\Delta\mathcal{H}$-divergence to derive generalization bounds based on the VC dimension under the condition of "$\lambda$-close". Mansour et al. [20] obtained generalization bounds based on the Rademacher complexity by using the discrepancy distance. Both quantities aim to measure the difference between two input-space distributions $\mathcal{P}^{(S)}$ and $\mathcal{P}^{(T)}$. Moreover, Mansour et al. [22] used the Rényi divergence to measure the distance between two distributions. In this paper, we use the following quantity to measure the difference between the distributions of the source and the target domains:
Definition 3.1
Given two domains $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$, let $\mathbf{z}^{(S)}$ and $\mathbf{z}^{(T)}$ be the random variables taking values from $\mathcal{Z}^{(S)}$ and $\mathcal{Z}^{(T)}$, respectively. Let $\mathcal{F}$ be a function class. We define

(9) $D_{\mathcal{F}}(S,T) := \sup_{f\in\mathcal{F}}\big|\mathbb{E}^{(S)} f(\mathbf{z}^{(S)}) - \mathbb{E}^{(T)} f(\mathbf{z}^{(T)})\big|,$

where the expectations $\mathbb{E}^{(S)}$ and $\mathbb{E}^{(T)}$ are taken with respect to the distributions of the domains $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$, respectively.
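To make Definition 3.1 concrete, the following sketch computes a plug-in estimate of the integral probability metric for a finite function class, replacing the expectations by sample means. The function class and the sampling distributions are hypothetical illustrations; restricting the supremum to a finite class is a simplification (see Sriperumbudur et al. [27] for principled empirical estimators).

```python
import numpy as np

def empirical_ipm(funcs, sample_s, sample_t):
    """Plug-in estimate of the integral probability metric in (9):
    expectations are replaced by sample means, and the supremum is
    taken over a finite function class (an illustrative restriction)."""
    return max(abs(np.mean(f(sample_s)) - np.mean(f(sample_t))) for f in funcs)

# Hypothetical finite function class and samples from two shifted Gaussians
funcs = [np.sin, np.cos, lambda z: z, lambda z: z ** 2]
rng = np.random.default_rng(1)
s = rng.normal(0.0, 1.0, 50000)
t = rng.normal(0.3, 1.0, 50000)
print(empirical_ipm(funcs, s, t))  # strictly positive: the distributions differ
```

When the two samples coincide, the estimate is exactly zero, mirroring the property that $D_{\mathcal{F}}(S,T)$ vanishes when the two domains share the same distribution.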
The quantity $D_{\mathcal{F}}(S,T)$ is termed the integral probability metric, which plays an important role in probability theory for measuring the difference between two probability distributions [see 33, 25, 24, 26]. Recently, Sriperumbudur et al. [27] gave a further investigation and proposed an empirical method to compute the integral probability metric. As mentioned by Müller [24] [see page 432], the quantity $D_{\mathcal{F}}(S,T)$ is a semimetric, and it is a metric if and only if the function class $\mathcal{F}$ separates the set of all signed measures with $\mu(\mathcal{Z}) = 0$. Namely, according to Definition 3.1, given a nontrivial function class $\mathcal{F}$, the quantity $D_{\mathcal{F}}(S,T)$ is equal to zero if the domains $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$ have the same distribution.

3.2 $\mathcal{H}\Delta\mathcal{H}$-Divergence and Discrepancy Distance
Before the formal discussion, we briefly introduce the related quantities proposed in the previous works of Ben-David et al. [2] and Mansour et al. [20].
3.2.1 $\mathcal{H}\Delta\mathcal{H}$-Divergence
In classification tasks, by setting $\ell$ as the absolute-value loss function ($\ell(g(\mathbf{x}), y) = |g(\mathbf{x}) - y|$), Ben-David et al. [2] introduced a variant of the $\mathcal{H}\Delta\mathcal{H}$-divergence:

$d_{\mathcal{H}\Delta\mathcal{H}}\big(\mathcal{P}^{(S)}, \mathcal{P}^{(T)}\big) := 2\sup_{g_1,g_2\in\mathcal{G}}\Big|\Pr_{\mathbf{x}\sim\mathcal{P}^{(S)}}\big[g_1(\mathbf{x})\ne g_2(\mathbf{x})\big] - \Pr_{\mathbf{x}\sim\mathcal{P}^{(T)}}\big[g_1(\mathbf{x})\ne g_2(\mathbf{x})\big]\Big|,$

with the condition of "$\lambda$-close": there exists a $\lambda \ge 0$ such that

(11) $\inf_{g\in\mathcal{G}}\big\{\mathrm{E}^{(S)}(\ell\circ g) + \mathrm{E}^{(T)}(\ell\circ g)\big\} \le \lambda.$

One of the main results in Ben-David et al. [2] can be summarized as follows: they derived VC-dimension-based upper bounds of

(12) $\mathrm{E}^{(T)}(\ell\circ\widehat{g}) - \mathrm{E}^{(T)}(\ell\circ g_*)$

by using the summation of $\lambda$ and $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{P}^{(S)}, \mathcal{P}^{(T)})$, where $g_*$ minimizes the expected risk $\mathrm{E}^{(T)}(\ell\circ g)$ over $\mathcal{G}$ [see 2, Theorems 3 and 4].
There are two points that should be noted:

Recalling (11), the condition of "$\lambda$-close" actually places a restriction among the function class $\mathcal{G}$ and the labeling functions $g^{(S)}$ and $g^{(T)}$. In the optimistic case, if both $g^{(S)}$ and $g^{(T)}$ are contained in the function class $\mathcal{G}$ and are the same, then $\lambda = 0$.
3.2.2 Discrepancy Distance
In both classification and regression tasks, given a function class $\mathcal{G}$ and a loss function $\ell$, Mansour et al. [20] defined the discrepancy distance as

(13) $\mathrm{disc}_{\ell}\big(\mathcal{P}^{(S)}, \mathcal{P}^{(T)}\big) := \sup_{g_1,g_2\in\mathcal{G}}\Big|\mathbb{E}^{(S)}\,\ell\big(g_1(\mathbf{x}^{(S)}), g_2(\mathbf{x}^{(S)})\big) - \mathbb{E}^{(T)}\,\ell\big(g_1(\mathbf{x}^{(T)}), g_2(\mathbf{x}^{(T)})\big)\Big|,$

and then used this quantity to obtain generalization bounds based on the Rademacher complexity. As mentioned by Mansour et al. [20], the quantities $d_{\mathcal{H}\Delta\mathcal{H}}$ and $\mathrm{disc}_{\ell}$ match in the setting of classification tasks with $\ell$ being the absolute-value loss function, while the usage of $\mathrm{disc}_{\ell}$ does not require the "$\lambda$-close" condition. Instead, the authors achieved an upper bound of $\mathrm{E}^{(T)}(\ell\circ\widehat{g}) - \mathrm{E}^{(T)}(\ell\circ g^{(T)}_*)$ by using a summation involving $\mathrm{disc}_{\ell}(\mathcal{P}^{(S)}, \mathcal{P}^{(T)})$, where $g^{(S)}_*$ (resp. $g^{(T)}_*$) minimizes the expected risk $\mathrm{E}^{(S)}(\ell\circ g)$ (resp. $\mathrm{E}^{(T)}(\ell\circ g)$) over $\mathcal{G}$. It can be equivalently rewritten as follows [see 20, Theorems 8 and 9]: the quantity

(14) $\mathrm{E}^{(T)}(\ell\circ\widehat{g}) - \mathrm{E}^{(T)}(\ell\circ g^{(T)}_*)$

can be bounded by using the summation

(15) $\mathrm{disc}_{\ell}\big(\mathcal{P}^{(S)}, \mathcal{P}^{(T)}\big) + \mathbb{E}^{(S)}\,\ell\big(g^{(S)}_*(\mathbf{x}^{(S)}), g^{(T)}_*(\mathbf{x}^{(S)})\big).$
There are also two points that should be noted:
Next, we discuss the relationship between $D_{\mathcal{F}}(S,T)$ and the aforementioned two quantities: the $\mathcal{H}\Delta\mathcal{H}$-divergence and the discrepancy distance. Recalling Definition 3.1, since there is no limitation on the function class $\mathcal{F}$, the integral probability metric $D_{\mathcal{F}}(S,T)$ can be used in both classification and regression tasks. Therefore, we only consider the relationship between the integral probability metric and the discrepancy distance $\mathrm{disc}_{\ell}$.
3.3 Relationship between $D_{\mathcal{F}}(S,T)$ and $\mathrm{disc}_{\ell}(\mathcal{P}^{(S)}, \mathcal{P}^{(T)})$
From Definition 3.1 and (9), the integral probability metric $D_{\mathcal{F}}(S,T)$ measures the difference between the distributions of the two domains $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$. However, as addressed in Section 2, if a domain differs from another domain, there are three possibilities: the input-space distribution $\mathcal{P}^{(S)}$ differs from $\mathcal{P}^{(T)}$, or the labeling function $g^{(S)}$ differs from $g^{(T)}$, or both of them occur. Therefore, it is necessary to consider two kinds of differences: the difference between the input-space distributions $\mathcal{P}^{(S)}$ and $\mathcal{P}^{(T)}$, and the difference between the labeling functions $g^{(S)}$ and $g^{(T)}$. Next, we will show that the integral probability metric $D_{\mathcal{F}}(S,T)$ can be bounded by two separate quantities that measure these two differences, respectively.
As shown in (13), the quantity $\mathrm{disc}_{\ell}(\mathcal{P}^{(S)}, \mathcal{P}^{(T)})$ actually measures the difference between the input-space distributions $\mathcal{P}^{(S)}$ and $\mathcal{P}^{(T)}$. Moreover, we introduce another quantity to measure the difference between the labeling functions $g^{(S)}$ and $g^{(T)}$:
Definition 3.2
Given a loss function $\ell$ and a function class $\mathcal{G}$, we define

(16) $Q_{(\ell,\mathcal{G})}\big(g^{(S)}, g^{(T)}\big) := \sup_{g\in\mathcal{G}}\Big|\mathbb{E}^{(T)}\,\ell\big(g(\mathbf{x}^{(T)}), g^{(S)}(\mathbf{x}^{(T)})\big) - \mathbb{E}^{(T)}\,\ell\big(g(\mathbf{x}^{(T)}), g^{(T)}(\mathbf{x}^{(T)})\big)\Big|.$
Note that if both the loss function $\ell$ and the function class $\mathcal{G}$ are nontrivial, the quantity $Q_{(\ell,\mathcal{G})}$ is a (semi)metric between the labeling functions $g^{(S)}$ and $g^{(T)}$. In fact, it is not hard to verify that $Q_{(\ell,\mathcal{G})}$ satisfies the triangle inequality and is equal to zero if $g^{(S)}$ and $g^{(T)}$ match.
By combining (9), (13) and (16), we have

$D_{\mathcal{F}}(S,T) \le \sup_{g\in\mathcal{G}}\Big|\mathbb{E}^{(S)}\,\ell\big(g(\mathbf{x}^{(S)}), g^{(S)}(\mathbf{x}^{(S)})\big) - \mathbb{E}^{(T)}\,\ell\big(g(\mathbf{x}^{(T)}), g^{(S)}(\mathbf{x}^{(T)})\big)\Big| + \sup_{g\in\mathcal{G}}\Big|\mathbb{E}^{(T)}\,\ell\big(g(\mathbf{x}^{(T)}), g^{(S)}(\mathbf{x}^{(T)})\big) - \mathbb{E}^{(T)}\,\ell\big(g(\mathbf{x}^{(T)}), g^{(T)}(\mathbf{x}^{(T)})\big)\Big|,$

and thus

(17) $D_{\mathcal{F}}(S,T) \le \mathrm{disc}_{\ell}\big(\mathcal{P}^{(S)}, \mathcal{P}^{(T)}\big) + Q_{(\ell,\mathcal{G})}\big(g^{(S)}, g^{(T)}\big),$

which implies that the integral probability metric can be bounded by the summation of the discrepancy distance and the quantity $Q_{(\ell,\mathcal{G})}$, which measure the difference between the input-space distributions $\mathcal{P}^{(S)}$ and $\mathcal{P}^{(T)}$ and the difference between the labeling functions $g^{(S)}$ and $g^{(T)}$, respectively.
Compared with (11) and (15), the integral probability metric $D_{\mathcal{F}}(S,T)$ provides a new mechanism to capture the difference between two domains, where the difference between the labeling functions $g^{(S)}$ and $g^{(T)}$ is measured by the (semi)metric $Q_{(\ell,\mathcal{G})}$.
Remark 3.1
As shown in (9) and (13), the integral probability metric $D_{\mathcal{F}}(S,T)$ takes the supremum of a difference of expectations over a single function class $\mathcal{F}$, while the discrepancy distance takes the supremum over two functions $g_1, g_2 \in \mathcal{G}$ simultaneously. Consider a specific domain adaptation situation: the labeling function $g^{(S)}$ is close to $g^{(T)}$, and meanwhile both of them are contained in the function class $\mathcal{G}$. In this case, $D_{\mathcal{F}}(S,T)$ can be very small even though $\mathrm{disc}_{\ell}(\mathcal{P}^{(S)}, \mathcal{P}^{(T)})$ is large. Thus, the integral probability metric is more suitable for such a domain adaptation setting than the discrepancy distance.
4 Uniform Entropy Number and Rademacher Complexity
In this section, we introduce the definitions of the uniform entropy number and the Rademacher complexity, respectively.
4.1 Uniform Entropy Number
Generally, the generalization bound of a certain learning process is achieved by incorporating a complexity measure of function classes, e.g., the covering number, the VC dimension or the Rademacher complexity. The results of this paper are based on the uniform entropy number, which is derived from the concept of the covering number, and we refer to Mendelson [23] for more details about the uniform entropy number. The covering number of a function class is defined as follows:
Definition 4.1
Let $\mathcal{F}$ be a function class and let $d$ be a metric on $\mathcal{F}$. For any $\xi > 0$, the covering number of $\mathcal{F}$ at radius $\xi$ with respect to the metric $d$, denoted by $\mathcal{N}(\mathcal{F}, \xi, d)$, is the minimum number of balls of radius $\xi$ (under $d$) needed to cover $\mathcal{F}$.
In some classical results of statistical learning theory, the covering number is applied by letting $d$ be a distribution-dependent metric. For example, as shown in Theorem 2.3 of Mendelson [23], one can set $d$ as the $L_1$ norm induced by the sample distribution and then derive the generalization bound of the i.i.d. learning process by incorporating the expectation of the covering number. However, in the situation of domain adaptation, we only know the information of the source domain, while this expectation depends on the distributions of both the source and the target domains. Therefore, the covering number is no longer applicable to our scheme for obtaining the generalization bounds for representative domain adaptation. In contrast, the uniform entropy number is distribution-free, and thus we choose it as the complexity measure of function classes to derive the generalization bounds.
For clarity of presentation, we give some useful notations for the following discussion. For any $1\le k\le K$, given a sample set $\mathbf{Z}^{(k)} = \{\mathbf{z}^{(k)}_n\}_{n=1}^{N_k}$ drawn from the source domain $\mathcal{D}^{(k)}$, we denote $\mathbf{Z}'^{(k)} = \{\mathbf{z}'^{(k)}_n\}_{n=1}^{N_k}$ as the ghost sample set drawn from $\mathcal{D}^{(k)}$ such that the ghost sample $\mathbf{z}'^{(k)}_n$ has the same distribution as that of $\mathbf{z}^{(k)}_n$ for any $1\le n\le N_k$ and any $1\le k\le K$. Again, given a sample set $\mathbf{Z}^{(T)} = \{\mathbf{z}^{(T)}_n\}_{n=1}^{N_T}$ drawn from the target domain $\mathcal{D}^{(T)}$, let $\mathbf{Z}'^{(T)}$ be the ghost sample set of $\mathbf{Z}^{(T)}$. Given any $\tau \in [0,1)$ and any $\mathbf{w} = (w_1,\ldots,w_K)$ with $\sum_{k=1}^{K} w_k = 1$, we introduce a variant of the $\ell_1$ norm over the samples and their ghosts: for any $f\in\mathcal{F}$,

$\|f\|_{\ell^{\tau,\mathbf{w}}_1} := \frac{\tau}{2N_T}\sum_{n=1}^{N_T}\Big(\big|f(\mathbf{z}^{(T)}_n)\big| + \big|f(\mathbf{z}'^{(T)}_n)\big|\Big) + (1-\tau)\sum_{k=1}^{K}\frac{w_k}{2N_k}\sum_{n=1}^{N_k}\Big(\big|f(\mathbf{z}^{(k)}_n)\big| + \big|f(\mathbf{z}'^{(k)}_n)\big|\Big).$

It is noteworthy that this variant of the $\ell_1$ norm is still a norm on the functional space, which can be easily verified by using the definition of norm, so we omit the verification here. In the situation of representative domain adaptation, by setting the metric $d$ as $\ell^{\tau,\mathbf{w}}_1$, we then define the uniform entropy number of $\mathcal{F}$ with respect to the metric $\ell^{\tau,\mathbf{w}}_1$ as

(18) $\ln\mathcal{N}^{\tau,\mathbf{w}}_1\big(\mathcal{F}, \xi, 2N'\big) := \sup_{\mathbf{Z}^{(T)},\,\mathbf{Z}'^{(T)},\,\{\mathbf{Z}^{(k)},\mathbf{Z}'^{(k)}\}_{k=1}^{K}} \ln\mathcal{N}\big(\mathcal{F}, \xi, \ell^{\tau,\mathbf{w}}_1\big)$

with $N' := N_T + \sum_{k=1}^{K} N_k$.
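The covering number of Definition 4.1 can be upper-bounded by a simple greedy construction, which is enough to get a feel for the uniform entropy number. The sketch below evaluates a hypothetical finite class of threshold functions on a fixed sample and covers it in the plain empirical $\ell_1$ metric, an unweighted stand-in for the variant norm used in (18).

```python
import numpy as np

def covering_number(points, radius, metric):
    """Greedy cover: an upper bound on the covering number N(F, xi, d) of
    Definition 4.1, for a finite class evaluated on a fixed sample. Each
    uncovered point in turn becomes a center and absorbs its neighbors."""
    uncovered = list(range(len(points)))
    centers = 0
    while uncovered:
        c = uncovered[0]
        uncovered = [i for i in uncovered
                     if metric(points[c], points[i]) > radius]
        centers += 1
    return centers

# Hypothetical example: threshold classifiers 1[x > t] evaluated on a
# fixed sample, compared in the empirical L1 metric
rng = np.random.default_rng(0)
sample = rng.uniform(-1.0, 1.0, 100)
evals = [(sample > t).astype(float) for t in np.linspace(-1.0, 1.0, 201)]
l1 = lambda a, b: float(np.mean(np.abs(a - b)))
print(covering_number(evals, 0.1, l1))
```

The greedy cover is at most a constant factor larger than the optimal one at half the radius, so it suffices for the order-of-magnitude entropy estimates the bounds rely on.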
4.2 Rademacher Complexity
The Rademacher complexity is one of the most frequently used complexity measures of function classes, and we refer to Van der Vaart and Wellner [28] and Mendelson [23] for details.
Definition 4.2
Let $\mathcal{F}$ be a function class and let $\{\mathbf{z}_n\}_{n=1}^{N}$ be a sample set drawn from $\mathcal{Z}$. Denote $\{\sigma_n\}_{n=1}^{N}$ as a set of random variables independently taking either value from $\{-1, +1\}$ with equal probability. The Rademacher complexity of $\mathcal{F}$ is defined as

(19) $\mathcal{R}(\mathcal{F}) := \mathbb{E}\,\sup_{f\in\mathcal{F}}\Big|\frac{1}{N}\sum_{n=1}^{N}\sigma_n f(\mathbf{z}_n)\Big|,$

with its empirical version given by

$\widehat{\mathcal{R}}(\mathcal{F}) := \mathbb{E}_{\sigma}\,\sup_{f\in\mathcal{F}}\Big|\frac{1}{N}\sum_{n=1}^{N}\sigma_n f(\mathbf{z}_n)\Big|,$

where $\mathbb{E}$ stands for the expectation taken with respect to all random variables $\{\mathbf{z}_n\}$ and $\{\sigma_n\}$, and $\mathbb{E}_{\sigma}$ stands for the expectation taken only with respect to the random variables $\{\sigma_n\}$.
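A Monte Carlo estimate of the empirical Rademacher complexity can be sketched as follows for a finite function class. The setup is hypothetical, and we use the convention that takes the absolute value of the empirical correlation (conventions with and without the absolute value both appear in the literature).

```python
import numpy as np

def empirical_rademacher(func_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity for a
    finite function class. `func_values` is an (|F|, N) array: each row
    is one function evaluated on the fixed sample z_1, ..., z_N."""
    rng = np.random.default_rng(seed)
    m, n = func_values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher signs
    # correlations: shape (n_draws, |F|); sup over the class, then average
    corr = sigma @ func_values.T / n
    return float(np.mean(np.max(np.abs(corr), axis=1)))

# Hypothetical class: 50 random sign functions on a sample of size 200;
# Massart's finite-class lemma bounds the result by sqrt(2 ln(2|F|)/N) ~ 0.21
rng = np.random.default_rng(1)
F = np.sign(rng.normal(size=(50, 200)))
print(empirical_rademacher(F))
```

The estimate shrinks roughly like $\sqrt{\ln|\mathcal{F}|/N}$ as the sample grows, which is the behavior the Rademacher-complexity-based bounds of Section 5 exploit.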
5 Generalization Bounds for Representative Domain Adaptation
Based on the uniform entropy number defined in (18), we first present two types of generalization bounds for representative domain adaptation: Hoeffding-type and Bennett-type, which are derived from the Hoeffding-type deviation inequality and the Bennett-type deviation inequality, respectively. Moreover, we obtain the bounds based on the Rademacher complexity via the McDiarmid-type deviation inequality.
5.1 Hoeffdingtype Generalization Bounds
The following theorem presents the Hoeffding-type generalization bound for representative domain adaptation:
Theorem 5.1
Assume that $\mathcal{F}$ is a function class consisting of bounded functions with the range $[a, b]$. Let $\tau \in [0,1)$ and $\mathbf{w} = (w_1,\ldots,w_K)$ with $\sum_{k=1}^{K} w_k = 1$. Then, given any $\xi > D_{\mathcal{F}}(S,T)$, we have, for sample sizes $N_1,\ldots,N_K$ and $N_T$ large enough, with probability at least $1 - \epsilon$,

(20)