# Generalization Bounds for Representative Domain Adaptation

In this paper, we propose a novel framework to analyze the theoretical properties of the learning process for a representative type of domain adaptation, which combines data from multiple sources and one target (briefly called representative domain adaptation). In particular, we use the integral probability metric to measure the difference between the distributions of two domains, and compare it with the H-divergence and the discrepancy distance. We develop Hoeffding-type, Bennett-type and McDiarmid-type deviation inequalities for multiple domains, and then present a symmetrization inequality for representative domain adaptation. Next, we use the derived inequalities to obtain Hoeffding-type and Bennett-type generalization bounds, both of which are based on the uniform entropy number. Moreover, we present generalization bounds based on the Rademacher complexity. Finally, we analyze the asymptotic convergence and the rate of convergence of the learning process for representative domain adaptation. We discuss the factors that affect the asymptotic behavior of the learning process, and the numerical experiments support our theoretical findings. We also compare our results with the existing results of domain adaptation and the classical results under the same-distribution assumption.


## 1 Introduction

The generalization bound measures the probability that a function, chosen from a function class by an algorithm, has a sufficiently small error, and it plays an important role in statistical learning theory [see 29, 12]. Generalization bounds have been widely used to study the consistency of the ERM-based learning process [29], the asymptotic convergence of empirical processes [28] and the learnability of learning models [10]. Generally, there are three essential aspects to obtaining the generalization bounds of a specific learning process: complexity measures of function classes, deviation (or concentration) inequalities and symmetrization inequalities related to the learning process. For example, Van der Vaart and Wellner [28] presented the generalization bounds based on the Rademacher complexity and the covering number, respectively. Vapnik [29] gave the generalization bounds based on the Vapnik-Chervonenkis (VC) dimension. Bartlett et al. [1] proposed the local Rademacher complexity and obtained a sharp generalization bound for a particular function class. Hussain and Shawe-Taylor [16] showed improved loss bounds for multiple kernel learning. Zhang [31] analyzed the Bennett-type generalization bounds of the i.i.d. learning process.

It is noteworthy that the aforementioned results of statistical learning theory are all built under the assumption that training and test data are drawn from the same distribution (briefly called the same-distribution assumption). This assumption may not be valid in situations where training and test data have different distributions, which arise in many practical applications including speech recognition [17, 8]. Domain adaptation has recently been proposed to handle this situation and it aims to apply a learning model, trained by using the samples drawn from a certain domain (source domain), to the samples drawn from another domain (target domain) with a different distribution [see 6, 30, 9, 2, 5]. There have been some research works on the theoretical analysis of two types of domain adaptation. In the first type, the learner receives training data from several source domains, known as domain adaptation with multiple sources [see 2, 13, 14, 21, 22, 32]. In the second type, the learner minimizes a convex combination of the source and the target empirical risks, termed domain adaptation combining source and target data [see 7, 2, 32].

Without loss of generality, this paper is mainly concerned with a more representative (or general) type of domain adaptation, which combines data from multiple sources and one target (briefly called representative domain adaptation). Evidently, it covers both of the aforementioned two types: domain adaptation with multiple sources and domain adaptation combining source and target. Thus, the results of this paper are more general than the previous works, and some of the existing results can be regarded as special cases of this paper [see 32]. We briefly summarize the main contributions of this paper as follows.

### 1.1 Overview of Main Results

In this paper, we present a new framework to obtain the generalization bounds of the learning process for representative domain adaptation. Based on the resulting bounds, we then analyze the asymptotic properties of the learning process. There are four major aspects in the framework: (i) the quantity measuring the difference between two domains; (ii) the complexity measure of function classes; (iii) the deviation inequalities for multiple domains; (iv) the symmetrization inequality for representative domain adaptation.

As shown in some previous works [22, 20, 2], one of the major challenges in the theoretical analysis of domain adaptation is to measure the difference between two domains. Different from the previous works, we use the integral probability metric to measure the difference between the distributions of two domains. Moreover, we also give a comparison with the quantities proposed in the previous works.

Generally, in order to obtain the generalization bounds of a learning process, one needs to develop the related deviation (or concentration) inequalities of the learning process. Here, we use a martingale method to develop the related Hoeffding-type, Bennett-type and McDiarmid-type deviation inequalities for multiple domains, respectively. Moreover, in the situation of domain adaptation, since the source domain differs from the target domain, the desired symmetrization inequality for domain adaptation should incorporate some quantity to reflect the difference. From this point of view, we then obtain the related symmetrization inequality incorporating the integral probability metric that measures the difference between the distributions of the source and the target domains.

By applying the derived inequalities, we obtain two types of generalization bounds of the learning process for representative domain adaptation: Hoeffding-type and Bennett-type, both of which are based on the uniform entropy number. Moreover, we use the McDiarmid-type deviation inequality to obtain the generalization bounds based on the Rademacher complexity. It is noteworthy that, based on the relationship between the integral probability metric and the discrepancy distance (or the H-divergence), the proposed framework can also lead to generalization bounds incorporating the discrepancy distance (or the H-divergence) [see Section 3 and Remark 5.1].

Based on the resulting generalization bounds, we study the asymptotic convergence and the rate of convergence of the learning process for representative domain adaptation. In particular, we analyze the factors that affect the asymptotic behavior of the learning process and discuss the choices of parameters in the situation of representative domain adaptation. The numerical experiments also support our theoretical findings. Meanwhile, we compare our results with the existing results of domain adaptation and the related results under the same-distribution assumption. Note that representative domain adaptation refers to a more general situation that covers both domain adaptation with multiple sources and domain adaptation combining source and target. Thus, our results include many existing works as special cases. Additionally, our analysis can be applied to the key quantities studied in Mansour et al. [20] and Ben-David et al. [2] [see Section 3].

### 1.2 Organization of the Paper

The rest of this paper is organized as follows. Section 2 introduces the problem studied in this paper. Section 3 introduces the integral probability metric and then gives a comparison with other quantities. In Section 4, we introduce the uniform entropy number and the Rademacher complexity. Section 5 provides the generalization bounds for representative domain adaptation. In Section 6, we analyze the asymptotic behavior of the learning process for representative domain adaptation. Section 7 shows the numerical experiments supporting our theoretical findings. We brief the related works in Section 8 and the last section concludes the paper. In Appendix A, we present the deviation inequalities and the symmetrization inequality, and all proofs are given in Appendix B.

## 2 Problem Setup

We denote $S_k$ ($1\le k\le K$) and $T$ as the $k$-th source domain and the target domain, respectively. Let $D^{(S_k)}$ and $D^{(T)}$ stand for the distributions of the input spaces of $S_k$ ($1\le k\le K$) and $T$, respectively. Denote $g^{(S_k)}_*$ and $g^{(T)}_*$ as the labeling functions of $S_k$ ($1\le k\le K$) and $T$, respectively.

In representative domain adaptation, the input-space distributions $D^{(S_k)}$ and $D^{(T)}$ differ from each other, or the labeling functions $g^{(S_k)}_*$ and $g^{(T)}_*$ differ from each other, or both cases occur. There are some (but not enough) samples $\{z^{(T)}_n\}_{n=1}^{N_T}$ drawn from the target domain $T$ in addition to a large amount of i.i.d. samples $\{z^{(k)}_n\}_{n=1}^{N_k}$ drawn from each source domain $S_k$, with $N_T\ll N_k$ for any $1\le k\le K$.

Given two parameters $\tau\in[0,1)$ and $w=(w_1,\ldots,w_K)$ with $\sum_{k=1}^K w_k=1$, denote the convex combination of the weighted empirical risk of the multiple-source data and the empirical risk of the target data as

$$E^\tau_w(\ell\circ g) := \tau E^{(T)}_{N_T}(\ell\circ g) + (1-\tau)E^{(S)}_w(\ell\circ g), \tag{1}$$

where $\ell$ is the loss function,

$$E^{(T)}_{N_T}(\ell\circ g) := \frac{1}{N_T}\sum_{n=1}^{N_T}\ell\big(g(x^{(T)}_n), y^{(T)}_n\big), \tag{2}$$

and

$$E^{(S)}_w(\ell\circ g) := \sum_{k=1}^K w_k E^{(S_k)}_{N_k}(\ell\circ g) = \sum_{k=1}^K\frac{w_k}{N_k}\sum_{n=1}^{N_k}\ell\big(g(x^{(k)}_n), y^{(k)}_n\big). \tag{3}$$

Given a function class $G$, we denote $g^\tau_w$ as the function that minimizes the empirical quantity $E^\tau_w(\ell\circ g)$ over $G$, and it is expected that $g^\tau_w$ will perform well on the target expected risk

$$E^{(T)}(\ell\circ g) = \int\ell\big(g(x^{(T)}), y^{(T)}\big)\,\mathrm{d}P(z^{(T)}),\quad g\in G, \tag{4}$$

that is, $g^\tau_w$ approximates the labeling function $g^{(T)}_*$ as precisely as possible.

Note that setting $\tau=0$ recovers domain adaptation with multiple sources [see 14, 22, 32]; setting $K=1$ recovers domain adaptation combining source and target data [see 2, 7, 32]; and setting $\tau=0$ and $K=1$ recovers the basic domain adaptation with one single source [see 3].
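As a concrete illustration of (1)–(3) and of how the parameters select the special cases above, here is a minimal sketch in plain Python; the loss, the hypothesis and the sample data are invented for illustration and are not from the paper.

```python
import random

def combined_empirical_risk(loss, g, target, sources, w, tau):
    """Convex combination (1): tau * target empirical risk (2)
    plus (1 - tau) * w-weighted source empirical risk (3)."""
    xs_t, ys_t = target
    risk_target = sum(loss(g(x), y) for x, y in zip(xs_t, ys_t)) / len(xs_t)
    risk_sources = sum(
        w_k * sum(loss(g(x), y) for x, y in zip(xs, ys)) / len(xs)
        for w_k, (xs, ys) in zip(w, sources)
    )
    return tau * risk_target + (1 - tau) * risk_sources

rng = random.Random(0)
loss = lambda p, y: (p - y) ** 2                     # illustrative squared loss
g = lambda x: 2.0 * x                                # a fixed hypothesis
make = lambda n: ([rng.gauss(0, 1) for _ in range(n)],
                  [rng.gauss(0, 1) for _ in range(n)])
target = make(20)                                    # few target samples
sources = [make(200) for _ in range(3)]              # many source samples
w = [0.5, 0.3, 0.2]                                  # weights sum to 1

risk = combined_empirical_risk(loss, g, target, sources, w, tau=0.3)
# tau = 0 drops the target term (domain adaptation with multiple sources);
# a single source with tau > 0 recovers adaptation combining source and target.
```

The interpolation makes the special cases explicit: with `tau=0` only the weighted source risks contribute, and with one source the combination reduces to a two-term mixture of source and target risks.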

In this learning process, we are mainly interested in the following two types of quantities:

• $E^{(T)}(\ell\circ g^\tau_w) - E^\tau_w(\ell\circ g^\tau_w)$, which corresponds to the estimation of the expected risk in the target domain $T$ from the empirical quantity $E^\tau_w(\ell\circ g^\tau_w)$;

• $E^{(T)}(\ell\circ g^\tau_w) - E^{(T)}(\ell\circ\tilde{g}^{(T)}_*)$, which corresponds to the performance of the algorithm for domain adaptation,

where $\tilde{g}^{(T)}_*$ is the function that minimizes the expected risk $E^{(T)}(\ell\circ g)$ over $G$.

Recalling (1) and (4), since

$$E^\tau_w(\ell\circ\tilde{g}^{(T)}_*) - E^\tau_w(\ell\circ g^\tau_w) \ge 0,$$

we have

$$\begin{aligned}
E^{(T)}(\ell\circ g^\tau_w) &= E^{(T)}(\ell\circ g^\tau_w) - E^{(T)}(\ell\circ\tilde{g}^{(T)}_*) + E^{(T)}(\ell\circ\tilde{g}^{(T)}_*)\\
&\le E^\tau_w(\ell\circ\tilde{g}^{(T)}_*) - E^\tau_w(\ell\circ g^\tau_w) + E^{(T)}(\ell\circ g^\tau_w) - E^{(T)}(\ell\circ\tilde{g}^{(T)}_*) + E^{(T)}(\ell\circ\tilde{g}^{(T)}_*)\\
&\le 2\sup_{g\in G}\big|E^{(T)}(\ell\circ g) - E^\tau_w(\ell\circ g)\big| + E^{(T)}(\ell\circ\tilde{g}^{(T)}_*),
\end{aligned}$$

and thus

$$0 \le E^{(T)}(\ell\circ g^\tau_w) - E^{(T)}(\ell\circ\tilde{g}^{(T)}_*) \le 2\sup_{g\in G}\big|E^{(T)}(\ell\circ g) - E^\tau_w(\ell\circ g)\big|.$$

This shows that the asymptotic behaviors of the aforementioned two quantities, when the sample numbers $N_T, N_1, \ldots, N_K$ (or part of them) go to infinity, can both be described by the supremum:

$$\sup_{g\in G}\big|E^{(T)}(\ell\circ g) - E^\tau_w(\ell\circ g)\big|, \tag{5}$$

which is the so-called generalization bound of the learning process for representative domain adaptation.

For convenience, we define the loss function class

$$F := \big\{z\mapsto\ell(g(x), y) : g\in G\big\}, \tag{6}$$

and call $F$ the function class in the rest of this paper. By (1), (2), (3) and (4), we briefly denote, for any $f\in F$,

$$E^{(T)}f := \int f(z^{(T)})\,\mathrm{d}P(z^{(T)}), \tag{7}$$

and

$$E^\tau_w f := \tau E^{(T)}_{N_T}f + (1-\tau)E^{(S)}_w f = \tau E^{(T)}_{N_T}f + (1-\tau)\sum_{k=1}^K w_k E^{(S_k)}_{N_k}f = \frac{\tau}{N_T}\sum_{n=1}^{N_T}f(z^{(T)}_n) + \sum_{k=1}^K\frac{w_k(1-\tau)}{N_k}\sum_{n=1}^{N_k}f(z^{(k)}_n). \tag{8}$$

Thus, we equivalently rewrite the generalization bound (5) as

$$\sup_{f\in F}\big|E^\tau_w f - E^{(T)}f\big|.$$

## 3 Integral Probability Metric

In the theoretical analysis of domain adaptation, one of the main challenges is to find a quantity that measures the difference between the source domain $S$ and the target domain $T$, and then one can use this quantity to derive generalization bounds for domain adaptation [see 21, 22, 2, 3]. Different from the existing works [e.g. 21, 22, 2, 3], we use the integral probability metric to measure the difference between $S$ and $T$. We also discuss the relationship between the integral probability metric and other quantities proposed in existing works: the $H$-divergence and the discrepancy distance [see 2, 20].

### 3.1 Integral Probability Metric

Ben-David et al. [2, 3] introduced the $H$-divergence to derive generalization bounds based on the VC dimension under the condition of "λ-close". Mansour et al. [20] obtained generalization bounds based on the Rademacher complexity by using the discrepancy distance. Both quantities aim to measure the difference between two input-space distributions $D^{(S)}$ and $D^{(T)}$. Moreover, Mansour et al. [22] used the Rényi divergence to measure the distance between two distributions. In this paper, we use the following quantity to measure the difference between the distributions of the source and the target domains:

###### Definition 3.1

Given two domains $S$ and $T$, let $z^{(S)}$ and $z^{(T)}$ be the random variables taking values from $Z^{(S)}$ and $Z^{(T)}$, respectively. Let $F$ be a function class. We define

$$D_F(S,T) := \sup_{f\in F}\big|E^{(S)}f - E^{(T)}f\big|, \tag{9}$$

where the expectations $E^{(S)}$ and $E^{(T)}$ are taken with respect to the distributions of the domains $S$ and $T$, respectively.

The quantity $D_F(S,T)$ is termed the integral probability metric, which plays an important role in probability theory for measuring the difference between two probability distributions [see 33, 25, 24, 26]. Recently, Sriperumbudur et al. [27] gave a further investigation and proposed an empirical method to compute the integral probability metric. As mentioned by Müller [24] [see page 432], the quantity $D_F(S,T)$ is a semimetric, and it is a metric if and only if the function class $F$ separates the set of all signed measures with $\mu(Z)=0$. Namely, according to Definition 3.1, given a non-trivial function class $F$, the quantity $D_F(S,T)$ is equal to zero if the domains $S$ and $T$ have the same distribution.

By (6), the quantity $D_F(S,T)$ can be equivalently rewritten as

$$\begin{aligned}
D_F(S,T) &= \sup_{g\in G}\big|E^{(S)}\ell\big(g(x^{(S)}), y^{(S)}\big) - E^{(T)}\ell\big(g(x^{(T)}), y^{(T)}\big)\big|\\
&= \sup_{g\in G}\big|E^{(S)}\ell\big(g(x^{(S)}), g^{(S)}_*(x^{(S)})\big) - E^{(T)}\ell\big(g(x^{(T)}), g^{(T)}_*(x^{(T)})\big)\big|.
\end{aligned} \tag{10}$$

Next, based on the equivalent form (10), we discuss the relationship between the quantity $D_F(S,T)$ and other quantities including the $H$-divergence and the discrepancy distance.
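Definition 3.1 suggests a simple plug-in estimate of $D_F(S,T)$ when $F$ is finite: replace each expectation in (9) with an empirical mean and take the maximum over the class. A minimal sketch under that assumption (the function class and the sampling distributions are hypothetical, chosen only for illustration):

```python
import random

def empirical_ipm(functions, sample_s, sample_t):
    """Plug-in estimate of D_F(S, T) = sup_{f in F} |E_S f - E_T f|
    for a finite function class, using empirical means."""
    def mean(f, zs):
        return sum(f(z) for z in zs) / len(zs)
    return max(abs(mean(f, sample_s) - mean(f, sample_t)) for f in functions)

rng = random.Random(1)
sample_s = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # source samples
sample_t = [rng.gauss(0.5, 1.0) for _ in range(5000)]   # mean-shifted target

# a small, bounded function class (illustrative clipped-linear functions)
functions = [lambda z, a=a: max(-1.0, min(1.0, a * z)) for a in (0.5, 1.0, 2.0)]

d_hat = empirical_ipm(functions, sample_s, sample_t)
# identical samples give a zero estimate, matching the semimetric property
assert empirical_ipm(functions, sample_s, sample_s) == 0.0
```

Richer choices of $F$ recover familiar metrics as special cases (e.g., bounded-Lipschitz functions give the Wasserstein-type metric discussed in the references above).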

### 3.2 H-Divergence and Discrepancy Distance

Before the formal discussion, we briefly introduce the related quantities proposed in the previous works of Ben-David et al. [2], Mansour et al. [20].

#### 3.2.1 H-Divergence

In classification tasks, by setting $\ell$ as the absolute-value loss function ($\ell(y_1,y_2)=|y_1-y_2|$), Ben-David et al. [2] introduced a variant of the $H$-divergence:

$$d_{H\triangle H}\big(D^{(S)}, D^{(T)}\big) = \sup_{g_1,g_2\in H}\big|E^{(S)}\ell\big(g_1(x^{(S)}), g_2(x^{(S)})\big) - E^{(T)}\ell\big(g_1(x^{(T)}), g_2(x^{(T)})\big)\big|$$

with the condition of "λ-close": there exists a $\lambda>0$ such that

$$\lambda \ge \inf_{g\in G}\Big\{\int\ell\big(g(x^{(S)}), g^{(S)}_*(x^{(S)})\big)\,\mathrm{d}P(z^{(S)}) + \int\ell\big(g(x^{(T)}), g^{(T)}_*(x^{(T)})\big)\,\mathrm{d}P(z^{(T)})\Big\}. \tag{11}$$

One of the main results in Ben-David et al. [2] can be summarized as follows: when $K=1$ or $\tau=0$, the authors derived VC-dimension-based upper bounds of

$$E^{(T)}\ell\big(g^\tau_w(x^{(T)}), g^{(T)}_*(x^{(T)})\big) - E^{(T)}\ell\big(\tilde{g}^{(T)}_*(x^{(T)}), g^{(T)}_*(x^{(T)})\big) \tag{12}$$

by using the summation of $d_{H\triangle H}(D^{(S)}, D^{(T)})$ and $\lambda$, where $\tilde{g}^{(T)}_*$ minimizes the expected risk $E^{(T)}(\ell\circ g)$ over $G$ [see 2, Theorems 3 & 4].

There are two points that should be noted:

• as addressed in Section 2, the quantity (12) can be bounded by the generalization bound (5), and thus the analysis presented in this paper can be applied to study (12);

• recalling (11), the condition of "λ-close" actually places a restriction among the function class $G$ and the labeling functions $g^{(S)}_*$ and $g^{(T)}_*$. In the optimistic case, if both $g^{(S)}_*$ and $g^{(T)}_*$ are contained in the function class $G$ and are the same, then $\lambda=0$.

#### 3.2.2 Discrepancy Distance

In both classification and regression tasks, given a function class $G$ and a loss function $\ell$, Mansour et al. [20] defined the discrepancy distance as

$$\mathrm{disc}_\ell\big(D^{(S)}, D^{(T)}\big) = \sup_{g_1,g_2\in G}\big|E^{(S)}\ell\big(g_1(x^{(S)}), g_2(x^{(S)})\big) - E^{(T)}\ell\big(g_1(x^{(T)}), g_2(x^{(T)})\big)\big|, \tag{13}$$

and then used this quantity to obtain generalization bounds based on the Rademacher complexity. As mentioned by Mansour et al. [20], the quantities $d_{H\triangle H}(D^{(S)}, D^{(T)})$ and $\mathrm{disc}_\ell(D^{(S)}, D^{(T)})$ match in the setting of classification tasks with $\ell$ being the absolute-value loss function, while the usage of $\mathrm{disc}_\ell(D^{(S)}, D^{(T)})$ does not require the "λ-close" condition. Instead, the authors achieved the upper bound of

$$E^{(T)}\ell\big(g(x^{(T)}), g^{(T)}_*(x^{(T)})\big) - E^{(T)}\ell\big(\tilde{g}^{(T)}_*(x^{(T)}), g^{(T)}_*(x^{(T)})\big),\quad\forall g\in G$$

by using the summation

$$\mathrm{disc}_\ell\big(D^{(S)}, D^{(T)}\big) + \frac{1}{N}\sum_{n=1}^{N_S}\ell\big(g(x^{(S)}_n), \tilde{g}^{(S)}_*(x^{(S)})\big) + E^{(S)}\ell\big(\tilde{g}^{(S)}_*(x^{(S)}), \tilde{g}^{(T)}_*(x^{(S)})\big),$$

where $\tilde{g}^{(S)}_*$ (resp. $\tilde{g}^{(T)}_*$) minimizes the expected risk $E^{(S)}(\ell\circ g)$ (resp. $E^{(T)}(\ell\circ g)$) over $G$. It can be equivalently rewritten as follows [see 20, Theorems 8 & 9]: the quantity

$$E^{(T)}\ell\big(g(x^{(T)}), g^{(T)}_*(x^{(T)})\big) - \frac{1}{N}\sum_{n=1}^{N_S}\ell\big(g(x^{(S)}_n), \tilde{g}^{(S)}_*(x^{(S)})\big),\quad\forall g\in G \tag{14}$$

can be bounded by using the summation

 (15)

There are also two points that should be noted:

• as addressed above, the quantity (14) can be bounded by the generalization bound (5), and thus the analysis presented in this paper can also be applied to study (14);

• similar to the condition of "λ-close" [see (11)], the summation (15), in some sense, describes the behaviors of the labeling functions $g^{(S)}_*$ and $g^{(T)}_*$, because the functions $\tilde{g}^{(S)}_*$ and $\tilde{g}^{(T)}_*$ can be regarded as approximations of $g^{(S)}_*$ and $g^{(T)}_*$, respectively.

Next, we discuss the relationship between $D_F(S,T)$ and the aforementioned two quantities: the $H$-divergence and the discrepancy distance. Recalling Definition 3.1, since there is no limitation on the function class $F$, the integral probability metric $D_F(S,T)$ can be used in both classification and regression tasks. Therefore, we only consider the relationship between the integral probability metric $D_F(S,T)$ and the discrepancy distance $\mathrm{disc}_\ell(D^{(S)}, D^{(T)})$.

### 3.3 Relationship between $D_F(S,T)$ and $\mathrm{disc}_\ell(D^{(S)}, D^{(T)})$

From Definition 3.1 and (10), the integral probability metric $D_F(S,T)$ measures the difference between the distributions of the two domains $S$ and $T$. However, as addressed in Section 2, if a domain differs from another domain, there are three possibilities: the input-space distribution $D^{(S)}$ differs from $D^{(T)}$, or the labeling function $g^{(S)}_*$ differs from $g^{(T)}_*$, or both cases occur. Therefore, it is necessary to consider two kinds of differences: the difference between the input-space distributions $D^{(S)}$ and $D^{(T)}$, and the difference between the labeling functions $g^{(S)}_*$ and $g^{(T)}_*$. Next, we will show that the integral probability metric $D_F(S,T)$ can be bounded by two separate quantities that measure the difference between $D^{(S)}$ and $D^{(T)}$ and the difference between $g^{(S)}_*$ and $g^{(T)}_*$, respectively.

As shown in (13), the quantity $\mathrm{disc}_\ell(D^{(S)}, D^{(T)})$ actually measures the difference between the input-space distributions $D^{(S)}$ and $D^{(T)}$. Moreover, we introduce another quantity to measure the difference between the labeling functions $g^{(S)}_*$ and $g^{(T)}_*$:

###### Definition 3.2

Given a loss function $\ell$ and a function class $G$, we define

$$Q^{(T)}_G\big(g^{(S)}_*, g^{(T)}_*\big) := \sup_{g\in G}\big|E^{(T)}\ell\big(g(x^{(T)}), g^{(S)}_*(x^{(T)})\big) - E^{(T)}\ell\big(g(x^{(T)}), g^{(T)}_*(x^{(T)})\big)\big|. \tag{16}$$

Note that if both the loss function $\ell$ and the function class $G$ are non-trivial, the quantity $Q^{(T)}_G(g^{(S)}_*, g^{(T)}_*)$ is a (semi)metric between the labeling functions $g^{(S)}_*$ and $g^{(T)}_*$. In fact, it is not hard to verify that $Q^{(T)}_G$ satisfies the triangle inequality and is equal to zero if $g^{(S)}_*$ and $g^{(T)}_*$ match.

By combining (10), (13) and (16), we have

$$\begin{aligned}
\mathrm{disc}_\ell\big(D^{(S)}, D^{(T)}\big) &\ge \sup_{g_1\in G}\big|E^{(S)}\ell\big(g_1(x^{(S)}), g^{(S)}_*(x^{(S)})\big) - E^{(T)}\ell\big(g_1(x^{(T)}), g^{(S)}_*(x^{(T)})\big)\big|\\
&= \sup_{g_1\in G}\big|E^{(S)}\ell\big(g_1(x^{(S)}), g^{(S)}_*(x^{(S)})\big) - E^{(T)}\ell\big(g_1(x^{(T)}), g^{(T)}_*(x^{(T)})\big)\\
&\qquad + E^{(T)}\ell\big(g_1(x^{(T)}), g^{(T)}_*(x^{(T)})\big) - E^{(T)}\ell\big(g_1(x^{(T)}), g^{(S)}_*(x^{(T)})\big)\big|\\
&\ge \sup_{g_1\in G}\big|E^{(S)}\ell\big(g_1(x^{(S)}), g^{(S)}_*(x^{(S)})\big) - E^{(T)}\ell\big(g_1(x^{(T)}), g^{(T)}_*(x^{(T)})\big)\big|\\
&\qquad - \sup_{g_1\in G}\big|E^{(T)}\ell\big(g_1(x^{(T)}), g^{(T)}_*(x^{(T)})\big) - E^{(T)}\ell\big(g_1(x^{(T)}), g^{(S)}_*(x^{(T)})\big)\big|\\
&= D_F(S,T) - Q^{(T)}_G\big(g^{(S)}_*, g^{(T)}_*\big),
\end{aligned}$$

and thus

$$D_F(S,T) \le \mathrm{disc}_\ell\big(D^{(S)}, D^{(T)}\big) + Q^{(T)}_G\big(g^{(S)}_*, g^{(T)}_*\big), \tag{17}$$

which implies that the integral probability metric can be bounded by the summation of the discrepancy distance and the quantity $Q^{(T)}_G(g^{(S)}_*, g^{(T)}_*)$, which measure the difference between the input-space distributions $D^{(S)}$ and $D^{(T)}$ and the difference between the labeling functions $g^{(S)}_*$ and $g^{(T)}_*$, respectively.

Compared with (11) and (15), the integral probability metric $D_F(S,T)$ provides a new mechanism to capture the difference between two domains, where the difference between the labeling functions $g^{(S)}_*$ and $g^{(T)}_*$ is measured by the (semi)metric $Q^{(T)}_G(g^{(S)}_*, g^{(T)}_*)$.
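Inequality (17) can be checked numerically on a toy problem. The following sketch enumerates a tiny hypothesis class over a two-point input space with the absolute-value loss; all distributions and labeling functions below are invented for illustration:

```python
from itertools import product

# toy setup: two-point input space, absolute-value loss (illustrative)
X = [0, 1]
p_s = {0: 0.7, 1: 0.3}          # source input distribution D^(S)
p_t = {0: 0.4, 1: 0.6}          # target input distribution D^(T)
g_star_s = {0: 0, 1: 1}         # source labeling function
g_star_t = {0: 1, 1: 1}         # target labeling function
G = [dict(zip(X, vals)) for vals in product([0, 1], repeat=2)]  # all maps X -> {0,1}

loss = lambda a, b: abs(a - b)
E = lambda p, h: sum(p[x] * h(x) for x in X)   # expectation over an input distribution

# D_F(S,T) via the equivalent form (10)
d_f = max(abs(E(p_s, lambda x, g=g: loss(g[x], g_star_s[x]))
              - E(p_t, lambda x, g=g: loss(g[x], g_star_t[x]))) for g in G)
# discrepancy distance (13): supremum over pairs of hypotheses
disc = max(abs(E(p_s, lambda x: loss(g1[x], g2[x]))
               - E(p_t, lambda x: loss(g1[x], g2[x])))
           for g1 in G for g2 in G)
# Q (16): difference between the labeling functions under the target domain
q = max(abs(E(p_t, lambda x, g=g: loss(g[x], g_star_s[x]))
            - E(p_t, lambda x, g=g: loss(g[x], g_star_t[x]))) for g in G)

assert d_f <= disc + q + 1e-12  # inequality (17) holds on this example
```

On this particular example the bound is tight: $D_F(S,T)=0.7$ while $\mathrm{disc}_\ell = 0.3$ and $Q^{(T)}_G = 0.4$.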

###### Remark 3.1

As shown in (10) and (13), the integral probability metric $D_F(S,T)$ takes the supremum of a single function $g$ over $G$, while the discrepancy distance takes the supremum of two functions $g_1$ and $g_2$ over $G$ simultaneously. Consider a specific domain adaptation situation: the labeling function $g^{(T)}_*$ is close to $g^{(S)}_*$, and meanwhile both of them are contained in the function class $G$. In this case, $D_F(S,T)$ can be very small even though $\mathrm{disc}_\ell(D^{(S)}, D^{(T)})$ is large. Thus, the integral probability metric is more suitable for such a domain adaptation setting than the discrepancy distance.

## 4 Uniform Entropy Number and Rademacher Complexity

In this section, we introduce the definitions of the uniform entropy number and the Rademacher complexity, respectively.

### 4.1 Uniform Entropy Number

Generally, the generalization bound of a certain learning process is achieved by incorporating a complexity measure of function classes, e.g., the covering number, the VC dimension or the Rademacher complexity. The results of this paper are based on the uniform entropy number, which is derived from the concept of the covering number, and we refer to Mendelson [23] for more details. The covering number of a function class $F$ is defined as follows:

###### Definition 4.1

Let $F$ be a function class and let $d$ be a metric on $F$. For any $\xi>0$, the covering number of $F$ at radius $\xi$ with respect to the metric $d$, denoted by $N(F,\xi,d)$, is the minimum size of a cover of $F$ at radius $\xi$.

In some classical results of statistical learning theory, the covering number is applied by letting $d$ be a distribution-dependent metric. For example, as shown in Theorem 2.3 of Mendelson [23], one can set $d$ as the empirical $\ell_1$ norm and then derive the generalization bound of the i.i.d. learning process by incorporating the expectation of the covering number. However, in the situation of domain adaptation, we only know the information of the source domains, while this expectation depends on the distributions of both the source and the target domains. Therefore, the covering number is no longer applicable to our scheme for obtaining the generalization bounds for representative domain adaptation. In contrast, the uniform entropy number is distribution-free, and thus we choose it as the complexity measure of function classes to derive the generalization bounds.

For clarity of presentation, we give some useful notations for the following discussion. For any $1\le k\le K$, given a sample set $Z^{N_k}_1 = \{z^{(k)}_n\}_{n=1}^{N_k}$ drawn from the source domain $S_k$, we denote $Z'^{N_k}_1 = \{z'^{(k)}_n\}_{n=1}^{N_k}$ as the ghost sample set drawn from $S_k$, such that the ghost sample $z'^{(k)}_n$ has the same distribution as $z^{(k)}_n$ for any $1\le n\le N_k$ and any $1\le k\le K$. Again, given a sample set $\bar{Z}^{N_T}_1 = \{z^{(T)}_n\}_{n=1}^{N_T}$ drawn from the target domain $T$, let $\bar{Z}'^{N_T}_1$ be the ghost sample set of $\bar{Z}^{N_T}_1$. Denote $Z^{2N_k}_1 := \{Z^{N_k}_1, Z'^{N_k}_1\}$ and $\bar{Z}^{2N_T}_1 := \{\bar{Z}^{N_T}_1, \bar{Z}'^{N_T}_1\}$ for any $1\le k\le K$, respectively. Given any $\tau\in[0,1)$ and any $w=(w_1,\ldots,w_K)$ with $\sum_{k=1}^K w_k=1$, we introduce a variant of the $\ell_1$ norm, denoted $\ell^{w,\tau}_1$, which weights the empirical $\ell_1$ terms of the target and source samples in the same manner as (8).

It is noteworthy that this variant of the $\ell_1$ norm is still a norm on the functional space, which can be easily verified by using the definition of a norm, so we omit the details here. In the situation of representative domain adaptation, by setting the metric $d$ as $\ell^{w,\tau}_1$, we then define the uniform entropy number of $F$ with respect to the metric as

$$\ln N^{w,\tau}_1(F,\xi,2N) := \sup_{\{Z^{2N_k}_1\}_{k=1}^K,\ \bar{Z}^{2N_T}_1}\ln N\Big(F,\xi,\ell^{w,\tau}_1\big(\{Z^{2N_k}_1\}_{k=1}^K,\bar{Z}^{2N_T}_1\big)\Big) \tag{18}$$

with $N := N_T + \sum_{k=1}^K N_k$.
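For a finite function class, an upper bound on the covering number in Definition 4.1 can be computed greedily. The sketch below uses a plain (unweighted) empirical $\ell_1$ metric as a simplified stand-in for the weighted variant $\ell^{w,\tau}_1$; the function class and sample are hypothetical:

```python
import random

def covering_number(functions, metric, radius):
    """Greedy upper bound on N(F, xi, d): pick an uncovered function
    as a new center until every function lies within `radius` of one."""
    centers = []
    for f in functions:
        if all(metric(f, c) > radius for c in centers):
            centers.append(f)
    return len(centers)

rng = random.Random(2)
zs = [rng.uniform(-1, 1) for _ in range(100)]          # a fixed (double) sample

def l1_metric(f, g):
    # empirical l1 distance on the sample (unweighted stand-in for
    # the weighted variant used in (18))
    return sum(abs(f(z) - g(z)) for z in zs) / len(zs)

# a finite class of clipped linear functions (illustrative)
F = [lambda z, a=a: max(-1.0, min(1.0, a * z)) for a in
     [i / 10 for i in range(-20, 21)]]

n_cover = covering_number(F, l1_metric, radius=0.05)
# the uniform entropy number (18) then takes ln of the covering number
# and a supremum over all possible samples.
```

The greedy set is a valid cover, so its size upper-bounds the minimal covering number; the distribution-free supremum over samples in (18) is what makes the quantity usable when the target distribution is unknown.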

The Rademacher complexity is one of the most frequently used complexity measures of function classes and we refer to Van der Vaart and Wellner [28], Mendelson [23] for details.

###### Definition 4.2

Let $F$ be a function class and let $\{z_n\}_{n=1}^N$ be a sample set drawn from $Z$. Denote $\{\sigma_n\}_{n=1}^N$ as a set of random variables independently taking either value of $\{-1, +1\}$ with equal probability. The Rademacher complexity of $F$ is defined as

$$R(F) := E\sup_{f\in F}\Big\{\frac{1}{N}\sum_{n=1}^N\sigma_n f(z_n)\Big\} \tag{19}$$

with its empirical version given by

$$R_N(F) := E_\sigma\sup_{f\in F}\Big\{\frac{1}{N}\sum_{n=1}^N\sigma_n f(z_n)\Big\},$$

where $E$ stands for the expectation taken with respect to all random variables $\{z_n\}_{n=1}^N$ and $\{\sigma_n\}_{n=1}^N$, and $E_\sigma$ stands for the expectation taken only with respect to the random variables $\{\sigma_n\}_{n=1}^N$.
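The empirical Rademacher complexity in Definition 4.2 can be approximated by Monte Carlo over random sign vectors, as sketched below for a small hypothetical class of bounded functions:

```python
import random

def empirical_rademacher(functions, zs, n_rounds=500, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher complexity:
    average over random sign vectors of sup_f (1/N) sum_n sigma_n f(z_n)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        sigmas = [rng.choice((-1, 1)) for _ in zs]
        total += max(sum(s * f(z) for s, z in zip(sigmas, zs)) / len(zs)
                     for f in functions)
    return total / n_rounds

rng = random.Random(3)
zs = [rng.uniform(-1, 1) for _ in range(50)]           # a fixed sample
# small illustrative class of bounded linear functions (contains f and -f)
F = [lambda z, a=a: a * z for a in (-1.0, -0.5, 0.5, 1.0)]

r_hat = empirical_rademacher(F, zs, n_rounds=500)
# for a fixed finite class, the estimate shrinks roughly like 1/sqrt(N)
```

Since the class contains each function together with its negation, every round's supremum is nonnegative, and boundedness of the functions keeps the estimate in $[0, 1]$ here.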

## 5 Generalization Bounds for Representative Domain Adaptation

Based on the uniform entropy number defined in (18), we first present two types of generalization bounds for representative domain adaptation: Hoeffding-type and Bennett-type, which are derived from the Hoeffding-type and the Bennett-type deviation inequalities, respectively. Moreover, we obtain the bounds based on the Rademacher complexity via the McDiarmid-type deviation inequality.

### 5.1 Hoeffding-type Generalization Bounds

The following theorem presents the Hoeffding-type generalization bound for representative domain adaptation:

###### Theorem 5.1

Assume that $F$ is a function class consisting of bounded functions with range $[a,b]$. Let $\tau\in[0,1)$ and $w=(w_1,\ldots,w_K)$ with $\sum_{k=1}^K w_k=1$. Then, given any $\xi>(1-\tau)D^{(w)}_F(S,T)$, we have, for any sample sizes $N_T, N_1, \ldots, N_K$ such that

$$\frac{\tau^2(b-a)^2}{N_T(\xi')^2} + \sum_{k=1}^K\frac{(1-\tau)^2 w_k^2(b-a)^2}{N_k(\xi')^2} \le \frac{1}{8},$$

with probability at least $1-\epsilon$,

$$\sup_{f\in F}\big|E^\tau_w f - E^{(T)}f\big| \le (1-\tau)D^{(w)}_F(S,T) + \left(\frac{\ln N^{w,\tau}_1(F,\xi'/8,2N) - \ln(\epsilon/8)}{\dfrac{1}{32(b-a)^2\Big(\frac{\tau^2}{N_T} + \sum_{k=1}^K\frac{(1-\tau)^2 w_k^2}{N_k}\Big)}}\right)^{\frac{1}{2}}, \tag{20}$$

where $\xi' := \xi - (1-\tau)D^{(w)}_F(S,T)$, $N := N_T + \sum_{k=1}^K N_k$,

 ϵ:= 8Nw,τ1(F,ξ′/8,2N)exp⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩−(ξ′)232(b−a)2(τ2NT