Information-theoretic analysis for transfer learning

05/18/2020, by Xuetong Wu, et al.

Transfer learning, or domain adaptation, is concerned with machine learning problems in which training and testing data come from possibly different distributions (denoted as μ and μ′, respectively). In this work, we give an information-theoretic analysis of the generalization error and the excess risk of transfer learning algorithms, following a line of work initiated by Russo and Zou. Our results suggest, perhaps as expected, that the Kullback-Leibler (KL) divergence D(μ||μ′) plays an important role in characterizing the generalization error in the setting of domain adaptation. Specifically, we provide generalization error upper bounds for general transfer learning algorithms and extend the results to a specific empirical risk minimization (ERM) algorithm where data from both distributions are available in the training phase. We further apply the method to iterative, noisy gradient descent algorithms and obtain upper bounds which can be easily calculated using only parameters from the learning algorithms. A few illustrative examples are provided to demonstrate the usefulness of the results. In particular, our bound is tighter in specific classification problems than the bound derived using Rademacher complexity.


I Introduction

Most machine learning methods focus on the setup where the training and testing data are drawn from the same distribution. Transfer learning, or domain adaptation, is concerned with machine learning problems where training and testing data come from possibly different distributions. This setup is of particular interest in real-world applications, as we often have easy access to a substantial amount of data from one distribution, on which our learning algorithm trains, but wish to use the learnt hypothesis on data coming from a different distribution, from which only limited training data are available.

Generalization error is defined as the difference between the empirical loss and the population loss (defined as (1) and (2) in Section II) for a given hypothesis, and indicates whether the hypothesis has been overfitted (or underfitted). Recently, [10] proposed an information-theoretic framework for analyzing the generalization error of learning algorithms, showing that the mutual information between the training data and the output hypothesis can be used to upper bound the generalization error. One nice property of this framework is that the mutual information bound explicitly captures the dependence between the training data and the output hypothesis, in contrast to the bounds obtained by traditional methods based on the VC dimension and Rademacher complexity [12]. As pointed out by [1], the information-theoretic upper bound can be substantially tighter than the traditional bounds if we exploit specific properties of the learning algorithm. While upper bounds on the generalization error are classical results in statistical learning theory, only a relatively small number of papers are devoted to this problem for transfer learning algorithms. To mention a few, Ben-David et al. [2] gave VC dimension-style bounds for classification problems. Blitzer et al. [3] and Zhang et al. [16] studied similar problems and obtained upper bounds in terms of Rademacher complexity. Error bounds for particular transfer learning algorithms and loss metrics are investigated in [6] and [17]. Long et al. [8] developed a more general framework for transfer learning in which the error is bounded in terms of the distribution difference and the adaptability of the output hypothesis.

Compared with traditional learning problems, the generalization error of transfer learning additionally takes into account the distribution divergence between the source and the target, and how to evaluate this "domain shift" is non-trivial. To address this issue, we exploit the information-theoretic framework studied by [10], [15] and [5] in the transfer learning setting. The main contributions are summarized as follows.

  • We give an information-theoretic upper bound on the generalization error of transfer learning algorithms where training and testing data come from different distributions, in which the KL divergence $D(\mu \| \mu')$ between the source and target distributions captures the effect of the domain shift.

  • We give upper bounds on the excess risk of a specific ERM algorithm where data from both distributions are available to the learning algorithm. Our example shows that, in specific classification problems, our bound is tighter than existing bounds that depend on the Rademacher complexity of the hypothesis space, as our bounds are data- and algorithm-dependent.

  • We further develop generalization error and excess risk upper bounds for noisy, iterative gradient descent algorithms. The results are useful in the sense that the bounds on the mutual information can be easily calculated using only parameters from the optimization algorithms.

II Problem formulation and main results

We consider an instance space $\mathcal{Z}$, a hypothesis space $\mathcal{W}$ and a non-negative loss function $\ell: \mathcal{W} \times \mathcal{Z} \rightarrow \mathbb{R}^{+}$. Let $\mu$ and $\mu'$ be two probability distributions defined on $\mathcal{Z}$, and assume that $\mu$ is absolutely continuous with respect to $\mu'$ ($\mu \ll \mu'$). In the sequel, the distribution $\mu$ is referred to as the source distribution, and $\mu'$ as the target distribution. We are given a set of training data $Z = (Z_1, \ldots, Z_n)$. More precisely, for a fixed number $\beta \in [0, 1]$ (with $\beta n$ an integer), we assume that the samples $Z_1, \ldots, Z_{\beta n}$ are drawn IID from the target distribution $\mu'$, and the samples $Z_{\beta n + 1}, \ldots, Z_n$ are drawn IID from the source distribution $\mu$.
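To make the sampling setup concrete, the following sketch draws such a training set (a minimal illustration; the Gaussian choices for $\mu$ and $\mu'$ are our own placeholders, not specified by the paper):

```python
import numpy as np

def draw_training_data(n, beta, rng, m_src=0.0, m_tgt=1.0, sd=1.0):
    """Draw n training samples: the first beta*n from the target mu',
    the remaining (1 - beta)*n from the source mu.
    The Gaussian distributions are placeholders for illustration."""
    n_t = int(beta * n)                          # number of target samples
    z_t = rng.normal(m_tgt, sd, size=n_t)        # Z_1, ..., Z_{beta n} ~ mu'
    z_s = rng.normal(m_src, sd, size=n - n_t)    # Z_{beta n + 1}, ..., Z_n ~ mu
    return z_t, z_s

rng = np.random.default_rng(0)
z_t, z_s = draw_training_data(n=100, beta=0.2, rng=rng)
print(len(z_t), len(z_s))  # 20 80
```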

In the setup of transfer learning, a learning algorithm is a (randomized) mapping from the training data $Z$ to a hypothesis $W \in \mathcal{W}$, characterized by a conditional distribution $P_{W|Z}$, with the goal of finding a hypothesis that minimizes the population risk with respect to the target distribution

$L_{\mu'}(w) := \mathbb{E}_{Z' \sim \mu'}[\ell(w, Z')],$   (1)

where $Z'$ is distributed according to $\mu'$. Notice that $\beta = 0$ corresponds to the important case when we do not have any samples from the target distribution. Obviously, $\beta = 1$ takes us back to the classical setup where training data come from the same distribution as test data, which is not our focus.

II-A Empirical risk minimization

In this section, we focus on one particular empirical risk minimization (ERM) algorithm. For a hypothesis $w \in \mathcal{W}$, the empirical risk of $w$ on a training sequence $z = (z_1, \ldots, z_m)$ is defined as

$\hat{L}_{z}(w) := \frac{1}{m} \sum_{i=1}^{m} \ell(w, z_i).$   (2)

Given samples $Z_t = (Z_1, \ldots, Z_{\beta n})$ and $Z_s = (Z_{\beta n + 1}, \ldots, Z_n)$ from both distributions, it is natural to form an empirical risk function as a convex combination of the empirical risks induced by $Z_t$ and $Z_s$ [2], defined as

$\hat{L}^{\alpha}_{Z}(w) := \alpha \hat{L}_{Z_t}(w) + (1 - \alpha) \hat{L}_{Z_s}(w)$

for some weight parameter $\alpha \in [0, 1]$ to be determined. We define $\hat{W} := \arg\min_{w \in \mathcal{W}} \hat{L}^{\alpha}_{Z}(w)$ as the ERM solution, and also define the optimal hypothesis (with respect to the target distribution $\mu'$) as $W^* := \arg\min_{w \in \mathcal{W}} L_{\mu'}(w)$.
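As an illustration, here is a minimal sketch of this $\alpha$-weighted ERM for the scalar quadratic loss $\ell(w, z) = (w - z)^2$ (our choice for illustration only), minimizing over a finite grid of candidate hypotheses:

```python
import numpy as np

def weighted_empirical_risk(w, z_t, z_s, alpha):
    """Convex combination of target and source empirical risks
    for the quadratic loss l(w, z) = (w - z)^2."""
    risk_t = np.mean((w - z_t) ** 2) if len(z_t) > 0 else 0.0
    risk_s = np.mean((w - z_s) ** 2)
    return alpha * risk_t + (1 - alpha) * risk_s

def erm(z_t, z_s, alpha, candidates):
    """Return the candidate hypothesis minimizing the weighted risk."""
    risks = [weighted_empirical_risk(w, z_t, z_s, alpha) for w in candidates]
    return candidates[int(np.argmin(risks))]

rng = np.random.default_rng(1)
z_s = rng.normal(0.0, 1.0, size=80)  # source samples
z_t = rng.normal(1.0, 1.0, size=20)  # target samples
w_hat = erm(z_t, z_s, alpha=0.7, candidates=np.linspace(-2, 3, 501))
print(w_hat)  # close to 0.7 * mean(z_t) + 0.3 * mean(z_s)
```

For the quadratic loss the minimizer is available in closed form (the $\alpha$-weighted combination of the two sample means); the grid search stands in for the general case where no closed form exists.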

We are interested in two quantities for this ERM algorithm. The first one is the generalization error, defined as

$\mathrm{gen}(\hat{W}, Z) := L_{\mu'}(\hat{W}) - \hat{L}^{\alpha}_{Z}(\hat{W}),$   (3)

namely the difference between the population risk of the ERM solution under the target distribution and the minimized empirical risk. We are also interested in the excess risk, defined as

$E_r := L_{\mu'}(\hat{W}) - L_{\mu'}(W^*),$

which is the difference between the population risk of $\hat{W}$ and that of the optimal hypothesis $W^*$. Notice that the excess risk is related to the generalization error via the following upper bound:

$E_r \le \mathrm{gen}(\hat{W}, Z) - \mathrm{gen}(W^*, Z),$   (4)

where we have used the fact $\hat{L}^{\alpha}_{Z}(\hat{W}) \le \hat{L}^{\alpha}_{Z}(W^*)$ by the definition of $\hat{W}$. For any $w \in \mathcal{W}$, the quantity $\mathrm{gen}(w, Z)$ in the above expression is defined as

$\mathrm{gen}(w, Z) := L_{\mu'}(w) - \hat{L}^{\alpha}_{Z}(w).$
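For completeness, the one-line derivation behind (4), using only the fact that $\hat{W}$ minimizes the empirical risk:

```latex
\begin{align*}
E_r &= L_{\mu'}(\hat{W}) - L_{\mu'}(W^*) \\
    &= \underbrace{L_{\mu'}(\hat{W}) - \hat{L}^{\alpha}_{Z}(\hat{W})}_{\mathrm{gen}(\hat{W}, Z)}
     + \underbrace{\hat{L}^{\alpha}_{Z}(\hat{W}) - \hat{L}^{\alpha}_{Z}(W^*)}_{\le \, 0}
     + \underbrace{\hat{L}^{\alpha}_{Z}(W^*) - L_{\mu'}(W^*)}_{-\mathrm{gen}(W^*, Z)} \\
    &\le \mathrm{gen}(\hat{W}, Z) - \mathrm{gen}(W^*, Z).
\end{align*}
```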

II-B Upper bound on generalization errors

We view the ERM solution $\hat{W}$ as a random variable induced by the random samples $Z$ and the (possibly random) ERM algorithm, characterized by a conditional distribution $P_{\hat{W}|Z}$. We will first study the expectation of the generalization error

$\overline{\mathrm{gen}} := \mathbb{E}_{P_{\hat{W}, Z}}[\mathrm{gen}(\hat{W}, Z)],$   (5)

where the expectation is taken with respect to the joint distribution $P_{\hat{W}, Z}$ defined as

$P_{\hat{W}, Z} := P_{\hat{W}|Z} \otimes P_Z, \qquad P_Z = \mu'^{\otimes \beta n} \otimes \mu^{\otimes (1 - \beta) n}.$

Furthermore, we use $P_{\hat{W}}$ to denote the marginal distribution of $\hat{W}$ induced by the joint distribution $P_{\hat{W}, Z}$.

Following the characterization used in [5], the following theorem provides an upper bound on the expectation of the generalization error in terms of the mutual information between individual samples and the solution $\hat{W}$, as well as the KL divergence between the source and target distributions. As pointed out in [5], using the mutual information $I(\hat{W}; Z_i)$ between the hypothesis and individual samples in general gives a tighter upper bound than using $I(\hat{W}; Z)$.

Theorem 1 (Generalization error of ERM).

Assume that the cumulant generating function of the random variable $\ell(\hat{W}, Z') - \mathbb{E}[\ell(\hat{W}, Z')]$ is upper bounded by $\psi(\lambda)$ in the interval $(b_-, b_+)$ under the product distribution $P_{\hat{W}} \otimes \mu'$, for some $b_- < 0$ and $b_+ > 0$. Then for any $\alpha \in [0, 1]$, the expectation of the generalization error in (5) is upper bounded as

$\overline{\mathrm{gen}} \le \frac{\alpha}{\beta n} \sum_{i=1}^{\beta n} \psi^{*-1}\big(I(\hat{W}; Z_i)\big) + \frac{1 - \alpha}{(1 - \beta) n} \sum_{i = \beta n + 1}^{n} \psi^{*-1}\big(I(\hat{W}; Z_i) + D(\mu \| \mu')\big),$

where we define

$\psi^{*-1}(x) := \inf_{\lambda \in [0, b_+)} \frac{x + \psi(\lambda)}{\lambda}.$

All the proofs in this paper can be found in [14].

Remark 1.

In fact, the bound above is not specific to the ERM algorithm, but is applicable to any hypothesis generated by a learning algorithm characterized by the conditional distribution $P_{W|Z}$ (see the proofs in [14] for more details).

From a stability point of view [4], good algorithms (ERM, for example) should ensure that $I(\hat{W}; Z_i)$ vanishes as $n \rightarrow \infty$. On the other hand, the domain shift is reflected in the KL divergence $D(\mu \| \mu')$, as this term does not vanish when $n$ goes to infinity.

Optimizing $\alpha$ in the above expression is non-trivial as $\hat{W}$ implicitly involves $\alpha$. However, if we care about the generalization error with respect to the population risk under the target distribution as $n \rightarrow \infty$ (so that the number of samples from the target distribution also goes to infinity), intuition says that we should choose $\alpha = 1$, i.e., only use the samples $Z_t$ from the target domain in the training process. On the other hand, if we only have limited data samples, $\alpha$ can be set as suggested in [16, 2], a choice shown to achieve a faster convergence rate and a tighter bound. Overall, we suggest that $\alpha$ should approach 1 as $n$ increases.

The result in Theorem 1 does not cover the case $\beta = 0$ (no samples from the target distribution). However, it is easy to see that in this case we should choose $\alpha = 0$ in our ERM algorithm, and a corresponding upper bound is given in the following corollary for a generic hypothesis.

Corollary 1 (Generalization error with source only).

Let $\beta = 0$ so that we only have samples from the source distribution $\mu$. Let $P_{W|Z}$ be the conditional distribution characterizing a learning algorithm which maps the samples $Z$ to a hypothesis $W$ (in particular, $W$ is not necessarily the same as the ERM solution $\hat{W}$). Under the assumptions in Theorem 1, the expected generalization error of $W$ is upper bounded as

$\mathbb{E}[\mathrm{gen}(W, Z)] \le \frac{1}{n} \sum_{i=1}^{n} \psi^{*-1}\big(I(W; Z_i) + D(\mu \| \mu')\big).$

If the loss function $\ell(w, Z')$ is $\sigma$-subgaussian, namely

$\log \mathbb{E}\Big[e^{\lambda (\ell(w, Z') - \mathbb{E}[\ell(w, Z')])}\Big] \le \frac{\lambda^2 \sigma^2}{2}$

for any $\lambda \in \mathbb{R}$ and any $w \in \mathcal{W}$ under the distribution $\mu'$, the bound in Theorem 1 can be further simplified with $\psi^{*-1}(x) = \sqrt{2 \sigma^2 x}$. In particular, if the loss function takes values in $[a, b]$, then $\ell$ is $\frac{b - a}{2}$-subgaussian. We give the following corollary for subgaussian loss functions.

Corollary 2 (Generalization error for subgaussian loss functions).

If $\ell(w, Z')$ is $\sigma$-subgaussian under the distribution $\mu'$ for any $w \in \mathcal{W}$, then the expectation of the generalization error of the ERM solution in (5) is upper bounded as

$\overline{\mathrm{gen}} \le \frac{\alpha}{\beta n} \sum_{i=1}^{\beta n} \sqrt{2 \sigma^2 I(\hat{W}; Z_i)} + \frac{1 - \alpha}{(1 - \beta) n} \sum_{i = \beta n + 1}^{n} \sqrt{2 \sigma^2 \big(I(\hat{W}; Z_i) + D(\mu \| \mu')\big)}.$

If $\beta = 0$, for any hypothesis $W$ (not necessarily the ERM solution) induced by $Z$ and a learning algorithm $P_{W|Z}$, we have the upper bound

$\mathbb{E}[\mathrm{gen}(W, Z)] \le \frac{1}{n} \sum_{i=1}^{n} \sqrt{2 \sigma^2 \big(I(W; Z_i) + D(\mu \| \mu')\big)}.$   (6)

The above result follows directly from Corollary 1 and by noticing that we can set $\psi^{*-1}(x) = \sqrt{2 \sigma^2 x}$ under the assumption that $\ell$ is $\sigma$-subgaussian.
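The simplification is a direct calculation: with $\psi(\lambda) = \lambda^2 \sigma^2 / 2$, minimizing over $\lambda$ gives

```latex
\psi^{*-1}(x)
  = \inf_{\lambda > 0} \frac{x + \lambda^2 \sigma^2 / 2}{\lambda}
  = \sqrt{2 \sigma^2 x},
  \qquad \text{attained at } \lambda = \sqrt{2x / \sigma^2},
```

so each $\psi^{*-1}$ term in Theorem 1 collapses to a square root.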

Remark 2.

Using the chain rule of mutual information and the fact that the $Z_i$'s are IID, we can relax the upper bound in (6) as

$\mathbb{E}[\mathrm{gen}(W, Z)] \le \sqrt{2 \sigma^2 \Big(\frac{I(W; Z)}{n} + D(\mu \| \mu')\Big)},$

which recovers the result in [15] if $\mu = \mu'$. Moreover, we see that the effect of the "change of domain" is simply captured by the KL divergence between the source and the target distribution.
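As a quick numeric illustration of the relaxed bound (a sketch under assumed values: the Gaussian pair, the subgaussian parameter σ and the mutual information budget below are our own choices, not the paper's):

```python
import numpy as np

def kl_equal_variance_gaussians(m_src, m_tgt, sd):
    """D(mu || mu') for two Gaussians with equal variance sd^2:
    (m_src - m_tgt)^2 / (2 sd^2)."""
    return (m_src - m_tgt) ** 2 / (2 * sd ** 2)

def relaxed_gen_bound(mi_total, n, d_kl, sigma):
    """Relaxed form of (6): sqrt(2 sigma^2 (I(W;Z)/n + D(mu||mu')))."""
    return np.sqrt(2 * sigma ** 2 * (mi_total / n + d_kl))

d_kl = kl_equal_variance_gaussians(m_src=0.0, m_tgt=0.5, sd=1.0)  # 0.125
for n in [10, 100, 1000]:
    # assume a mutual information budget I(W; Z) <= 1 nat for illustration
    print(n, relaxed_gen_bound(mi_total=1.0, n=n, d_kl=d_kl, sigma=0.5))
```

As $n$ grows, the bound tends to $\sqrt{2 \sigma^2 D(\mu \| \mu')}$, making the non-vanishing domain-shift term visible.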

II-C Upper bound on the excess risk of ERM

In this section we give a data-dependent upper bound on the excess risk defined in (4). To do this, we first define a distance quantity between the two distributions as

$d_{\mathcal{W}}(\mu, \mu') := \sup_{w \in \mathcal{W}} \big| L_{\mu}(w) - L_{\mu'}(w) \big|.$   (7)

The following theorem gives a bound for the excess risk.

Theorem 2 (Excess risk of ERM).

Assume that for any $w \in \mathcal{W}$, the loss function $\ell(w, Z')$ is $\sigma$-subgaussian under the distribution $\mu'$. Then for any $0 < \delta < 1$ and $\epsilon > 0$, there exists an $n_0$ (depending on $\delta$ and $\epsilon$) such that for all $n \ge n_0$, the following inequality holds with probability at least $1 - \delta$ (over the randomness of the samples and the learning algorithm):

(8)

Furthermore, an analogous inequality holds in the case when $\beta = 0$ (no samples from the target distribution $\mu'$).

Note that $d_{\mathcal{W}}(\mu, \mu')$ is normally known as an integral probability metric, which is challenging to evaluate. Sriperumbudur et al. [13] investigated data-dependent estimators for this quantity using the Kantorovich metric, the Dudley metric and the kernel distance, respectively. Ben-David et al. [2] proposed another evaluation method that resolves the issue for classification problems. We point out that the result in Theorem 2 is not effective for a class of supervised machine learning problems if $\mu$ is not absolutely continuous with respect to $\mu'$. Specifically, when the label $Y$ is a deterministic function of the features $X$, the KL divergence $D(\mu \| \mu')$ is $\infty$, leading to a vacuous bound. To develop an appropriate upper bound that handles such scenarios, we follow the methods in [7] and extend the results using other types of $f$-divergence; in particular, we choose alternatives that do not impose the absolute continuity restriction.

Corollary 3 (Generalization error bound of ERM using $f$-divergence).

Assume that for any $w \in \mathcal{W}$, the loss function $\ell(w, Z')$ has a bounded second moment under the distribution $\mu'$. Then for any $0 < \delta < 1$ and $\epsilon > 0$, there exists an $n_0$ (depending on $\delta$ and $\epsilon$) such that for all $n \ge n_0$, an inequality analogous to (8) holds with probability at least $1 - \delta$ (over the randomness of the samples and the learning algorithm), in which $\chi^2(\mu \| \mu') := \int \frac{(\mathrm{d}\mu - \mathrm{d}\mu')^2}{\mathrm{d}\mu'}$ denotes the $\chi^2$-divergence between the distributions $\mu$ and $\mu'$, and $\mathrm{TV}(\mu, \mu')$ denotes the total variation distance between them.

II-D Generalization error bound for noisy gradient descent algorithms

The upper bounds obtained in the previous sections cannot be evaluated directly as they depend on the distribution of the data, which is in general assumed unknown in learning problems. Furthermore, in most cases $\hat{W}$ does not have a closed-form solution, but is obtained using an optimization algorithm. In this section, we study the class of optimization algorithms that iteratively update the optimization variable based on both the source dataset $Z_s$ and the target dataset $Z_t$. The upper bound derived in this section is useful in the sense that it can be easily calculated if the relevant learning parameters are given. Specifically, the hypothesis is represented by the optimization variable $W \in \mathbb{R}^d$, and we use $W_t$ to denote the variable at iteration $t$. In particular, we consider the following noisy gradient descent algorithm:

$W_t = W_{t-1} - \eta_t \nabla_w \hat{L}^{\alpha}_{Z}(W_{t-1}) + n_t, \qquad t = 1, \ldots, T,$   (9)

where $W_0$ is initialized arbitrarily, $\nabla_w \hat{L}^{\alpha}_{Z}(W_{t-1})$ denotes the gradient of $\hat{L}^{\alpha}_{Z}$ with respect to $w$ evaluated at $W_{t-1}$, and $n_t$ can be any noise with mean $0$ and variance $\sigma_t^2 I_d$. A typical example is $n_t \sim \mathcal{N}(0, \sigma_t^2 I_d)$.
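A minimal sketch of the update (9) for the scalar quadratic loss used earlier (the step size, noise scale and iteration count are arbitrary illustrative choices; Gaussian noise is just one admissible option):

```python
import numpy as np

def noisy_gd(z_t, z_s, alpha, eta=0.1, sigma_t=0.05, T=200, rng=None):
    """Noisy gradient descent on the alpha-weighted empirical risk
    for l(w, z) = (w - z)^2."""
    rng = rng if rng is not None else np.random.default_rng(0)
    w = 0.0  # arbitrary initialization W_0
    for _ in range(T):
        # gradient of alpha*mean((w - z_t)^2) + (1 - alpha)*mean((w - z_s)^2)
        grad = 2 * (alpha * (w - z_t.mean()) + (1 - alpha) * (w - z_s.mean()))
        w = w - eta * grad + rng.normal(0.0, sigma_t)  # update rule (9)
    return w

rng = np.random.default_rng(2)
z_s = rng.normal(0.0, 1.0, size=80)  # source samples
z_t = rng.normal(1.0, 1.0, size=20)  # target samples
print(noisy_gd(z_t, z_s, alpha=0.7, rng=rng))
```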

Theorem 3 (Generalization error of noisy gradient descent).

Assume that $W_T$ is obtained from (9) at iteration $T$, that $\ell(w, Z')$ is $\sigma$-subgaussian under $\mu'$ for any $w$, and that the gradient is bounded, e.g., $\|\nabla_w \ell(w, z)\|_2 \le L$ for any $w \in \mathcal{W}$ and $z \in \mathcal{Z}$. Then

$\mathbb{E}[\mathrm{gen}(W_T, Z)] \le \alpha \sqrt{2 \sigma^2 \Delta_T} + (1 - \alpha) \sqrt{2 \sigma^2 \big(\Delta_T + D(\mu \| \mu')\big)},$   (10)

where we define

$\Delta_T := \sum_{t=1}^{T} \frac{d}{2} \log\Big(1 + \frac{\eta_t^2 L^2}{d \, \sigma_t^2}\Big) \ \ge\ I(W_T; Z_i).$   (11)

In this bound, we observe that if the optimization parameters (such as $\eta_t$, $\sigma_t$, $T$, $d$ and $L$) and the loss function are fixed, the generalization error bound is easy to calculate using the parameters given above. Also note that our assumptions require neither Gaussian noise nor a convex loss function $\ell$; this generality makes it possible to tackle a wider range of optimization problems. However, in many cases the bound cannot be directly applied to the excess risk, as (4) does not hold in general. One can further obtain an excess risk upper bound by utilizing Proposition 3 in [11] under the assumption of a strongly convex loss function, which guarantees convergence of the hypothesis.
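Taking (10) and (11) at face value, the bound reduces to a closed-form function of the optimization parameters alone. A sketch (all numerical values here are arbitrary illustrative assumptions):

```python
import numpy as np

def mi_surrogate(etas, sigmas, L, d):
    """Delta_T in (11): sum_t (d/2) * log(1 + eta_t^2 L^2 / (d sigma_t^2)),
    an upper estimate of I(W_T; Z_i) computable from the parameters alone."""
    etas, sigmas = np.asarray(etas), np.asarray(sigmas)
    return np.sum(0.5 * d * np.log1p(etas ** 2 * L ** 2 / (d * sigmas ** 2)))

def gen_bound(alpha, delta_T, d_kl, sigma):
    """Bound (10) with the sample-independent surrogate Delta_T."""
    return (alpha * np.sqrt(2 * sigma ** 2 * delta_T)
            + (1 - alpha) * np.sqrt(2 * sigma ** 2 * (delta_T + d_kl)))

T, d, L = 200, 1, 2.0
delta_T = mi_surrogate(etas=[0.1] * T, sigmas=[0.05] * T, L=L, d=d)
print(gen_bound(alpha=0.7, delta_T=delta_T, d_kl=0.125, sigma=0.5))
```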

III Examples

In this section, we provide two simple examples to illustrate the upper bounds obtained in the previous sections.

III-A Estimating the mean of a Gaussian

We consider an example studied in [5]. Assume that the training data $Z = (Z_1, \ldots, Z_n)$ come from the source distribution $\mu = \mathcal{N}(m, \sigma_d^2)$ and the test data come from the target distribution $\mu' = \mathcal{N}(m', \sigma_d^2)$, where $m \neq m'$. We define the loss function as $\ell(w, z) = (w - z)^2$. For simplicity we assume here that $\beta = 0$. The empirical risk minimization (ERM) solution is obtained by minimizing $\hat{L}_{Z}(w) = \frac{1}{n} \sum_{i=1}^{n} (w - Z_i)^2$, where the solution is given by $\hat{W} = \frac{1}{n} \sum_{i=1}^{n} Z_i$.

To obtain the upper bound, we first notice that in this case $I(\hat{W}; Z_i) = \frac{1}{2} \log \frac{n}{n-1}$ for all $i$ [5]. It is easy to see that the loss function $\ell(\hat{W}, Z') = (\hat{W} - Z')^2$ follows a scaled non-central chi-square distribution with one degree of freedom, where $\hat{W} - Z'$ has variance $(1 + \frac{1}{n}) \sigma_d^2$ under $P_{\hat{W}} \otimes \mu'$. Furthermore, the cumulant generating function can be bounded as in [5]. By Corollary 1 and the definition of $\psi^{*-1}$, choosing $\lambda$ appropriately and substituting into the generalization error bound, we reach a bound that decays at rate $O(1/\sqrt{n})$ towards a constant determined by $D(\mu \| \mu') = \frac{(m - m')^2}{2 \sigma_d^2}$. In this case, the generalization error of $\hat{W}$ can be calculated exactly to be

$\mathbb{E}[\mathrm{gen}(\hat{W}, Z)] = (m - m')^2 + \frac{2 \sigma_d^2}{n}.$

The derived bound captures the exact generalization error asymptotically well, albeit with a slower rate of convergence, which is often also the case for bounds based on Rademacher complexity [16].
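The exact expression is easy to verify by simulation (a sanity-check sketch; the values of $m$, $m'$ and $\sigma_d$ are arbitrary):

```python
import numpy as np

def empirical_gen_error(n, m_src, m_tgt, sd, trials=20000, rng=None):
    """Monte-Carlo estimate of E[gen] = E[L_{mu'}(W_hat) - L_hat_Z(W_hat)]
    for W_hat = sample mean of n source samples and quadratic loss."""
    rng = rng if rng is not None else np.random.default_rng(0)
    z = rng.normal(m_src, sd, size=(trials, n))    # source samples
    w_hat = z.mean(axis=1, keepdims=True)          # ERM solution per trial
    emp_risk = ((w_hat - z) ** 2).mean(axis=1)     # empirical risk
    # population risk under the target: E[(w - Z')^2] = (w - m_tgt)^2 + sd^2
    pop_risk = (w_hat[:, 0] - m_tgt) ** 2 + sd ** 2
    return (pop_risk - emp_risk).mean()

n, m_src, m_tgt, sd = 50, 0.0, 1.0, 1.0
print(empirical_gen_error(n, m_src, m_tgt, sd))  # Monte-Carlo estimate
print((m_src - m_tgt) ** 2 + 2 * sd ** 2 / n)    # exact: (m - m')^2 + 2 sd^2 / n
```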

III-B Logistic regression transfer

In this section, we apply our bound to a typical classification problem. Consider the following logistic regression problem in a 2-dimensional feature space, shown in Figure 1. For each sample $z = (x, y)$ and hypothesis $w$, the loss function is the cross-entropy loss

$\ell(w, z) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}),$

where $\hat{y} = \frac{1}{1 + e^{-w^{\top} x}}$.

Fig. 1: The source data are sampled from a truncated Gaussian distribution, while the target data are sampled from a truncated Gaussian distribution with different parameters. The corresponding label $y \in \{0, 1\}$ is generated from a Bernoulli distribution with success probability $\hat{y}(x)$, with different underlying parameters $w$ for the source and for the target.

Here we truncate the Gaussian random variables so that the features fall in a bounded region. We also restrict the hypothesis space to a bounded set $\mathcal{W}$ in which the ERM solution falls with high probability. It can be easily checked that $\mu \ll \mu'$ and the loss function is bounded; hence we can upper bound the generalization error using Corollary 2. To this end, we first fix the source samples $Z_s$, while the number of target samples varies from 100 to 100000, and we choose the weights following the guidelines from [2, 16]. We give an empirical estimate of $I(\hat{W}; Z_i)$ within the according hypothesis space.

To evaluate the mutual information efficiently, we follow the work [9] by repeatedly generating $\hat{W}$ and $Z_i$. As $z = (x, y)$, we decompose the KL divergence in terms of the feature distributions and the conditional distributions of the labels:

$D(\mu \| \mu') = D(\mu_X \| \mu'_X) + \mathbb{E}_{X \sim \mu_X}\big[D(\mu_{Y|X} \| \mu'_{Y|X})\big].$

The first term can be calculated using the parameters of the Gaussian distributions. The latter term is the expected KL divergence over $X$ between two Bernoulli distributions, which can be evaluated by generating abundant samples from the source domain. Further, we apply Theorem 2 to upper bound the excess risk, where we give a data-dependent estimate for the term $d_{\mathcal{W}}(\mu, \mu')$.
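A sketch of this two-term evaluation (all distribution parameters are placeholders, and the truncation is ignored for brevity, so the first term uses the untruncated Gaussian formula):

```python
import numpy as np

def kl_gauss_2d(mu0, mu1, cov0, cov1):
    """Closed-form KL divergence between two 2-d Gaussians (feature term)."""
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - 2
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), elementwise."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# placeholder source/target parameters (not the paper's actual values)
mu_s, mu_t = np.array([0.0, 0.0]), np.array([1.0, 1.0])
cov = np.eye(2)
w_s, w_t = np.array([1.0, -1.0]), np.array([0.5, -0.5])  # label models

rng = np.random.default_rng(3)
x = rng.multivariate_normal(mu_s, cov, size=100000)  # abundant source samples
label_term = kl_bernoulli(sigmoid(x @ w_s), sigmoid(x @ w_t)).mean()
print(kl_gauss_2d(mu_s, mu_t, cov, cov) + label_term)  # estimate of D(mu||mu')
```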

To demonstrate the usefulness of our bound, we compare it with the bound in the following theorem, which is based on the Rademacher complexity under the same domain adaptation framework. Detailed experiment settings can be found in [14].

Theorem 4 (Generalization error of ERM with Rademacher complexity).

[16, Theorem 6.2] Assume that for any $w \in \mathcal{W}$, the loss function $\ell(w, Z)$ is $\sigma$-subgaussian under the distribution $\mu$ or $\mu'$. Then for any $0 < \delta < 1$, the following inequality holds with probability at least $1 - \delta$ (over the randomness of the samples and the learning algorithm), in which the Rademacher variables $\epsilon_i$ are selected uniformly at random from $\{-1, +1\}$.

The comparisons of the generalization error bounds and the excess risk bounds are shown in Figure 2. The true losses are indeed bounded by our developed upper bounds. The results also suggest that our bound is tighter than the Rademacher complexity bound in terms of both the generalization error and the excess risk. This is possibly because the generalization error bound based on Rademacher complexity is characterized by the domain difference over the whole hypothesis space, while our bound is data- and algorithm-dependent and is concerned only with the output hypothesis $\hat{W}$. As expected, the data-algorithm dependent bound captures the true behaviour of the generalization error, while the Rademacher complexity bound fails to do so. It is noteworthy that both bounds converge as $n$ increases. The results confirm that our bounds capture the dependence between the input data and the output hypothesis, as well as the stochasticity of the algorithm.

Fig. 2: Comparisons of the generalization error and excess risk bounds.

References

  • [1] A. Asadi, E. Abbe, and S. Verdú (2018) Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems 31, pp. 7234–7243.
  • [2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Machine Learning 79 (1), pp. 151–175.
  • [3] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman (2008) Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems 20, pp. 129–136.
  • [4] O. Bousquet and A. Elisseeff (2002) Stability and generalization. Journal of Machine Learning Research 2 (Mar), pp. 499–526.
  • [5] Y. Bu, S. Zou, and V. V. Veeravalli (2019) Tightening mutual information based bounds on generalization error. arXiv:1901.04609 [cs, stat].
  • [6] W. Dai, Q. Yang, G. Xue, and Y. Yu (2007) Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, pp. 193–200.
  • [7] J. Jiao, Y. Han, and T. Weissman (2017) Dependence measures bounding the exploration bias for general measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 1475–1479.
  • [8] M. Long, J. Wang, G. Ding, S. J. Pan, and P. S. Yu (2013) Adaptation regularization: a general framework for transfer learning. IEEE Transactions on Knowledge and Data Engineering 26 (5), pp. 1076–1089.
  • [9] R. Moddemeijer (1989) On estimation of entropy and mutual information of continuous distributions. Signal Processing 16 (3), pp. 233–248.
  • [10] D. Russo and J. Zou (2016) Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, pp. 1232–1240.
  • [11] M. Schmidt, N. Le Roux, and F. R. Bach (2011) Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, pp. 1458–1466.
  • [12] S. Shalev-Shwartz and S. Ben-David (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  • [13] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet (2012) On the empirical estimation of integral probability metrics. Electronic Journal of Statistics 6, pp. 1550–1599.
  • [14] Supplementary materials – proofs.
  • [15] A. Xu and M. Raginsky (2017) Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems 30, pp. 2524–2533.
  • [16] C. Zhang, L. Zhang, and J. Ye (2012) Generalization bounds for domain adaptation. In Advances in Neural Information Processing Systems, pp. 3320–3328.
  • [17] Y. Zhang, T. Liu, M. Long, and M. I. Jordan (2019) Bridging theory and algorithm for domain adaptation. arXiv preprint arXiv:1904.05801.