## 1 Introduction

Traditional machine learning implicitly assumes that training and test data are drawn from the same distribution. However, mismatches between training and test distributions occur frequently in practice. For example, in clinical trials the patients used for prognostic factor identification may not come from the target population due to sample selection bias [12, 9]; incoming signals used for natural language and image processing, bioinformatics or econometric analyses change in distribution over time and seasonality [11, 36, 26, 20, 31, 13, 4]; patterns for engineering controls fluctuate due to the non-stationarity of environments [25, 10].

Many such problems are investigated under the covariate shift assumption [23]. Namely, in a supervised learning setting with covariate $X$ and label $Y$, the marginal distribution of $X$ in the training set, $P_{tr}(X)$, shifts away from the marginal distribution of the test set, $P_{te}(X)$, while the conditional distribution $P(Y \mid X)$ remains invariant in both sets. Because test labels are either too costly to obtain or unobserved, it could be uneconomical or impossible to build predictive models only on the test set. In this case, one is obliged to utilize the invariance of the conditional probability to adapt or transfer knowledge from the training set, termed transfer learning [17] or domain adaptation [13, 3]. Intuitively, to correct for covariate shift (i.e., cancel the bias from the training set), one can reweight the training data by assigning more weight to observations where the test data are more concentrated. Indeed, the key to many approaches addressing covariate shift is the estimation of importance sampling weights, or the Radon-Nikodym derivative (RND) $dP_{te}/dP_{tr}$ between $P_{te}$ and $P_{tr}$ [27, 1, 14, 5, 34, 18, 22, 20, 25]. Among them is the popular kernel mean matching (KMM) [12, 20], which estimates the importance weights by matching means in a reproducing kernel Hilbert space (RKHS) and can be implemented efficiently by quadratic programming (QP).

Despite the demonstrated efficiency in many covariate shift problems [27, 20, 9], KMM can suffer from high variance for several reasons. The first concerns the RKHS assumption. As pointed out in [35], under a more realistic assumption from learning theory [6], when the true regression function does not lie in the RKHS but in a general range space indexed by a smoothness parameter $\theta > 0$, KMM degrades from the parametric rate to a slower, sub-canonical rate. Second, if the discrepancy between the training and testing distributions is large (e.g., test samples concentrate on regions where few training samples are located), the RND becomes unstable and leads to high resulting variance [2], partially due to an induced sparsity as most weights shrink towards zero while the non-zero ones surge to huge values. This is an intrinsic challenge for reweighting methods that occurs even if the RND is known in closed form. One way to bypass it is to identify model misspecification [33], but as mentioned in [28], the cross-validation for model selection needed in many related methods often requires the importance weights to cancel biases, so the necessity for reweighting remains.

In this paper we propose a method to reduce the variance of KMM in covariate shift problems. Our method relies on an estimated regression function and the application of the importance weighting on the residuals of the regression. Intuitively, these residuals have smaller magnitudes than the original loss values, and the resulting reweighted estimator thus becomes less sensitive to the variances of weights. Then, we cancel the bias incurred by the use of residuals by a judicious compensation through the estimated regression function evaluated on the test set.

We specialize our method by using a nonparametric regression (NR) function constructed from regularized least squares in RKHS [6, 24, 29], also known as the Tikhonov regularized learning algorithm [7]. We show that our new estimator achieves a convergence rate that is superior to the best-known rate of KMM in [35], with the same computational complexity as KMM. Although the gap to the parametric rate is yet to be closed, the new estimator is a step in the right direction. To put this into perspective, we also compare with an alternate approach in [35] which constructs an NR function using the training set and then predicts by evaluating it on the test set. Such an approach leads to a better dependence on the test size but a worse dependence on the training size than KMM. Our estimator, which can be viewed as an ensemble of KMM and NR, achieves a convergence rate that either surpasses or matches both of these methods, and is thus, in a sense, robust against both estimators. In fact, we show our estimator can be motivated both from a variance reduction perspective on KMM using control variates [16, 8] and from a bias reduction perspective on NR.

Another noteworthy feature of the new estimator relates to data aggregation in empirical risk minimization (ERM). Specifically, when KMM is applied in learning algorithms or ERMs, the resulting optimal solution typically lies in a finite-dimensional span of the training data mapped into feature space [21]. The optimal solution of our estimator, on the other hand, depends on both the training and testing data, thus highlighting a different and more efficient use of information that leverages both data sets simultaneously.

The paper is organized as follows. Section 2 reviews the background on KMM and NR that motivates our estimator. Section 3 presents the details of our estimator and studies its convergence property. Section 4 generalizes our method to ERM. Section 5 demonstrates experimental results.

## 2 Background and Motivation

### 2.1 Assumptions

Denote by $P_{tr}$ the probability measure for the training variables $(X^{tr}, Y^{tr})$ and by $P_{te}$ that for the test variables $(X^{te}, Y^{te})$.

###### Assumption 1.

$P_{tr}(Y \mid X) = P_{te}(Y \mid X)$.

###### Assumption 2.

The Radon-Nikodym derivative $\beta(X) := \frac{dP_{te}}{dP_{tr}}(X)$ exists and is bounded by $B < \infty$.

###### Assumption 3.

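The covariate space $\mathcal{X}$ is compact and the label space $\mathcal{Y} \subseteq [0, 1]$. Furthermore, there exists a kernel $K(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which induces an RKHS $\mathcal{H}$ and a canonical feature map $\Phi(\cdot): \mathcal{X} \to \mathcal{H}$ such that $K(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}}$ and $\|\Phi(x)\|_{\mathcal{H}} \le R$ for some $0 < R < \infty$.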

Assumption 1 is the covariate shift assumption, which states that the conditional distribution $P(Y \mid X)$ remains invariant while the marginals $P_{tr}(X)$ and $P_{te}(X)$ differ. Assumptions 2 and 3 are common for establishing theoretical results. Specifically, Assumption 2 can be satisfied by restricting the support of $P_{te}$ and $P_{tr}$ to a compact set, although the bound $B$ could be potentially large.
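To make explicit why reweighting by the RND removes the covariate shift bias, the following short derivation may help. It is a sketch that uses only Assumptions 1 and 2; the symbols $\beta$ (the RND), $P_{tr}$, $P_{te}$ and $g(x) := \mathbb{E}[Y \mid X = x]$ are notation assumed here for illustration.

```latex
\begin{align*}
\mathbb{E}\left[Y^{te}\right]
  &= \int g(x)\, dP_{te}(x)
     && \text{(tower property; } g \text{ is shared by Assumption 1)} \\
  &= \int g(x)\, \beta(x)\, dP_{tr}(x)
     && \text{(change of measure, Assumption 2)} \\
  &= \mathbb{E}\left[\beta(X^{tr})\, Y^{tr}\right]
     && \text{(tower property under } P_{tr}\text{)}.
\end{align*}
```

Hence a $\beta$-weighted average of training labels targets the test mean, which is the quantity that importance weighting methods aim to estimate.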

### 2.2 Problem Setup and Existing Approaches

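Given labelled training data $\{(x_j^{tr}, y_j^{tr})\}_{j=1}^{n_{tr}}$ and unlabelled test data $\{x_i^{te}\}_{i=1}^{n_{te}}$ (i.e., the labels $\{y_i^{te}\}_{i=1}^{n_{te}}$ are unavailable), the goal is to estimate $\nu = \mathbb{E}[Y^{te}]$. The KMM estimator [12, 9] is

$$V_{KMM} = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \hat{\beta}(x_j^{tr})\, y_j^{tr},$$

where $\hat{\beta}(x_j^{tr})$ are solutions of a QP that attempts to match the means of the training and test sets in the feature space using weights $\hat{\beta}$:

$$\min_{\hat{\beta}} \ \left\| \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \hat{\beta}_j \Phi(x_j^{tr}) - \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \Phi(x_i^{te}) \right\|_{\mathcal{H}}^2 \quad \text{s.t. } 0 \le \hat{\beta}_j \le B, \ \forall\, 1 \le j \le n_{tr}. \tag{1}$$

Here $\Phi$ is the canonical feature map of Assumption 3 and $B$ is the bound on the RND in Assumption 2.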

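Notice we write $\hat{\beta}_j$ as $\hat{\beta}(x_j^{tr})$ in the KMM estimator informally to highlight $\hat{\beta}_j$ as estimates of the RND evaluated at $x_j^{tr}$. The fact that (1) is a QP can be verified by the kernel trick, as in [9]. Indeed, defining the matrix $K_{ij} := K(x_i^{tr}, x_j^{tr})$ and the vector $\kappa_j := \frac{n_{tr}}{n_{te}} \sum_{i=1}^{n_{te}} K(x_j^{tr}, x_i^{te})$, optimization (1) is equivalent to

$$\min_{\hat{\beta}} \ \frac{1}{n_{tr}^2} \hat{\beta}^\top K \hat{\beta} - \frac{2}{n_{tr}^2} \kappa^\top \hat{\beta} \quad \text{s.t. } 0 \le \hat{\beta}_j \le B, \ \forall\, 1 \le j \le n_{tr}. \tag{2}$$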
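The box-constrained QP described above can be sketched numerically as follows. This is a minimal illustration, not the solver used in the paper: it assumes a Gaussian kernel, builds the Gram matrix and the test-to-train kernel sums in the standard KMM way, and swaps a dedicated QP solver for plain projected gradient descent. The function names, the bandwidth, the bound `B = 10` and the synthetic data are all hypothetical choices.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def kmm_qp_terms(x_tr, x_te, bandwidth=1.0):
    """Gram matrix K on the training points and the vector kappa of scaled kernel sums."""
    n_tr, n_te = len(x_tr), len(x_te)
    K = gaussian_kernel(x_tr, x_tr, bandwidth)
    kappa = (n_tr / n_te) * gaussian_kernel(x_tr, x_te, bandwidth).sum(axis=1)
    return K, kappa

def kmm_objective(beta, K, kappa):
    """Mean-matching objective, up to a constant that does not depend on beta."""
    n_tr = len(beta)
    return (beta @ K @ beta - 2.0 * kappa @ beta) / n_tr**2

def kmm_weights(K, kappa, B=10.0, n_iter=2000):
    """Projected gradient descent on the box 0 <= beta_j <= B (assumed solver choice)."""
    n_tr = len(kappa)
    # Step size 1/L, where L = 2 * lambda_max(K) / n_tr^2 bounds the gradient's Lipschitz constant.
    step = n_tr**2 / (2.0 * np.linalg.eigvalsh(K)[-1])
    beta = np.ones(n_tr)
    for _ in range(n_iter):
        grad = (2.0 / n_tr**2) * (K @ beta - kappa)
        beta = np.clip(beta - step * grad, 0.0, B)  # projection onto the box
    return beta

rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, size=(200, 1))  # training covariates
x_te = rng.normal(0.5, 1.0, size=(300, 1))  # shifted test covariates
K, kappa = kmm_qp_terms(x_tr, x_te)
beta_hat = kmm_weights(K, kappa)
print("objective at uniform weights:", kmm_objective(np.ones(200), K, kappa))
print("objective at fitted weights: ", kmm_objective(beta_hat, K, kappa))
```

Projected gradient is used only to keep the sketch dependency-free; any off-the-shelf QP solver that supports bound constraints could replace `kmm_weights`.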

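In practice, a constraint $\left| \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \hat{\beta}_j - 1 \right| \le \epsilon$ for a tolerance $\epsilon > 0$ is included to regularize the $\hat{\beta}$ towards the RND. As in [35], we omit it to simplify the analysis. On the other hand, the NR estimator

$$V_{NR} = \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat{g}(x_i^{te})$$

is based on $\hat{g}$, some estimate of the regression function $g(x) := \mathbb{E}[Y \mid X = x]$. Notice the conditional expectation is taken regardless of $P_{tr}$ or $P_{te}$. Here, we consider a $\hat{g}$ that is estimated nonparametrically by regularized least squares in RKHS:

$$\hat{g}_{\gamma, data} = \arg\min_{f \in \mathcal{H}} \left\{ \frac{1}{m} \sum_{j=1}^{m} \left( f(x_j^{tr}) - y_j^{tr} \right)^2 + \gamma \|f\|_{\mathcal{H}}^2 \right\} \tag{3}$$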

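where $\gamma$ is a regularization term to be chosen and the subscript $data$ represents the $m$ labelled training points $\{(x_j^{tr}, y_j^{tr})\}_{j=1}^{m}$ used in the fit. Using the representer theorem [21], optimization problem (3) can be solved in closed form with $\hat{g}_{\gamma, data}(x) = \sum_{j=1}^{m} \alpha_j K(x_j^{tr}, x)$ where

$$\alpha = (K + \gamma m I)^{-1} y^{tr} \tag{4}$$

and $y^{tr} = [y_1^{tr}, \ldots, y_m^{tr}]^\top$.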
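The closed-form regularized least squares fit can be sketched in a few lines. This is an illustration under stated assumptions: a Gaussian kernel, the objective normalized by the sample size $m$ (so the linear system involves $K + \gamma m I$), and synthetic data; `fit_nr` and its arguments are hypothetical names.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def fit_nr(x, y, gamma, bandwidth=1.0):
    """Regularized least squares in the RKHS: solve (K + gamma * m * I) alpha = y."""
    m = len(x)
    K = gaussian_kernel(x, x, bandwidth)
    alpha = np.linalg.solve(K + gamma * m * np.eye(m), y)

    def g_hat(x_new):
        # Kernel expansion over the training points (representer theorem).
        return gaussian_kernel(x_new, x, bandwidth) @ alpha

    return g_hat, alpha

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=(100, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=100)
g_hat, alpha = fit_nr(x, y, gamma=1e-3)

x_new = np.linspace(-2.0, 2.0, 5)[:, None]
print("predictions:", g_hat(x_new))
print("sin(x_new): ", np.sin(x_new[:, 0]))
```

Larger values of `gamma` shrink the fit towards zero, which is the knob the analysis tunes against the sample sizes.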

### 2.3 Motivation

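Depending on the properties of the regression function $g$, [35] proves different rates for KMM. The most notable case is when $g \notin \mathcal{H}$ but rather lies in a range space of a power of $T_K$ indexed by the smoothness parameter $\theta$, where $T_K$ is the integral operator induced by the kernel $K$. In this case, [35] characterizes $g$ with the approximation error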

(5) |

and the rate of KMM drops to a sub-canonical rate, as opposed to the parametric rate when $g \in \mathcal{H}$. As shown in Lemma 4 in the Appendix and Theorem 4.1 of [6], (5) is almost equivalent to the range-space condition: the range-space condition implies (5), while (5) leads to the range-space condition up to an arbitrarily small loss in the exponent. We adopt the range-space characterization as our analysis is based on related learning theory estimates. In particular, our proofs rely on these estimates and are different from those in [35]. For example, in (3), $\gamma$ is used as a free parameter for controlling the RKHS norm of the fitted function, whereas [35] uses the parameter appearing in (5). Although the two approaches are equivalent from an optimization viewpoint, with $\gamma$ being the Lagrange dual variable, the former approach turns out to be more suitable for analysing our estimator.

Correspondingly, the convergence rate for the NR estimator $V_{NR}$ under the same range-space condition on $g$ is also shown in [35], with $\hat{g}$ taken as in (3) and $\gamma$ chosen optimally. The rate of the KMM estimator $V_{KMM}$ is usually better than that of $V_{NR}$ due to labelling cost (i.e., $n_{tr} < n_{te}$). However, in practice the performance of $V_{KMM}$ is not always better than that of $V_{NR}$. This could be partially explained by the hidden dependence of the rate of $V_{KMM}$ on the potentially large bound $B$, but more importantly, without variance reduction, KMM is subject to the negative effects of unstable importance sampling weights (i.e., the $\hat{\beta}$). On the other hand, the training of $\hat{g}$ requires labels and hence can only be done on the training set. Consequently, without reweighting, when estimating the test quantity $\mathbb{E}[Y^{te}]$, the rate of $V_{NR}$ suffers from the bias.

This motivates the search for a robust estimator which does not require prior knowledge of the performance of $V_{KMM}$ or $V_{NR}$ and can, through a combination, reach or even surpass the best performance of the two. For simplicity, we use the mean squared error (MSE) criterion and assume an additive model $Y = g(X) + \varepsilon$, where $\varepsilon$ is independent of $X$ and of the other errors. Under this framework, we motivate a remedy from two perspectives:

#### Variance Reduction for KMM:

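Consider an idealized KMM with $V_{KMM} = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x_j^{tr})\, y_j^{tr}$ and $\beta$ being the true RND. Since

$$\mathbb{E}\left[\beta(X^{tr})\, Y^{tr}\right] = \mathbb{E}\left[Y^{te}\right],$$

$V_{KMM}$ is unbiased and the only source of MSE becomes the variance. It then follows from standard control variates that, given an estimator $V$ and a zero-mean random variable $W$, we can set $t^* = \mathrm{Cov}(V, W) / \mathrm{Var}(W)$ and use $V - t^* W$ to obtain

$$\min_t \mathrm{Var}(V - tW) = \left(1 - \mathrm{corr}^2(V, W)\right) \mathrm{Var}(V) \le \mathrm{Var}(V)$$

without altering the mean of $V$. Thus we can use

$$W = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x_j^{tr})\, \hat{g}(x_j^{tr}) - \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat{g}(x_i^{te})$$

with $V = V_{KMM}$, supposing we have separately acquired a $\hat{g}$. To calculate $t^*$, suppose the training and test samples are independent; then we have

$$\mathrm{Cov}(V_{KMM}, W) \approx \frac{1}{n_{tr}} \mathrm{Var}\left(\beta(X^{tr})\, g(X^{tr})\right)$$

if $\hat{g}$ is close enough to $g$. On the other hand, in the usual case where $n_{te} \gg n_{tr}$,

$$\mathrm{Var}(W) = \frac{1}{n_{tr}} \mathrm{Var}\left(\beta(X^{tr})\, \hat{g}(X^{tr})\right) + \frac{1}{n_{te}} \mathrm{Var}\left(\hat{g}(X^{te})\right) \approx \frac{1}{n_{tr}} \mathrm{Var}\left(\beta(X^{tr})\, g(X^{tr})\right).$$

Thus, $t^* \approx 1$, which gives our estimator

$$V_R = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x_j^{tr}) \left( y_j^{tr} - \hat{g}(x_j^{tr}) \right) + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat{g}(x_i^{te}).$$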

#### Bias Reduction for NR:

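Consider the NR estimator $V_{NR} = \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat{g}(x_i^{te})$. Assuming again the common case where $n_{te} \gg n_{tr}$, we have

$$\mathrm{Var}(V_{NR}) = \frac{1}{n_{te}} \mathrm{Var}\left(\hat{g}(X^{te})\right) \approx 0$$

and the main source of MSE is the bias $\mathbb{E}\left[\hat{g}(X^{te}) - g(X^{te})\right]$. If we add the reweighted residual term $\frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x_j^{tr}) \left( y_j^{tr} - \hat{g}(x_j^{tr}) \right)$ to $V_{NR}$, we eliminate the bias, which gives the same estimator

$$V_R = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x_j^{tr}) \left( y_j^{tr} - \hat{g}(x_j^{tr}) \right) + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat{g}(x_i^{te}).$$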
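The two motivations can be checked on a toy simulation. The sketch below is illustrative only and rests on several assumptions that are not from the paper: Gaussian training and test covariates (so the true RND is known in closed form), a hypothetical regression function `g`, and a deliberately imperfect, fixed estimate `g_hat`. It compares the idealized reweighted estimator with the residual-reweighted estimator plus test-set compensation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tr, n_te, n_rep = 100, 2000, 500
shift = 0.5  # toy assumption: P_tr = N(0, 1), P_te = N(shift, 1)

def beta(x):
    """True RND dP_te/dP_tr for the two unit-variance Gaussians above."""
    return np.exp(shift * x - 0.5 * shift**2)

def g(x):
    """Hypothetical regression function E[Y | X = x]."""
    return 1.0 + np.sin(x)

def g_hat(x):
    """Deliberately imperfect, fixed estimate of g (stands in for a fitted NR function)."""
    return g(x) + 0.1 * np.cos(x)

# One row per independent replication of the experiment.
x_tr = rng.normal(0.0, 1.0, size=(n_rep, n_tr))
x_te = rng.normal(shift, 1.0, size=(n_rep, n_te))
y_tr = g(x_tr) + 0.1 * rng.normal(size=(n_rep, n_tr))

# Idealized reweighting of the labels versus reweighting of the residuals
# with compensation through g_hat evaluated on the test set.
v_kmm = np.mean(beta(x_tr) * y_tr, axis=1)
v_r = np.mean(beta(x_tr) * (y_tr - g_hat(x_tr)), axis=1) + np.mean(g_hat(x_te), axis=1)

# Exact target: E[1 + sin(X)] for X ~ N(shift, 1).
nu = 1.0 + np.sin(shift) * np.exp(-0.5)

print("target:", nu)
print("mean / variance, reweighted labels:   ", v_kmm.mean(), v_kmm.var())
print("mean / variance, reweighted residuals:", v_r.mean(), v_r.var())
```

Both estimators are unbiased in this setting because the weights are exact; the difference lies in the variance, which is driven by the magnitude of what is being reweighted.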

## 3 Robust Estimator

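We construct a new estimator $V_R(\rho)$ that can be shown to perform robustly against both the KMM and NR estimators discussed above. In our construction, we split the training set with a proportion $0 \le \rho \le 1$, i.e., divide $\{(x_j^{tr}, y_j^{tr})\}_{j=1}^{n_{tr}}$ into

$$D_{KMM} := \{(x_j^{tr}, y_j^{tr})\}_{j=1}^{\lfloor \rho n_{tr} \rfloor}$$

and

$$D_{NR} := \{(x_j^{tr}, y_j^{tr})\}_{j=\lfloor \rho n_{tr} \rfloor + 1}^{n_{tr}},$$

where $D_{KMM}$ is used to solve for the weights $\hat{\beta}$ in (1) and $D_{NR}$ is used to train an NR function $\hat{g}_\gamma$ for some $\gamma$ as in (3). Finally, we define our estimator as

$$V_R(\rho) := \frac{1}{\lfloor \rho n_{tr} \rfloor} \sum_{j=1}^{\lfloor \rho n_{tr} \rfloor} \hat{\beta}(x_j^{tr}) \left( y_j^{tr} - \hat{g}_\gamma(x_j^{tr}) \right) + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat{g}_\gamma(x_i^{te}). \tag{6}$$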
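Once the weights and the regression function are available, the final combination step is a one-liner. The sketch below assumes the estimator takes the form described in the text, namely a weighted average of regression residuals on the weighting split plus the average of the regression function on the test set; `robust_estimate` and its argument names are hypothetical.

```python
import numpy as np

def robust_estimate(beta_hat, y_kmm, g_hat_kmm, g_hat_te):
    """Weighted residual average on the weighting split plus the test-set average of g_hat.

    beta_hat  : importance weights on the weighting split
    y_kmm     : labels on the weighting split
    g_hat_kmm : regression predictions on the weighting split
    g_hat_te  : regression predictions on the unlabelled test covariates
    """
    beta_hat, y_kmm = np.asarray(beta_hat), np.asarray(y_kmm)
    g_hat_kmm, g_hat_te = np.asarray(g_hat_kmm), np.asarray(g_hat_te)
    return np.mean(beta_hat * (y_kmm - g_hat_kmm)) + np.mean(g_hat_te)

# Tiny hand-checkable inputs (illustrative values only).
beta_hat = [1.0, 2.0]
y_kmm = [1.0, 0.5]
g_hat_kmm = [0.5, 0.5]
g_hat_te = [0.2, 0.4, 0.6]

v_r = robust_estimate(beta_hat, y_kmm, g_hat_kmm, g_hat_te)
# With a zero regression function the reweighted-label estimator is recovered;
# with zero weights the pure regression estimator is recovered.
v_kmm_only = robust_estimate(beta_hat, y_kmm, [0.0, 0.0], [0.0, 0.0, 0.0])
v_nr_only = robust_estimate([0.0, 0.0], y_kmm, g_hat_kmm, g_hat_te)
print(v_r, v_kmm_only, v_nr_only)
```

The two limiting cases make the "ensemble" reading concrete: the estimator interpolates between pure reweighting and pure regression depending on how much of the signal the regression function already explains.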

First, we remark that the parameter $\rho$ controlling the splitting of the data serves mainly for theoretical considerations. In practice, the data can be used for both purposes simultaneously. Second, as mentioned, many choices of $\hat{g}$ other than (3) could be considered for the control variate. However, aside from the availability of the closed-form expression (4), $\hat{g}_\gamma$ is connected to the learning theory estimates [6]. Thus, for establishing a theoretical bound, we focus on $\hat{g}_\gamma$ for now.

Our main result is the convergence analysis with respect to $n_{tr}$ and $n_{te}$, which rigorously justifies the previous intuition. In particular, we show that our estimator either surpasses or achieves the better rate between $V_{KMM}$ and $V_{NR}$. In all theorems that follow, the big-$O$ notation can be interpreted either as a high-probability bound or as a bound in expectation. The proofs are left to the Appendix.

###### Theorem 1.

###### Corollary 1.

We remark on several implications. First, although it does not achieve the canonical rate, (7) is an improvement over the best-known rate of $V_{KMM}$ under the same range-space condition on $g$, especially for small $\theta$, suggesting that our estimator is more suitable than $V_{KMM}$ when $g$ is irregular. Indeed, $\theta$ is a smoothness parameter that measures the regularity of $g$. When $\theta$ increases, functions in the corresponding range space get smoother; the limiting cases follow, for universal kernels, from Mercer's theorem.

Second, as in Theorem 4 of [35], the optimal tuning of $\gamma$ that leads to (7) depends on the unknown parameter $\theta$, which may not be adaptive in practice. However, if one simply chooses $\gamma$ without knowledge of $\theta$, our estimator still achieves a rate no worse than that of $V_{KMM}$, as depicted in (8).

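Third, also in Theorem 4 of [35], the rate of $V_{NR}$ under the same condition on $g$ is better in $n_{te}$ but not in $n_{tr}$. Since usually $n_{tr} < n_{te}$, the rate of our estimator generally excels. Indeed, in this case the rate of $V_{NR}$ beats that of our estimator only under a restrictive condition on the relative sizes of $n_{tr}$ and $n_{te}$. However, even then, our estimator can still achieve the rate in (9), which is better than that of $V_{NR}$, by choosing $\gamma$ accordingly, i.e., regularizing the training process more when the test set is small. Moreover, as $\theta \to \infty$, our estimator recovers the canonical rate, unlike $V_{NR}$.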

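Thus, in summary, when $g$ satisfies the range-space condition, our estimator outperforms both $V_{KMM}$ and $V_{NR}$ across the relative sizes of $n_{tr}$ and $n_{te}$. The outperformance over $V_{KMM}$ is strict when $\gamma$ is chosen dependent on $\theta$, and the performance is matched when $\gamma$ is chosen robustly without knowledge of $\theta$.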

For completeness, we consider two other characterizations of $g$ discussed in [35]: one is $g \in \mathcal{H}$ and the other is a weaker condition on the approximation error of $g$ (satisfied, e.g., by functions in a Sobolev space of integer order when $K$ is the Gaussian kernel). The two assumptions are, in a sense, more extreme (being optimistic or pessimistic, respectively). The next two results show that the rates of our estimator in these situations match the existing ones for $V_{KMM}$ (the rates for $V_{NR}$ are not discussed in [35] under these assumptions).

###### Proposition 1.

## 4 Empirical Risk Minimization

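The robust estimator can handle empirical risk minimization (ERM). Given a loss function $l(x, y; \vartheta)$ parametrized by $\vartheta$ in a parameter set $\mathcal{D}$, we seek the minimizer of the expected test loss,

$$\vartheta^* = \arg\min_{\vartheta \in \mathcal{D}} \mathbb{E}\left[ l(X^{te}, Y^{te}; \vartheta) \right].$$

In practice, a regularization term on $\vartheta$ is usually added. For example, the KMM in [12] considers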

(10) |

We can carry out a similar modification for our robust estimator:

(11) |

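with the weights $\hat{\beta}$ based on the KMM split of the training data and the estimate of the conditional expected loss based on the NR split. For later reference, we note that a similar modification can also be used to utilize the NR estimator:

(12) |

We discuss two classical learning problems using (11).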

#### Penalized Least Squares Regression:

Consider a regression problem with , and . We have

and a candidate for the conditional loss estimate is to substitute $g$ with $\hat{g}$. Then, (11) becomes

by adding and removing the components not involving . Furthermore, it simplifies to the QP:

(13) |

by the representer theorem [21]. Here and where , , , for and , , , for . Notice (13) has a closed-form solution

#### Penalized Logistic Regression:

Consider a binary classification problem with , and . Thus, we have

and we can again substitute $g$ with $\hat{g}$. Then, (11) becomes

which again simplifies to, by [21], the convex program:

(14) |

Both (13) and (14) can be optimized efficiently by standard solvers. Notably, an optimal solution derived from (11) takes a form that spans both the training and test data. In contrast, the solution of (10) or (12) spans only one of them. For example, as shown in [12], the penalized least squares solution for (10) is a kernel expansion over the training points only.
