# Robust Importance Weighting for Covariate Shift

In many learning problems, the training and testing data follow different distributions, and a particularly common situation is covariate shift. To correct for sampling biases, most approaches, including the popular kernel mean matching (KMM), focus on estimating the importance weights between the two distributions. Reweighting-based methods, however, are exposed to high variance when the distributional discrepancy is large and the weights are poorly estimated. On the other hand, the alternative approach of using nonparametric regression (NR) incurs high bias when the training size is limited. In this paper, we propose and analyze a new estimator that systematically integrates the residuals of NR with KMM reweighting, based on a control-variate perspective. The proposed estimator can be shown to either strictly outperform or match the best-known existing rates for both KMM and NR, and thus is a robust combination of both estimators. The experiments show that the estimator works well in practice.

## 1 Introduction

Traditional machine learning implicitly assumes training and test data are drawn from the same distribution. However, mismatches between training and test distributions occur frequently in reality. For example, in clinical trials the patients used for prognostic factor identification may not come from the target population due to sample selection bias [12, 9]; incoming signals used for natural language and image processing, bioinformatics or econometric analyses change in distribution over time and seasonality [11, 36, 26, 20, 31, 13, 4]; patterns for engineering controls fluctuate due to the non-stationarity of environments [25, 10].

Many such problems are investigated under the covariate shift assumption [23]. Namely, in a supervised learning setting with covariate $X$ and label $Y$, the marginal distribution of $X$ in the training set, $P_{tr}(X)$, shifts away from the marginal distribution of the test set, $P_{te}(X)$, while the conditional distribution $P(Y|X)$ remains invariant in both sets. Because test labels are either too costly to obtain or unobserved, it could be uneconomical or impossible to build predictive models only on the test set. In this case, one is obliged to utilize the invariance of the conditional probability to adapt or transfer knowledge from the training set, termed transfer learning [17] or domain adaptation [13, 3]. Intuitively, to correct for covariate shift (i.e., cancel the bias from the training set), one can reweight the training data by assigning more weight to observations in regions where test data are more concentrated. Indeed, the key to many approaches addressing covariate shift is the estimation of importance sampling weights, or the Radon-Nikodym derivative (RND) $dP_{te}/dP_{tr}$ between $P_{te}$ and $P_{tr}$ [27, 1, 14, 5, 34, 18, 22, 20, 25]. Among them is the popular kernel mean matching (KMM) [12, 20], which estimates the importance weights by matching means in a reproducing kernel Hilbert space (RKHS) and can be implemented efficiently by quadratic programming (QP).

Despite the demonstrated efficiency in many covariate shift problems [27, 20, 9], KMM can suffer from high variance for several reasons. The first concerns the RKHS assumption. As pointed out in [35], under a more realistic assumption from learning theory [6], when the true regression function does not lie in the RKHS but in a general range space indexed by a smoothness parameter $\theta > 0$, KMM degrades from the parametric rate $O(n_{tr}^{-\frac{1}{2}} + n_{te}^{-\frac{1}{2}})$ to the sub-canonical rate $O(n_{tr}^{-\frac{\theta}{2\theta+4}} + n_{te}^{-\frac{\theta}{2\theta+4}})$. Second, if the discrepancy between the training and testing distributions is large (e.g., test samples concentrate on regions where few training samples are located), the RND becomes unstable and leads to high variance [2], partially due to an induced sparsity, as most weights shrink towards zero while the non-zero ones surge to huge values. This is an intrinsic challenge for reweighting methods that occurs even if the RND is known in closed form. One way to bypass it is to identify model misspecification [33], but as mentioned in [28], the cross-validation for model selection needed in many related methods often requires the importance weights to cancel biases, so the necessity for reweighting remains.

In this paper we propose a method to reduce the variance of KMM in covariate shift problems. Our method relies on an estimated regression function and the application of the importance weighting on the residuals of the regression. Intuitively, these residuals have smaller magnitudes than the original loss values, and the resulting reweighted estimator thus becomes less sensitive to the variances of weights. Then, we cancel the bias incurred by the use of residuals by a judicious compensation through the estimated regression function evaluated on the test set.

We specialize our method by using a nonparametric regression (NR) function constructed from regularized least squares in an RKHS [6, 24, 29], also known as the Tikhonov regularized learning algorithm [7]. We show that our new estimator achieves the rate $O(n_{tr}^{-\frac{\theta}{2\theta+2}} + n_{te}^{-\frac{\theta}{2\theta+2}})$, which is superior to the best-known rate of KMM in [35], with the same computational complexity as KMM. Although the gap to the parametric rate is yet to be closed, the new estimator is a step in the right direction. To put this into perspective, we also compare with an alternative approach in [35], which constructs an NR function using the training set and then predicts by evaluating it on the test set. Such an approach leads to a better dependence on the test size but a worse dependence on the training size than KMM. Our estimator, which can be viewed as an ensemble of KMM and NR, achieves a convergence rate that either surpasses or matches both of these methods, and is thus in a sense robust against both estimators. In fact, we show our estimator can be motivated both from a variance reduction perspective on KMM using control variates [16, 8] and from a bias reduction perspective on NR.

Another noteworthy feature of the new estimator relates to data aggregation in empirical risk minimization (ERM). Specifically, when KMM is applied in learning algorithms or ERMs, the resulting optimal solution typically lies in a finite-dimensional span of the training data mapped into the feature space [21]. The optimal solution of our estimator, on the other hand, depends on both the training and testing data, thus highlighting a different and more efficient use of information that leverages both data sets simultaneously.

The paper is organized as follows. Section 2 reviews the background on KMM and NR that motivates our estimator. Section 3 presents the details of our estimator and studies its convergence property. Section 4 generalizes our method to ERM. Section 5 demonstrates experimental results.

## 2 Background and Motivation

### 2.1 Assumptions

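Denote by $P_{tr}$ the probability measure for the training variables $(X^{tr}, Y^{tr})$ and by $P_{te}$ that for the test variables $(X^{te}, Y^{te})$.

###### Assumption 1.

$P_{tr}(y \mid x) = P_{te}(y \mid x)$.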

###### Assumption 2.

The Radon-Nikodym derivative $\beta(x) \triangleq \frac{dP_{te}}{dP_{tr}}(x)$ exists and is bounded by $B < \infty$.

###### Assumption 3.

The covariate space is compact and the label space . Furthermore, there exists a kernel which induces a RKHS and a canonical feature map such that and for some .

Assumption 1 is the covariate shift assumption, which states that the conditional distribution $P(Y|X)$ remains invariant while the marginals $P_{tr}(X)$ and $P_{te}(X)$ differ. Assumptions 2 and 3 are common for establishing theoretical results. Specifically, Assumption 2 can be satisfied by restricting the support of $P_{tr}$ and $P_{te}$ to a compact set, although $B$ could be potentially large.

### 2.2 Problem Setup and Existing Approaches

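Given $n_{tr}$ labelled training data $\{(x^{tr}_j, y^{tr}_j)\}_{j=1}^{n_{tr}}$ and $n_{te}$ unlabelled test data $\{x^{te}_i\}_{i=1}^{n_{te}}$ (i.e., $\{y^{te}_i\}_{i=1}^{n_{te}}$ are unavailable), the goal is to estimate $\nu \triangleq E[Y^{te}]$. The KMM estimator [12, 9] is

$$V_{KMM} = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \hat\beta(x^{tr}_j)\, y^{tr}_j,$$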

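where $\hat\beta(x^{tr}_j)$ are solutions of a QP that attempts to match the means of the training and test sets in the feature space using weights $\hat\beta$:

$$\min_{\hat\beta} \left\{ \hat L(\hat\beta) \triangleq \left\| \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \hat\beta_j \Phi(x^{tr}_j) - \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \Phi(x^{te}_i) \right\|^2_{\mathcal{H}} \right\} \quad \text{s.t. } 0 \le \hat\beta_j \le B,\ \forall\, 1 \le j \le n_{tr}. \tag{1}$$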

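Notice we write $\hat\beta_j$ as $\hat\beta(x^{tr}_j)$ in $V_{KMM}$ informally to highlight $\hat\beta_j$ as estimates of $\beta(x^{tr}_j)$. The fact that (1) is a QP can be verified by the kernel trick, as in [9]. Indeed, define the matrix $K_{ij} \triangleq K(x^{tr}_i, x^{tr}_j)$ and the vector $\kappa_j \triangleq \frac{n_{tr}}{n_{te}} \sum_{i=1}^{n_{te}} K(x^{tr}_j, x^{te}_i)$; then optimization (1) is equivalent to

$$\min_{\hat\beta}\ \frac{1}{n_{tr}^2} \hat\beta^T K \hat\beta - \frac{2}{n_{tr}^2} \kappa^T \hat\beta \quad \text{s.t. } 0 \le \hat\beta_j \le B,\ \forall\, 1 \le j \le n_{tr}. \tag{2}$$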
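As a minimal sketch of how the box-constrained QP (2) could be solved numerically, the Python snippet below uses projected gradient descent. This is a solver swapped in for brevity, not the QP routine used in [9] or [35]. The Gaussian kernel, its bandwidth, the bound `B=10.0`, the iteration count, and the function names `rbf_kernel` and `kmm_weights` are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b (assumed kernel)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def kmm_weights(x_tr, x_te, B=10.0, bandwidth=1.0, n_iter=2000):
    """Approximately minimize beta^T K beta - 2 kappa^T beta subject to 0 <= beta_j <= B."""
    n_tr, n_te = len(x_tr), len(x_te)
    K = rbf_kernel(x_tr, x_tr, bandwidth)
    # kappa_j = (n_tr / n_te) * sum_i K(x_tr_j, x_te_i)
    kappa = (n_tr / n_te) * rbf_kernel(x_tr, x_te, bandwidth).sum(axis=1)
    # Up to a positive constant the gradient is K beta - kappa, whose Lipschitz
    # constant is the largest eigenvalue of K; use its inverse as the step size.
    step = 1.0 / np.linalg.eigvalsh(K)[-1]
    beta = np.ones(n_tr)
    for _ in range(n_iter):
        # Gradient step followed by projection onto the box [0, B]^n_tr.
        beta = np.clip(beta - step * (K @ beta - kappa), 0.0, B)
    return beta
```

With weights `beta = kmm_weights(x_tr, x_te)` in hand, the KMM estimate is simply `np.mean(beta * y_tr)`. Projected gradient is slow on ill-conditioned kernel matrices, so a dedicated QP solver would likely be preferable beyond toy sizes.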

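In practice, a constraint $\left| \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \hat\beta_j - 1 \right| \le \epsilon$ for a tolerance $\epsilon > 0$ is included to regularize the $\hat\beta$ towards the RND. As in [35], we omit it to simplify the analysis. On the other hand, the NR estimator

$$V_{NR} = \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat g(x^{te}_i),$$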

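is based on $\hat g$, some estimate of the regression function $g(x) \triangleq E[Y \mid X = x]$. Notice the conditional expectation is taken regardless of whether $x \sim P_{tr}$ or $x \sim P_{te}$. Here, we consider a $\hat g$ that is estimated nonparametrically by regularized least squares in the RKHS:

$$\hat g_{\gamma, data}(\cdot) = \arg\min_{f \in \mathcal{H}} \left\{ \frac{1}{m} \sum_{j=1}^{m} \left( f(x^{tr}_j) - y^{tr}_j \right)^2 + \gamma \|f\|^2_{\mathcal{H}} \right\}, \tag{3}$$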

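where $\gamma$ is a regularization term to be chosen and the subscript $data$ represents $\{(x^{tr}_j, y^{tr}_j)\}_{j=1}^{m}$. Using the representer theorem [21], optimization problem (3) can be solved in closed form with $\hat g_{\gamma, data}(x) = \sum_{j=1}^{m} \alpha^{reg}_j K(x^{tr}_j, x)$, where

$$\alpha^{reg} = (K + \gamma I)^{-1} y^{tr}, \tag{4}$$

and $y^{tr} = [y^{tr}_1, \dots, y^{tr}_m]^T$.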
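The closed form can be sketched in a few lines of Python. The snippet assumes a Gaussian kernel and the hypothetical helper names `rbf_kernel`, `fit_krr` and `predict_krr`. It solves with $K + m\gamma I$, which is what the first-order condition of (3) gives under its $1/m$ normalization; (4) as printed agrees once the factor $m$ is absorbed into $\gamma$.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b (assumed kernel)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def fit_krr(x_tr, y_tr, gamma, bandwidth=1.0):
    """Coefficients alpha of the minimizer f = sum_j alpha_j K(x_tr_j, .) of (3)."""
    m = len(x_tr)
    K = rbf_kernel(x_tr, x_tr, bandwidth)
    # Stationarity of (1/m)||K a - y||^2 + gamma a^T K a gives (K + m*gamma*I) a = y.
    return np.linalg.solve(K + m * gamma * np.eye(m), np.asarray(y_tr, dtype=float))

def predict_krr(alpha, x_tr, x_new, bandwidth=1.0):
    """Evaluate the fitted regression function at the rows of x_new."""
    return rbf_kernel(x_new, x_tr, bandwidth) @ alpha
```

The NR estimate is then `np.mean(predict_krr(alpha, x_tr, x_te))`, i.e. the fitted function averaged over the unlabelled test covariates.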

### 2.3 Motivation

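Depending on properties of $g$, [35] proves different rates of KMM. The most notable case is when $g \notin \mathcal{H}$ but rather $g \in \mathrm{Range}(\mathcal{T}_K^{\frac{\theta}{2\theta+4}})$, where $\mathcal{T}_K$ is the integral operator $(\mathcal{T}_K f)(x) = \int_{\mathcal{X}} K(x, x') f(x')\, P_{tr}(dx')$ on $L^2_{P_{tr}}$. In this case, [35] characterizes $g$ with the approximation error

$$A_2(g, F) \triangleq \inf_{\|f\|_{\mathcal{H}} \le F} \|g - f\|_{L^2_{P_{tr}}} \le C F^{-\frac{\theta}{2}}, \tag{5}$$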

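and the rate of KMM drops to the sub-canonical $O(n_{tr}^{-\frac{\theta}{2\theta+4}} + n_{te}^{-\frac{\theta}{2\theta+4}})$, as opposed to $O(n_{tr}^{-\frac{1}{2}} + n_{te}^{-\frac{1}{2}})$ when $g \in \mathcal{H}$. As shown in Lemma 4 in the Appendix and Theorem 4.1 of [6], (5) is almost equivalent to $g \in \mathrm{Range}(\mathcal{T}_K^{\frac{\theta}{2\theta+4}})$: $g \in \mathrm{Range}(\mathcal{T}_K^{\frac{\theta}{2\theta+4}})$ implies (5), while (5) leads to $g \in \mathrm{Range}(\mathcal{T}_K^{\frac{\theta}{2\theta+4} - \epsilon})$ for any $\epsilon > 0$. We adopt the range-space characterization as our analysis is based on related learning theory estimates. In particular, our proofs rely on these estimates and are different from [35]. For example, in (3), $\gamma$ is used as a free parameter for controlling $\|f\|_{\mathcal{H}}$, whereas [35] uses the parameter $F$ in (5). Although the two approaches are equivalent from an optimization viewpoint, with $\gamma$ being the Lagrange dual variable, the former approach turns out to be more suitable for analysing $V_R$.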

Correspondingly, the convergence rate for $V_{NR}$ when $g \in \mathrm{Range}(\mathcal{T}_K^{\frac{\theta}{2\theta+4}})$ is also established in [35], with $\hat g$ taken as $\hat g_{\gamma, data}$ in (3) and $\gamma$ chosen optimally. The rate of $V_{KMM}$ is usually better than that of $V_{NR}$ due to labelling cost (i.e. $n_{tr} < n_{te}$). However, in practice the performance of $V_{KMM}$ is not always better than that of $V_{NR}$. This could be partially explained by the hidden dependence of $V_{KMM}$ on a potentially large $B$, but more importantly, without variance reduction, KMM is subject to the negative effects of unstable importance sampling weights (i.e. the $\hat\beta$). On the other hand, the training of $\hat g$ requires labels and hence can only be done on the training set. Consequently, without reweighting, when estimating the test quantity $\nu$, the rate of $V_{NR}$ suffers from the bias.

This motivates the search for a robust estimator which does not require prior knowledge of the performance of $V_{KMM}$ or $V_{NR}$ and can, through a combination, reach or even surpass the better performance of the two. For simplicity, we use the mean squared error (MSE) criterion and assume an additive model $Y = g(X) + \mathcal{E}$, where the noise $\mathcal{E}$ is independent of $X$ and other errors. Under this framework, we motivate a remedy from two perspectives:

#### Variance Reduction for KMM:

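Consider an idealized KMM with $V_{KMM} = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x^{tr}_j)\, y^{tr}_j$ and $\beta$ being the true RND. Since

$$E[\beta(X^{tr}) Y^{tr}] = E_{x \sim P_{tr}}[\beta(x) g(x)] = E_{x \sim P_{te}}[g(x)] = \nu,$$

$V_{KMM}$ is unbiased and the only source of MSE becomes the variance. It then follows from standard control variates that, given an estimator $V$ and a zero-mean random variable $W$, we can set $t^\star = \mathrm{Cov}(V, W) / \mathrm{Var}(W)$ and use $V - t^\star W$ to obtain

$$\min_t \mathrm{Var}(V - tW) = \left(1 - \mathrm{corr}^2(V, W)\right) \mathrm{Var}(V) \le \mathrm{Var}(V),$$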

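without altering the mean of $V$. Thus we can use

$$W = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x^{tr}_j)\, \hat g(x^{tr}_j) - \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat g(x^{te}_i)$$

with $t^\star = \mathrm{Cov}(V_{KMM}, W) / \mathrm{Var}(W)$, supposing we have separately acquired a $\hat g$. To calculate $t^\star$, suppose $X^{tr}$ and $X^{te}$ are independent; then we have

$$\begin{aligned} \mathrm{Cov}(V_{KMM}, W) &= \frac{1}{n_{tr}} \mathrm{Cov}\left(\beta(X^{tr}) Y^{tr},\ \beta(X^{tr}) \hat g(X^{tr})\right) \\ &= \frac{1}{n_{tr}} \mathrm{Cov}\left(\beta(X^{tr}) g(X^{tr}),\ \beta(X^{tr}) \hat g(X^{tr})\right) \\ &\approx \frac{1}{n_{tr}} \mathrm{Var}\left(\beta(X^{tr}) \hat g(X^{tr})\right), \end{aligned}$$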

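if $\hat g$ is close enough to $g$. On the other hand, in the usual case where $n_{te} \gg n_{tr}$,

$$\begin{aligned} \mathrm{Var}(W) &= \frac{1}{n_{tr}} \mathrm{Var}\left(\beta(X^{tr}) \hat g(X^{tr})\right) + \frac{1}{n_{te}} \mathrm{Var}\left(\hat g(X^{te})\right) \\ &\approx \frac{1}{n_{tr}} \mathrm{Var}\left(\beta(X^{tr}) \hat g(X^{tr})\right). \end{aligned}$$

Thus, $t^\star \approx 1$, which gives our estimator

$$V_R = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x^{tr}_j) \left( y^{tr}_j - \hat g(x^{tr}_j) \right) + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat g(x^{te}_i).$$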
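For readers who want the control-variate step spelled out, the following short derivation (standard algebra, written here with $t^\star$ denoting the minimizing coefficient, a symbol introduced for this sketch) shows where the variance identity and the choice of coefficient close to one come from.

```latex
\begin{align*}
\mathrm{Var}(V - tW) &= \mathrm{Var}(V) - 2t\,\mathrm{Cov}(V, W) + t^2\,\mathrm{Var}(W), \\
t^\star &= \frac{\mathrm{Cov}(V, W)}{\mathrm{Var}(W)}
  \quad \text{(set the derivative in $t$ to zero)}, \\
\mathrm{Var}(V - t^\star W) &= \mathrm{Var}(V) - \frac{\mathrm{Cov}(V, W)^2}{\mathrm{Var}(W)}
  = \bigl(1 - \mathrm{corr}^2(V, W)\bigr)\,\mathrm{Var}(V), \\
V_{KMM} - W &= \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x^{tr}_j)\bigl(y^{tr}_j - \hat g(x^{tr}_j)\bigr)
  + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat g(x^{te}_i) = V_R
  \quad \text{(taking } t^\star \approx 1\text{)}.
\end{align*}
```

The last line uses the two approximations above: both $\mathrm{Cov}(V_{KMM}, W)$ and $\mathrm{Var}(W)$ are close to $\frac{1}{n_{tr}} \mathrm{Var}(\beta(X^{tr}) \hat g(X^{tr}))$, so their ratio is close to one.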

#### Bias Reduction for NR:

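Consider the NR estimator $V_{NR} = \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat g(x^{te}_i)$. Assuming again the common case where $n_{te} \gg n_{tr}$, we have

$$\mathrm{Var}(V_{NR}) = \frac{1}{n_{te}} \mathrm{Var}\left(\hat g(X^{te})\right) \approx 0,$$

and the main source of MSE is the bias $E_{x \sim P_{te}}[\hat g(x) - g(x)]$. If we add $\frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x^{tr}_j) \left( y^{tr}_j - \hat g(x^{tr}_j) \right)$, whose mean is $E_{x \sim P_{te}}[g(x) - \hat g(x)]$, to $V_{NR}$, we eliminate the bias, which gives the same estimator

$$V_R = \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \beta(x^{tr}_j) \left( y^{tr}_j - \hat g(x^{tr}_j) \right) + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat g(x^{te}_i).$$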

## 3 Robust Estimator

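We construct a new estimator $V_R(\rho)$ that can be shown to perform robustly against both the KMM and NR estimators discussed above. In our construction, we split the training set with a proportion $\rho$, i.e., divide $\{(x^{tr}_j, y^{tr}_j)\}_{j=1}^{n_{tr}}$ into

$$\{X^{tr}_{KMM}, Y^{tr}_{KMM}\}_{data} \triangleq \{(x^{tr}_j, y^{tr}_j)\}_{j=1}^{\lfloor \rho n_{tr} \rfloor},$$

and

$$\{X^{tr}_{NR}, Y^{tr}_{NR}\}_{data} \triangleq \{(x^{tr}_j, y^{tr}_j)\}_{j=\lfloor \rho n_{tr} \rfloor + 1}^{n_{tr}},$$

where $X^{tr}_{KMM}$ is used to solve for the weights $\hat\beta$ in (1) and $\{X^{tr}_{NR}, Y^{tr}_{NR}\}_{data}$ is used to train an NR function $\hat g = \hat g_{\gamma, data}$ for some $\gamma$ as in (3). Finally, we define our estimator as

$$V_R(\rho) \triangleq \frac{1}{\lfloor \rho n_{tr} \rfloor} \sum_{j=1}^{\lfloor \rho n_{tr} \rfloor} \hat\beta(x^{tr}_j) \left( y^{tr}_j - \hat g(x^{tr}_j) \right) + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat g(x^{te}_i). \tag{6}$$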
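Once the weights and the fitted regression function are available, (6) is a two-term average. The sketch below assumes the caller has already computed the weights on the KMM split and evaluated the regression function on both the KMM split and the test set; the function name `robust_estimate` and its argument names are hypothetical.

```python
import numpy as np

def robust_estimate(beta_hat, y_tr, g_hat_tr, g_hat_te):
    """V_R in (6): reweighted residual mean on the KMM split plus the NR mean on the test set.

    beta_hat : weights on the KMM split, length floor(rho * n_tr)
    y_tr     : labels on the KMM split
    g_hat_tr : fitted regression values at the KMM-split covariates
    g_hat_te : fitted regression values at the test covariates
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    residuals = np.asarray(y_tr, dtype=float) - np.asarray(g_hat_tr, dtype=float)
    return float(np.mean(beta_hat * residuals) + np.mean(g_hat_te))
```

Two limiting cases make the interpolation between the estimators visible: with a regression function that is identically zero the value reduces to the KMM estimate, and with a regression function that reproduces the labels exactly on the KMM split it reduces to the NR estimate.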

First, we remark that the parameter $\rho$ controlling the splitting of data serves mainly for theoretical considerations. In practice, the data can be used for both purposes simultaneously. Second, as mentioned, many $\hat g$ other than (3) could be considered for the control variate. However, aside from the availability of the closed-form expression (4), $\hat g_{\gamma, data}$ is connected to the learning theory estimates [6]. Thus, for establishing a theoretical bound, we focus on $\hat g_{\gamma, data}$ for now.

Our main result is the convergence analysis with respect to $n_{tr}$ and $n_{te}$, which rigorously justifies the previous intuition. In particular, we show that $V_R$ either surpasses or achieves the better rate between $V_{KMM}$ and $V_{NR}$. In all theorems that follow, the big-$O$ notations can be interpreted either as a high-probability bound or as a bound on expectation. The proofs are left to the Appendix.

###### Theorem 1.

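Under Assumptions 1-3, if we assume $g \in \mathrm{Range}(\mathcal{T}_K^{\frac{\theta}{2\theta+4}})$, the convergence rate of $V_R(\rho)$ satisfies

$$|V_R(\rho) - \nu| = O\left( n_{tr}^{-\frac{\theta}{2\theta+2}} + n_{te}^{-\frac{\theta}{2\theta+2}} \right), \tag{7}$$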

when is taken to be in (3) with and .

###### Corollary 1.

Under the same setting of theorem 1, if we choose , we have

$$|V_R(\rho) - \nu| = O\left( n_{tr}^{-\frac{\theta}{2\theta+4}} + n_{te}^{-\frac{\theta}{2\theta+4}} \right) \tag{8}$$

and if we choose ,

$$|V_R(\rho) - \nu| = O\left( n_{tr}^{-\frac{\theta}{2\theta+4}} + n_{te}^{-\frac{1}{2}} \right). \tag{9}$$

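We remark on several implications. First, although not achieving the canonical rate, (7) is an improvement over the best-known rate of $V_{KMM}$ when $g \in \mathrm{Range}(\mathcal{T}_K^{\frac{\theta}{2\theta+4}})$, especially for small $\theta$, suggesting that $V_R$ is more suitable than $V_{KMM}$ when $g$ is irregular. Indeed, $\theta$ is a smoothness parameter that measures the regularity of $g$. When $\theta$ increases, functions in $\mathrm{Range}(\mathcal{T}_K^{\frac{\theta}{2\theta+4}})$ get smoother and $\mathrm{Range}(\mathcal{T}_K^{\frac{\theta_2}{2\theta_2+4}}) \subseteq \mathrm{Range}(\mathcal{T}_K^{\frac{\theta_1}{2\theta_1+4}})$ for $\theta_1 \le \theta_2$, with the limiting case that $\theta \to \infty$, $\frac{\theta}{2\theta+4} \to \frac{1}{2}$ and $\mathrm{Range}(\mathcal{T}_K^{\frac{1}{2}}) \subseteq \mathcal{H}$ (i.e. $g \in \mathcal{H}$) for universal kernels by Mercer's theorem.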

Second, as in Theorem 4 of [35], the optimal tuning of $\gamma$ that leads to (7) depends on the unknown parameter $\theta$, so the tuning may not be adaptive in practice. However, with the simple $\theta$-free choice of $\gamma$ in Corollary 1, $V_R$ still achieves a rate no worse than that of $V_{KMM}$, as depicted in (8).

Third, also in Theorem 4 of [35], the rate of is when , which is better on but not . Since usually , the rate of generally excels. Indeed, in this case the rate of beats only if . However, if so, can still achieve rate in (9) which is better than , by simply taking , i.e., regularizing the training process more when the test set is small. Moreover, as , our estimator recovers the canonical rate as opposed to in .

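Thus, in summary, when $g \in \mathrm{Range}(\mathcal{T}_K^{\frac{\theta}{2\theta+4}})$, our estimator $V_R$ outperforms both $V_{KMM}$ and $V_{NR}$ across the relative sizes of $n_{tr}$ and $n_{te}$. The outperformance over $V_{KMM}$ is strict when $\gamma$ is chosen dependent on $\theta$, and the performance is matched when $\gamma$ is chosen robustly without knowledge of $\theta$.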
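To make the comparison of exponents concrete, the small Python helper below evaluates the exponents appearing in (7) and (8) exactly for a given smoothness parameter. The function name `rate_exponents` and the dictionary keys are hypothetical; only the two formulas read off (7) and (8) and the parametric exponent 1/2 are used.

```python
from fractions import Fraction

def rate_exponents(theta):
    """Exponents a such that the error is O(n^-a), read off (7) and (8)."""
    theta = Fraction(theta)
    if theta <= 0:
        raise ValueError("theta must be positive")
    return {
        "tuned_gamma": theta / (2 * theta + 2),   # V_R with theta-dependent tuning, (7)
        "robust_gamma": theta / (2 * theta + 4),  # V_R with theta-free tuning, (8)
        "parametric": Fraction(1, 2),             # canonical rate
    }

for theta in (1, 2, 10):
    e = rate_exponents(theta)
    print(theta, e["tuned_gamma"], e["robust_gamma"], e["parametric"])
```

For example, a smoothness parameter of 2 gives exponents 1/3 and 1/4, and both exponents approach 1/2 from below as the smoothness parameter grows, with the tuned exponent always the larger of the two.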

For completeness, we consider two other characterizations of discussed in [35]: one is and the other is for some (e.g., with being the Gaussian kernel, where is the Sobolev space with integer ). The two assumptions are, in a sense, more extreme (being optimistic or pessimistic). The next two results show that the rates of in these situations match the existing ones for (the rates for are not discussed in [35] under these assumptions).

###### Proposition 1.

Under Assumptions 1-3, if , the convergence rate of satisfies , when is taken to be for in (3).

###### Proposition 2.

Under Assumptions 1-3, if for some , the convergence rate of satisfies , when is taken to be for in (3).

## 4 Empirical Risk Minimization

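The robust estimator can handle empirical risk minimization (ERM). Given a loss function $l'(x, y; \theta)$ with parameter $\theta$ in $D$, we optimize over

$$\min_{\theta \in D} E[l'(X^{te}, Y^{te}; \theta)] = \min_{\theta \in D} E_{x \sim P_{te}}[l(x; \theta)],$$

where $l(x; \theta) \triangleq E[l'(x, Y; \theta) \mid X = x]$, to find

$$\theta^\star \triangleq \arg\min_{\theta \in D} E_{x \sim P_{te}}[l(x; \theta)].$$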

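In practice, usually a regularization term $\Omega[\theta]$ on $\theta$ is added. For example, the KMM in [12] considers

$$\min_{\theta \in D} \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \hat\beta(x^{tr}_j)\, l'(x^{tr}_j, y^{tr}_j; \theta) + \lambda \Omega[\theta]. \tag{10}$$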

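We can carry out a similar modification for $V_R$:

$$\min_{\theta \in D} \frac{1}{\lfloor \rho n_{tr} \rfloor} \sum_{j=1}^{\lfloor \rho n_{tr} \rfloor} \hat\beta(x^{tr}_j) \left( l'(x^{tr}_j, y^{tr}_j; \theta) - \hat l(x^{tr}_j; \theta) \right) + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat l(x^{te}_i; \theta) + \lambda \Omega[\theta], \tag{11}$$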

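with $\hat\beta$ based on $X^{tr}_{KMM}$ and $\hat l(x; \theta)$ being an estimate of $l(x; \theta)$ based on $\{X^{tr}_{NR}, Y^{tr}_{NR}\}_{data}$. For later reference, we note that a similar modification can also be used to utilize $V_{NR}$:

$$\min_{\theta \in D} \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \hat l(x^{te}_i; \theta) + \lambda \Omega[\theta]. \tag{12}$$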

We discuss two classical learning problems under (11).

#### Penalized Least Square Regression:

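Consider a regression problem with $l'(x, y; \theta) = \left( y - \langle \theta, \Phi(x) \rangle_{\mathcal{H}} \right)^2$ and $\Omega[\theta] = \|\theta\|^2_{\mathcal{H}}$. We have

$$l(x; \theta) = E[Y^2 \mid x] - 2 g(x) \langle \theta, \Phi(x) \rangle_{\mathcal{H}} + \langle \theta, \Phi(x) \rangle^2_{\mathcal{H}},$$

and a candidate for $\hat l(x; \theta)$ is to substitute $g$ with $\hat g$. Then, (11) becomes

$$\min_{\theta \in D} \sum_{j=1}^{\lfloor \rho n_{tr} \rfloor} \frac{-2 \hat\beta(x^{tr}_j)}{\lfloor \rho n_{tr} \rfloor} \left( y^{tr}_j - \hat g(x^{tr}_j) \right) \langle \theta, \Phi(x^{tr}_j) \rangle_{\mathcal{H}} + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \left( \hat g(x^{te}_i) - \langle \theta, \Phi(x^{te}_i) \rangle_{\mathcal{H}} \right)^2 + \lambda \|\theta\|^2_{\mathcal{H}},$$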

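by adding and removing the components not involving $\theta$. Furthermore, it simplifies to the QP:

$$\min_{\alpha \in \mathbb{R}^{\lfloor \rho n_{tr} \rfloor + n_{te}}} \frac{-2 w_1^T K_{tot} \alpha}{\lfloor \rho n_{tr} \rfloor} + \lambda \alpha^T K_{tot} \alpha + \frac{(w_2 - K_{tot} \alpha)^T W_3 (w_2 - K_{tot} \alpha)}{n_{te}}, \tag{13}$$

by the representer theorem [21]. Here $(K_{tot})_{ij} = K(x^{tot}_i, x^{tot}_j)$ and $x^{tot} = (x^{tr}_1, \dots, x^{tr}_{\lfloor \rho n_{tr} \rfloor}, x^{te}_1, \dots, x^{te}_{n_{te}})$, where $W_3$ is diagonal and $(w_1)_j = \hat\beta(x^{tr}_j)(y^{tr}_j - \hat g(x^{tr}_j))$, $(w_2)_j = 0$, $(W_3)_{jj} = 0$ for $1 \le j \le \lfloor \rho n_{tr} \rfloor$ and $(w_1)_j = 0$, $(w_2)_j = \hat g(x^{te}_{j - \lfloor \rho n_{tr} \rfloor})$, $(W_3)_{jj} = 1$ for $\lfloor \rho n_{tr} \rfloor < j \le \lfloor \rho n_{tr} \rfloor + n_{te}$. Notice (13) has a closed-form solution

$$\hat\alpha = \left( W_3 K_{tot} + \lambda n_{te} I \right)^{-1} \left( \frac{n_{te}}{\lfloor \rho n_{tr} \rfloor} w_1 + w_2 \right).$$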
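The closed form above can be sketched in Python as follows. The definitions of $w_1$, $w_2$ and $W_3$ used here are an assumed reading inferred from the structure of (13): $w_1$ carries the weighted residuals on the training block and zeros on the test block, $w_2$ carries the fitted regression values on the test block and zeros on the training block, and $W_3$ is the diagonal indicator of the test block. The function name `robust_ls_alpha` and its arguments are hypothetical.

```python
import numpy as np

def robust_ls_alpha(K_tot, beta_hat, resid_tr, g_hat_te, lam):
    """Closed-form minimizer of (13) under the assumed definitions of w1, w2, W3.

    K_tot    : kernel matrix on the stacked points (KMM-split covariates, then test covariates)
    beta_hat : weights on the KMM split, length m
    resid_tr : residuals y - g_hat on the KMM split, length m
    g_hat_te : fitted regression values on the test set, length n_te
    lam      : regularization weight lambda > 0
    """
    m, n_te = len(beta_hat), len(g_hat_te)
    w1 = np.concatenate([np.asarray(beta_hat) * np.asarray(resid_tr), np.zeros(n_te)])
    w2 = np.concatenate([np.zeros(m), np.asarray(g_hat_te, dtype=float)])
    W3 = np.diag(np.concatenate([np.zeros(m), np.ones(n_te)]))
    # Stationarity of (13): (W3 K_tot + lam * n_te * I) alpha = (n_te / m) w1 + w2.
    A = W3 @ K_tot + lam * n_te * np.eye(m + n_te)
    return np.linalg.solve(A, (n_te / m) * w1 + w2)
```

The fitted predictor at a new point is then the kernel evaluations against the stacked training and test covariates multiplied by the returned coefficient vector, which is what makes the solution depend on both data sets.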

#### Penalized Logistic Regression:

Consider a binary classification problem with , and . Thus, we have

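$$-l(x; \theta) = -g(x) \langle \theta, \Phi(x) \rangle_{\mathcal{H}} + \log\left( \frac{\exp \langle \theta, \Phi(x) \rangle_{\mathcal{H}}}{1 + \exp \langle \theta, \Phi(x) \rangle_{\mathcal{H}}} \right),$$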

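and we can again substitute $g$ with $\hat g$. Then, (11) becomes

$$\min_{\theta \in D} \sum_{j=1}^{\lfloor \rho n_{tr} \rfloor} \frac{\hat\beta(x^{tr}_j)}{\lfloor \rho n_{tr} \rfloor} \left( y^{tr}_j - \hat g(x^{tr}_j) \right) \langle \theta, \Phi(x^{tr}_j) \rangle_{\mathcal{H}} + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \left[ -\hat g(x^{te}_i) \langle \theta, \Phi(x^{te}_i) \rangle_{\mathcal{H}} + \log\left( \frac{\exp \langle \theta, \Phi(x^{te}_i) \rangle_{\mathcal{H}}}{1 + \exp \langle \theta, \Phi(x^{te}_i) \rangle_{\mathcal{H}}} \right) \right] + \lambda \|\theta\|^2_{\mathcal{H}},$$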

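which again simplifies, by [21], to the convex program:

$$\min_{\alpha \in \mathbb{R}^{\lfloor \rho n_{tr} \rfloor + n_{te}}} \frac{w_1^T K_{tot} \alpha}{\lfloor \rho n_{tr} \rfloor} - \frac{w_2^T K_{tot} \alpha}{n_{te}} + \lambda \alpha^T K_{tot} \alpha + \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \log\left( \frac{\exp\left( (K_{tot} \alpha)_{\lfloor \rho n_{tr} \rfloor + i} \right)}{1 + \exp\left( (K_{tot} \alpha)_{\lfloor \rho n_{tr} \rfloor + i} \right)} \right). \tag{14}$$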

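Both (13) and (14) can be optimized efficiently by standard solvers. Notably, as derived from (11), an optimal solution is of the form $\hat\theta = \sum_{j=1}^{\lfloor \rho n_{tr} \rfloor} \hat\alpha_j \Phi(x^{tr}_j) + \sum_{i=1}^{n_{te}} \hat\alpha_{\lfloor \rho n_{tr} \rfloor + i} \Phi(x^{te}_i)$, which spans both training and test data. In contrast, the solution of (10) or (12) only spans one of them. For example, as shown in [12], the penalized least squares solution for (10) is $\hat\theta = \sum_{j=1}^{n_{tr}} \hat\alpha_j \Phi(x^{tr}_j)$, where

$$\hat\alpha = \left( K + n_{te} \lambda\, \mathrm{diag}(\hat\beta)^{-1} \right)^{-1} y^{tr}$$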