Introduction
In standard supervised machine learning, the data used to evaluate the generalization error of a classifier is assumed to be independently drawn from the same distribution that generates the training samples. This assumption of independent and identically distributed (IID) data is partially violated in the covariate shift setting [shimodaira2000improving], where the conditional label distribution P(y|x) is shared by source and target data, but the distribution over input variables differs between source and target samples, i.e., P_{src}(x) differs from P_{trg}(x). All models trained under IID assumptions can suffer from covariate shift and provide overly optimistic extrapolation when generalizing to new data [fan2005improved]. An intuitive and traditional method to address covariate shift is importance weighting [shimodaira2000improving, zadrozny2004learning], which tries to debias the objective loss function by weighting each instance with the ratio P_{trg}(x)/P_{src}(x). However, importance weighting not only results in high variance predictions, but also only provides generalization performance guarantees when strong conditions are met by the source and target data distributions [cortes2010learning].

The recently developed robust bias aware (RBA) approach to covariate shift [liu2014robust]
is based on a minimax robust estimation formulation
[grunwald2004game] that assumes the worst case conditional label distribution and requires only source feature expectation matching as constraints. The approach provides conservative target predictions when the target distribution does not have sufficient statistical support from the source data. This statistical support is defined by the choice of source statistics, or features. The classifier tries to make the prediction certainty under the target distribution as small as possible, but feature matching constraints prevent it from doing so fully. As a result, less restrictive feature constraints produce less certain predictions on target data from the resulting classifier. As shown in Figure 1(a), with limited features, the classifier may allocate most of the certainty under portions of the source distribution (solid line) where the target distribution (dashed line) density is small to satisfy the source feature expectation matching constraints, leaving too much uncertainty in portions of the target distribution. On the other hand, when there are more restrictive features constraining the conditional label distribution, the classifier produces a better model of the data and gives more informative predictions with less target entropy and logloss, as in Figure 1(b). This relation inspires our contribution: leveraging kernel methods to provide higher dimensional features to the RBA classifier without introducing a proportionate computational burden.

According to the representer theorem [Kimeldorf197182], the minimizer of a regularized empirical loss in a reproducing kernel Hilbert space can be represented by a linear combination of kernel products evaluated on the training data. Model parameters are then obtained by estimating the coefficients of this linear combination. However, in the robust bias-aware classification framework, the objective function of the dual problem is the regularized expected logarithmic loss under the target data distribution.
It cannot be computed explicitly using data because labeled target samples are unavailable. Meanwhile, the discrepancy between the distribution under which the risk function is evaluated and the distribution from which training data are sampled prevents us from applying the representer theorem directly.
[Figure 1: (a) first moment features, logloss 0.74, entropy 0.93; (b) third moment features, logloss 0.53, entropy 0.73.]
Performance comparison of the robust bias aware classifier using first-order features (a) and first-order through third-order features (b). Labeled source data samples ('o' and '+' classes) and the source (solid line) and target (dashed line) distributions that the data are drawn from are shown. The colormap represents the predicted probability of class '+'. The intersection of the source distribution and the target distribution is better predicted with third-order features and is much more uncertain when only using first moment features. The corresponding target logloss and entropy are shown.

A quantitative form of the representer theorem has been proposed that holds for the continuous case [de2004some]
in which a minimizer over a distribution, rather than over discrete samples, is sought. The minimizer of the regularized expected risk is represented as an expectation under the same probability distribution instead of a linear combination of the training data. We utilize this result to extend the representer theorem for RBA prediction in the covariate shift setting. We show that the minimizer of the regularized expected target risk can be represented as a reweighted kernel expectation under the source distribution. This enables us to apply kernel methods to the robust bias aware classifier.
In this paper, we explore the theoretical foundation of kernel methods for robust covariate shift prediction. We investigate the underlying effect brought by kernelization and establish consistency properties that are realized by applying kernel methods to RBA prediction. We then demonstrate the empirical advantages of the kernel robust bias aware classifier on synthetically biased benchmark datasets as well as datasets that have natural covariate shift bias.
Related Work
To address shifts between training and testing input distributions, known as covariate shift, existing methods often try to reweight the source samples to make them more representative of target distribution samples. The theoretical argument supporting this approach [shimodaira2000improving] is that reweighting is asymptotically optimal for minimizing target distribution logloss (equivalently, maximizing target distribution loglikelihood):

\hat{\theta} = \operatorname{argmax}_{\theta}\; \mathbb{E}_{\tilde{P}_{src}(x,y)}\left[ \frac{P_{trg}(x)}{P_{src}(x)} \log \hat{P}_{\theta}(y|x) \right],   (1)

where we use \hat{P}_{\theta}(y|x) to represent the estimated predictor and \tilde{P}_{src} is the empirical source distribution.
Most existing covariate shift research follows this idea of seeking an unbiased estimator of target risk
[sugiyama2005model]. Significant attention has been paid to estimating the density ratio P_{trg}(x)/P_{src}(x), which strongly impacts predictive performance [Bickel2009, dudik2005correcting]. Direct estimation methods estimate the density ratio by minimizing an information-theoretic criterion like the KL-divergence [sugiyama2008direct, kanamori2009efficient, yamada2011relative] or by matching kernel means [huang2006correcting, yu2012analysis], rather than estimating the ratio's numerator and denominator densities separately. Other methods [wen2014robust] treat the ratio as an internal model parameter and relate it to model misspecification. There are also methods for specific models of the covariate shift mechanism [zadrozny2004learning, sugiyama2007covariate]. Additional methods have also been recently proposed to address some of the limitations of importance weighting [reddi2015doubly].

Theoretical analyses have uncovered the brittleness of importance weighting for covariate shift by analyzing its statistical learning bounds [ben2007analysis, cortes2008sample]. Cortes et al. (cortes2010learning) establish generalization bounds for learning with importance weighting under covariate shift that only hold when the second moment of the importance weights is bounded, i.e., \mathbb{E}_{P_{src}(x)}[(P_{trg}(x)/P_{src}(x))^2] < \infty. When it is not bounded, a small number of data points with large importance weights can dominate the reweighted loss, resulting in high variance predictions.
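The dominance of a few heavily weighted points is easy to see in a small simulation. The Gaussian source/target pair below is our own illustrative choice (not a dataset from this paper); for P_src = N(0,1) and P_trg = N(1,1), the density ratio is exp(x - 1/2), whose second moment under the source is e:

```python
import numpy as np

rng = np.random.default_rng(0)

# Source P_src = N(0, 1), target P_trg = N(1, 1); the density ratio is
# w(x) = P_trg(x) / P_src(x) = exp(x - 1/2).
x = rng.normal(0.0, 1.0, size=10_000)    # source samples
w = np.exp(x - 0.5)                      # importance weights

# In theory E_src[w] = 1 and E_src[w^2] = e, so even this mild shift
# already produces substantial weight variance.
print(w.mean())         # close to 1
print((w ** 2).mean())  # close to e

# A small fraction of the samples carries a large share of total weight,
# so a reweighted loss is dominated by a few points.
top_share = np.sort(w)[-100:].sum() / w.sum()
print(f"weight share of the top 1% of samples: {top_share:.2f}")
```

As the source and target distributions move further apart, the weight second moment grows exponentially and the reweighted loss estimate becomes correspondingly unstable.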
Kernel methods have been employed for estimating the density ratio in importance weighting methods; for example, kernel mean matching [huang2006correcting, yu2012analysis] uses the core idea that the kernel mean of the source data in a reproducing kernel Hilbert space (RKHS) should be close to that of the reweighted target data, and the optimal density ratio is obtained by minimizing this difference. Kernel methods have also served as a bridge between the source and target domains in broader transfer learning or domain adaptation problems. In these approaches, kernel methods are used to project source data and target data into a latent space where the distance between the two distributions is small or can be minimized [pan2010survey].

These existing applications of kernel methods for covariate shift are orthogonal to our approach because they are based on empirical risk minimization formulations with the assumption that source data could somehow be transformed to match target distributions. This differs substantially from our robust approach.
Approach
Robust bias-aware classifier
The robust bias-aware classification model is based on a minimax robust estimation framework (2). Under this framework, an estimator player first chooses a conditional label distribution \hat{P}(y|x) to minimize the logloss, and then an adversarial player chooses a conditional label distribution \check{P}(y|x) from the set \Xi of statistic-matching conditional probabilities to maximize the logloss [liu2014robust]:

\min_{\hat{P}(y|x)} \max_{\check{P}(y|x) \in \Xi}\; \mathbb{E}_{P_{trg}(x)\check{P}(y|x)}\left[ -\log \hat{P}(Y|X) \right].   (2)
Under IID settings, it is known that robust loss minimization is equivalent and dual to empirical risk minimization [grunwald2004game]. From this perspective, RBA prediction modifies the dual robust loss minimization problem, in contrast to existing importance weighting methods, which modify the primal empirical risk minimization problem. After distinguishing between source and target distributions and using the logarithmic loss, the robust optimization problem reduces to a maximum entropy problem:

\max_{\check{P}(y|x) \in \Delta}\; H_{P_{trg}}(\check{Y}|X) \quad \text{such that: } \mathbb{E}_{P_{src}(x)\check{P}(y|x)}[f(X,Y)] = \tilde{c},   (3)
where \Delta defines the conditional probability simplex that \check{P}(y|x) must reside within, f(x,y) is a vector-valued feature function that is evaluated on input x and label y, and \tilde{c} is a vector of the expected feature values that corresponds with the feature function, which is approximated using source sample data in practice. Solving for the parametric form of \hat{P}(y|x) from this optimization problem yields:

\hat{P}_{\theta}(y|x) = \frac{\exp\left( \frac{P_{src}(x)}{P_{trg}(x)}\, \theta^{\top} f(x,y) \right)}{Z_{\theta}(x)},   (4)

with the normalization term defined as Z_{\theta}(x) = \sum_{y \in \mathcal{Y}} \exp\left( \frac{P_{src}(x)}{P_{trg}(x)}\, \theta^{\top} f(x,y) \right). Minimizing the target logarithmic loss,

\min_{\theta}\; \mathbb{E}_{P_{trg}(x)P(y|x)}\left[ -\log \hat{P}_{\theta}(Y|X) \right],   (5)
provides parameter vector estimates . This can be accomplished by approximating the gradient using source samples rather than approximating the objective function (5). After plugging (4) into (5), the gradient using source samples is:
\nabla_{\theta} = \mathbb{E}_{P_{src}(x)\hat{P}_{\theta}(y|x)}\left[ f(X,Y) \right] - \tilde{c},   (6)

with \tilde{c} = \mathbb{E}_{P_{src}(x)P(y|x)}[f(X,Y)], approximated in practice by the source sample mean \frac{1}{n}\sum_{i=1}^{n} f(x_i, y_i). Therefore, the RBA approach directly minimizes the expected target logloss rather than approximating it using importance weighting with finite source samples.
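As a concrete sketch of the parametric form (4) and the source-sample gradient (6), the minimal NumPy implementation below is our own illustration: the feature arrays, density ratios, and the plain gradient-descent loop are made-up stand-ins for the true quantities, not the paper's implementation.

```python
import numpy as np

def rba_probs(theta, F, ratio):
    """RBA conditional probabilities, following the form of (4).

    theta: (d,) parameter vector; F: (n, Y, d) features f(x_i, y);
    ratio: (n,) density ratios P_src(x_i) / P_trg(x_i)."""
    logits = ratio[:, None] * (F @ theta)          # (n, Y)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def rba_gradient(theta, F, y, ratio):
    """Source-sample estimate of the target-logloss gradient, as in (6):
    model feature expectation minus the empirical source feature mean."""
    n = len(F)
    P = rba_probs(theta, F, ratio)                     # (n, Y)
    model_feat = np.einsum('ny,nyd->d', P, F) / n      # E under P_src, P_hat
    c_tilde = F[np.arange(n), y].mean(axis=0)          # empirical c~
    return model_feat - c_tilde

# Tiny synthetic example (all values illustrative):
rng = np.random.default_rng(1)
n, Y, d = 50, 2, 3
F = rng.normal(size=(n, Y, d))
y = rng.integers(0, Y, size=n)
ratio = np.exp(rng.normal(size=n))    # stand-in density ratios
theta = np.zeros(d)
for _ in range(200):                  # plain gradient descent
    theta -= 0.1 * rba_gradient(theta, F, y, ratio)
P = rba_probs(theta, F, ratio)
```

A real implementation would estimate the density ratios separately and add regularization; the sketch only illustrates that the update uses source samples alone.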
As illustrated by Figure 1, the feature function of the RBA predictor forms the constraints that prevent high levels of uncertainty (entropy) in the target distribution. As a result, more extensive sets of feature constraints may be needed to appropriately constrain the RBA model to provide more certain predictions in portions of the input space where target data is more probable under the source distribution, like the intersection of the source and target distributions in Figure 1.
Extended representer theorem for RBA
Kernel methods are motivated in the RBA approach to provide a sufficiently restrictive set of constraints that forces generalization from source data samples to target data. However, the inability to directly apply empirical risk minimization in the RBA approach (5) complicates their incorporation, since kernel method applications often use empirical risk minimization as a starting point.

We extend the representer theorem to the RBA approach by first investigating the minimizer of the regularized expected target loss. Theorem 1 shows that the minimizer of a regularized expected target loss can instead be represented by a reweighted expectation under the source distribution. This lays the theoretical foundation for applying kernel methods to RBA, which essentially differs from traditional empirical risk minimization based methods in using the expected target loss as a starting point.
[Figure 2: (a) Linear, logloss 0.74, entropy 0.93; (b) Gaussian, logloss 0.65, entropy 0.86; (c) Polynomial-2, logloss 0.48, entropy 0.67; (d) Polynomial-3, logloss 0.41, entropy 0.45.]
Theorem 1.
Let \mathcal{X} be the input space and \mathcal{Y} be the output space, and let k be a positive definite real-valued kernel on \mathcal{X} \times \mathcal{X} with corresponding reproducing kernel Hilbert space H_k. If the training samples are drawn from a source distribution P_{src}(x) and the testing samples are drawn from a target distribution P_{trg}(x), any minimizer \theta^{*} = (\theta^{*}_y)_{y \in \mathcal{Y}} of (5) in H_k, defining the conditional label distribution,

\hat{P}_{\theta}(y|x) = \frac{\exp\left( \frac{P_{src}(x)}{P_{trg}(x)}\, \theta_y(x) \right)}{Z_{\theta}(x)},   (7)

admits a representation with a form such that each

\theta^{*}_y(\cdot) = \mathbb{E}_{P_{src}(x)}\left[ \alpha_y(x)\, k(x, \cdot) \right],   (8)

where \alpha_y: \mathcal{X} \to \mathbb{R}, for y \in \mathcal{Y}, with

\hat{P}_{\theta}(y|x) \propto \exp\left( \frac{P_{src}(x)}{P_{trg}(x)}\, \mathbb{E}_{P_{src}(x')}\left[ \alpha_y(x')\, k(x', x) \right] \right).   (9)
Proof.
Defining \beta(x) = \frac{P_{src}(x)}{P_{trg}(x)}, the robust bias-aware label distribution can be rewritten as \hat{P}(y|x) = \exp(\beta(x)\theta_y(x))/Z(x), with Z(x) = \sum_{y \in \mathcal{Y}} \exp(\beta(x)\theta_y(x)). The objective function (5) is then:

\min_{\theta \in H_k}\; \mathbb{E}_{P_{trg}(x)P(y|x)}\left[ -\log \hat{P}(Y|X) \right] + \frac{\lambda}{2} \sum_{y \in \mathcal{Y}} \|\theta_y\|^{2}_{H_k},   (10)

where \theta = (\theta_y)_{y \in \mathcal{Y}} is the function that we aim to find that minimizes this regularized expected loss. Let k be a positive definite real-valued kernel on \mathcal{X} \times \mathcal{X}; according to the generalized representer theorem [de2004some] in this expected risk case, the minimizer takes the form:

\theta^{*}_y(\cdot) = \mathbb{E}_{P_{trg}(x)}\left[ \tilde{\alpha}_y(x)\, k(x, \cdot) \right],

where \tilde{\alpha}_y: \mathcal{X} \to \mathbb{R}. Since the target label is not available in training, the minimizer cannot be represented directly by target data. Instead, we represent it using source data, which, for each y \in \mathcal{Y}, is:

\theta^{*}_y(\cdot) = \mathbb{E}_{P_{src}(x)}\left[ \frac{P_{trg}(x)}{P_{src}(x)}\, \tilde{\alpha}_y(x)\, k(x, \cdot) \right].

Given \alpha_y(x) = \frac{P_{trg}(x)}{P_{src}(x)}\, \tilde{\alpha}_y(x), we obtain (8).
∎
Kernel RBA parameter estimation
As in the non-kernelized RBA model, the objective function (5) is defined in terms of the labeled target distribution data, which is unavailable. However, the parametric model's form (7) bypasses this difficulty when employing the kernelized minimizer (8). In order to estimate the parameters \alpha_y, we derive the gradient of the kernel RBA predictor.

Corollary 1 (of Theorem 1).
The gradient (with respect to the kernelized parameters \alpha_y) of the regularized expected loss is obtained by approximating kernel evaluations under the source distribution with source sample kernel evaluations.
Proof.
Plugging (9) into (5), we obtain the form of the objective function represented by kernels and take derivatives with respect to \alpha_y; each term of the resulting gradient is an expectation of kernel evaluations under the source distribution, which can be approximated with source samples.
∎
Corollary 1 indicates that computing the gradient only requires source samples. This requires approximating the source distribution's expected kernel evaluations with empirical sample means. The need for this approximation is rooted in the idea of minimizing the exact expected target loss directly in kernel RBA; consequently, we use the empirical gradient to approximate the true gradient. However, the error can be controlled using standard finite sample bounds, such as Hoeffding bounds, so that the corresponding error in the objective is also bounded. In contrast, importance weighted empirical risk minimization (ERM) methods do not approximate the gradient, but approximate the training objective from the beginning, as in (1), which is essentially different from our method.
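For concreteness, the kernelized predictor can be sketched with source samples alone: with n source points, the minimizer in (8) is approximated by the sample mean (1/n) Σ_j α_y(x_j) k(x_j, x_i), so prediction needs only the source Gram matrix. The function and variable names, bandwidth, and data below are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def kernel_rba_probs(alpha, K, ratio):
    """Kernel RBA conditional probabilities from the source Gram matrix.

    alpha: (n, Y) coefficients alpha_y(x_j) at the source samples;
    K: (n, n) source Gram matrix K[j, i] = k(x_j, x_i);
    ratio: (n,) density ratios P_src(x_i) / P_trg(x_i).

    theta_y(x_i) is approximated by (1/n) * sum_j alpha[j, y] * K[j, i]."""
    n = K.shape[0]
    logits = ratio[:, None] * (K.T @ alpha) / n    # (n, Y)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

# Gaussian Gram matrix on illustrative 1-D source samples:
rng = np.random.default_rng(2)
x = rng.normal(size=(40, 1))
sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                  # bandwidth 1, an arbitrary choice
ratio = np.ones(40)                    # no shift, for illustration only
alpha = rng.normal(size=(40, 2))
P = kernel_rba_probs(alpha, K, ratio)
```

The coefficients alpha would be fit by descending the source-sample gradient of Corollary 1; the sketch only shows the prediction path.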
[Figure 3: (a) Linear, 100 samples, accuracy 0.760; (b) Gaussian, 200 samples, accuracy 0.774; (c) Gaussian, 300 samples, accuracy 0.789; (d) Gaussian, 400 samples, accuracy 0.815.]
Understanding Kernel RBA
In order to illustrate the effectiveness of kernel RBA, we consider the same datasets from Figure 1 and compare linear RBA and kernel RBA with different kernel types and parameters in Figure 2. Even though kernel methods are usually regarded as a way to introduce nonlinearity, their main effect in kernel RBA is the expansion of the constraint space for the adversarial player in the two-player game in (2). As shown in Figure 2, kernel RBA achieves better (smaller target logarithmic loss) and more informative (smaller target prediction entropy) predictions in the intersection of the source and target distributions, even though the true decision boundary is linear. Note that here the Gaussian kernel has a large bandwidth so as to obtain a more linear decision boundary for better visualization. Moreover, the difference between target entropy and logarithmic loss gradually shrinks across the last three panels. This corresponds with the property of RBA that the target logarithmic loss is always upper bounded by the target entropy (with high probability), as proven for a general case in previous literature [liu2015shift]. Therefore, when a larger number of constraints is imposed, i.e., kernel methods are applied, a more restrictive constraint set is formed for the adversary, so that the target entropy bounds the target loss more and more tightly.
Note that the choice of kernel and kernel parameters depends on the specific learning problem because we also need to account for overfitting in practice. The amount of bias also plays a role in how much the additional source constraints brought by kernel methods improve over the RBA method. Specifically, the larger the bias, the more RBA suffers from insufficient constraints from source sample data, which results in larger entropy in target predictions.
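The constraint-expansion view can be made concrete for the polynomial kernel: a degree-2 kernel implicitly imposes feature matching on all statistics up to order 2. The explicit feature map below is the standard decomposition of (1 + x·z)^2 (not notation from this paper) and reproduces the kernel exactly:

```python
import numpy as np
from itertools import combinations

def poly2_map(x):
    """Explicit feature map phi with phi(x)·phi(z) = (1 + x·z)^2."""
    feats = [1.0]
    feats += list(np.sqrt(2.0) * x)                 # first moments
    feats += list(x ** 2)                           # squared terms
    feats += [np.sqrt(2.0) * x[i] * x[j]            # cross terms
              for i, j in combinations(range(len(x)), 2)]
    return np.array(feats)

rng = np.random.default_rng(3)
x, z = rng.normal(size=5), rng.normal(size=5)
explicit = poly2_map(x) @ poly2_map(z)   # inner product of explicit features
implicit = (1.0 + x @ z) ** 2            # single kernel evaluation
```

Matching the kernel mean is thus equivalent to matching all moments up to order 2, while the Gram matrix lets RBA impose these constraints without ever enumerating the features.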
Consistency Analysis
We now analyze some theoretical properties of the kernel RBA method. As stated before, kernel RBA directly minimizes the regularized expected target loss. We start by defining this expected target loss explicitly, parameterized by the learned \theta, at a specific data point x as: loss_{\theta}(x) = \mathbb{E}_{P(y|x)}\left[ -\log \hat{P}_{\theta}(Y|x) \right], where \hat{P}_{\theta}(y|x) = \exp\left( \frac{P_{src}(x)}{P_{trg}(x)}\, \theta_y(x) \right)/Z_{\theta}(x) and Z_{\theta}(x) is the normalization term.
Theorem 2.
Let k be a bounded universal kernel, and let the regularization weight of the kernel RBA method tend to zero slower than O(1/\sqrt{n}). With \hat{\theta}_n as the parameter in the resulting predictor, \mathbb{E}_{P_{trg}(x)}\left[ loss_{\hat{\theta}_n}(X) \right] \to \inf_{\theta} \mathbb{E}_{P_{trg}(x)}\left[ loss_{\theta}(X) \right] in probability.
Proof.
loss_{\theta} is a Lipschitz loss because it follows the basic form of the logistic loss except that it contains one additional component: the density ratio. Given Theorem 1, the minimizer of the expected target loss can be represented using source samples. This implies that kernel RBA is consistent w.r.t. loss_{\theta} when equipped with a universal kernel [micchelli2006universal] on the source data, assuming the density ratio is accurate, according to consistency properties for Lipschitz losses [steinwart2005consistency]. ∎
Next, we explore whether the optimal expected loss on the target distribution also implies the optimal 0-1 loss on the target distribution.¹

¹ We assume the density ratio is accurately estimated in this case and leave the analysis for the case when it is approximate to future work.
Corollary 2 (of Theorem 2).
For any pair of distributions P_{src}(x) and P_{trg}(x) such that P_{trg}(x) > 0 implies P_{src}(x) > 0, if \hat{\theta}_n is the kernel RBA predictor satisfying all of the conditions in Theorem 2, then the expected target 0-1 risk of \hat{\theta}_n converges to the Bayes-optimal 0-1 risk.
Proof.
loss_{\theta} is a proper composite loss in both the binary [reid2010composite] and multiclass [vernet2011composite] cases, which means there is a constant c such that, for any x, the pointwise 0-1 regret of the estimated label probability \hat{P}(y|x) from RBA (4) with respect to the Bayes conditional label probability P(y|x) is bounded by c times the corresponding logloss regret. We then have the target expected 0-1 regret bounded by the expected logloss regret:

\mathbb{E}_{P_{trg}(x)}\left[ R_{01}(h_{\hat{P}}(X)) - R_{01}(h_{P}(X)) \right] \le \mathbb{E}_{P_{trg}(x)}\left[ regret_{01}(X) \right] \le c\, \mathbb{E}_{P_{trg}(x)}\left[ loss_{\hat{\theta}}(X) - loss_{\theta^{*}}(X) \right],

where h is a predictor function that maps a conditional label probability to a label. Here the first inequality is due to the property of plug-in classifiers and Jensen's inequality, and the second inequality comes directly from the definition of a proper loss. Therefore, according to Theorem 2, kernel RBA is consistent w.r.t. loss_{\theta}, and we then conclude that the target 0-1 risk converges to the optimum. ∎
Note that employing a universal kernel is a sufficient condition for consistency to hold. Therefore, kernel methods not only provide a larger number of features without increasing the computational burden, but also enable these theoretical properties to hold for kernel RBA.
We demonstrate in Figure 3 how the true decision boundary in the target distribution is recovered with an increasing number of samples when the source and target distributions are fairly close. As shown in the first panel, the decision boundary in the linear case is tilted due to noise. Equipped with more samples and a universal kernel (the Gaussian kernel), the decision boundary shifts to align with the true one. At the same time, the accuracy on target data improves, roughly converging to the optimum. This behavior corresponds to Corollary 2: the 0-1 loss of kernel RBA should converge to the optimal 0-1 loss in the limit.
As a comparison, we show the logloss and accuracy of the Kernel IW (solid line) and Kernel Robust (dashed line) methods over 20 repeated experiments with an increasing number of samples in Figure 4. The dataset is similar to the example in Figure 3, with 10% noise and source and target distributions closely overlapping; the kernel used is a Gaussian kernel. As shown by the error bars, even though the importance weighted loss converges to the target loss in the limit in theory, it suffers from larger variance and sensitivity to noise in practice when only a limited number of samples is available. The reason is that it can be dominated by data with large weights, like the points with '+' labels in the upper-right corner of Figure 3. Those noisy points push the decision boundary toward the lower left in order to suffer less logloss. On the other hand, Kernel Robust is more robust to noise and keeps reducing the variance while improving the mean logloss and accuracy. This is due not only to the inherently more modest predictions that robust methods produce on biased target distributions, but also to the consistency properties stated in Theorem 2 and Corollary 2. Even though the number of samples is still small here, the source and target distributions are close enough to reflect the convergence tendency as the number of source samples increases.
[Figure 4: (a) Logloss; (b) Accuracy.] The error bars show the 95% confidence interval of the sampling distribution over 20 repeated experiments. IW methods suffer from large variance, while robust methods gradually reduce variance and improve logloss and accuracy more consistently.
Experiments
In this section, we demonstrate the advantages of our kernel RBA approach on datasets that are either synthetically biased via sampling
or naturally biased by a differing characteristic or noise.
We chose three datasets from the UCI repository [UCI, dataset_vehicle] for the synthetically biased experiments, based on the criteria that each contains approximately 1,000 or more examples and has minimal missing values: Vehicle, Segment, and Sat. For each dataset, we synthetically generate 20 separate experiments by taking 200 source samples and 200 target samples, generally following the sampling procedure described in huang2006correcting (huang2006correcting), which we summarize as:


Separate the data into source and target portions according to the mean of a variable;

Randomly sample the target portion as the target dataset;

In the source portion, calculate the sample mean \bar{x} and sample covariance \hat{\Sigma}, then sample in proportion to weights generated from a multivariate Gaussian with mean \bar{x} and covariance \hat{\Sigma} as the source dataset. If the dimension is too large to sample any points, perform PCA first and use the first several principal components to obtain the weights.
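The biased subsampling step above can be sketched as follows. The function and variable names are our own, and the exact weighting scheme in huang2006correcting may differ in detail:

```python
import numpy as np

def biased_subsample(X, n_samples, rng):
    """Draw a biased 'source' subsample: each candidate point is weighted
    by an (unnormalized) multivariate Gaussian density with the pool's
    sample mean and sample covariance."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    diff = X - mu
    prec = np.linalg.pinv(cov)            # pseudo-inverse for robustness
    # Mahalanobis-style quadratic form, one value per row of X
    w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, prec, diff))
    p = w / w.sum()
    idx = rng.choice(len(X), size=n_samples, replace=False, p=p)
    return X[idx]

rng = np.random.default_rng(4)
pool = rng.normal(size=(500, 3))          # stand-in data pool
src = biased_subsample(pool, 200, rng)
```

Because the weights concentrate near the pool mean, the subsample over-represents central points, creating the source/target mismatch that the experiments rely on.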
We also investigate three naturally biased covariate shift datasets. One of them is Abalone, in which we use the sex variable (male, female, and infant) to create bias. Specifically, we use infant as source samples and the rest as target samples. Note that we use the simplified 3-category classification problem of the Abalone dataset as described in Clark et al. [clark1996quantitative] and also sample 200 data points each for the source and target datasets. We chose this data because the sex variable makes source-target separation easy and reasonable, and allows the covariate shift assumption to generally hold. In addition, we evaluate our methods on the MNIST dataset [lecun1998gradient], which we reduce to binary predictive tasks of differentiating '3' versus '8' and '7' versus '9'. We add biased Gaussian noise with mean 0.2 and standard deviation 0.5 to the testing data to form the covariate shift. We randomly sample 2000 training and testing samples and repeat the experiments 20 times. Shown in Figure 5 is the comparison between one batch of training samples and testing samples.

[Figure 5: (a) Training Samples; (b) Testing Samples.]

Dataset  Kernel Robust  Kernel LR  Kernel IW  Robust  LR  IW
Vehicle  1.92  16.41  87.69  1.94  8.15  4.94
Segment  2.53  9.62  83.75  2.55  4.37  4.01
Sat  2.44  205.27  111.57  2.57  13.27  8.95
Abalone  1.58  8.52  6.91  1.59  8.73  2.09
MNIST7v9  0.42  0.44  0.49  0.55  0.80  0.59
MNIST3v8  0.39  0.46  0.41  0.48  0.84  0.60
Methods
We evaluate our approach and five other methods:
Kernel robust bias aware classifier (Kernel Robust) adversarially minimizes the target distribution logloss using kernel methods, trained using direct gradient calculations as in Corollary 1.
Kernel logistic regression (Kernel LR) ignores the covariate shift and maximizes the regularized source data conditional likelihood, \sum_{i=1}^{n} \log \hat{P}_{\theta}(y_i|x_i) - \frac{\lambda}{2}\|\theta\|^{2}, where \hat{P}_{\theta} is the kernelized logistic model and \lambda is the regularization constant.

Kernel importance weighting method (Kernel IW) maximizes the conditional target data likelihood as estimated using importance weighting with the density ratio, \sum_{i=1}^{n} \frac{P_{trg}(x_i)}{P_{src}(x_i)} \log \hat{P}_{\theta}(y_i|x_i) - \frac{\lambda}{2}\|\theta\|^{2}.
Linear robust bias aware prediction (Robust) adversarially minimizes the target distribution logloss without kernelization, i.e., only first order features are used, trained using direct gradient calculations (6).
Linear logistic regression (LR) utilizes only first order features in the source conditional log likelihood maximization.
Linear importance weighting method (IW) uses first order features only to maximize reweighted source likelihood.
Model Selection
For each kernelized method, we employ a polynomial kernel of order 2. We choose the regularization parameter by 5-fold cross validation, or importance weighted cross validation (IWCV), over a range of candidate values. We apply traditional cross validation to Kernel LR and LR, and apply IWCV to both the importance weighting methods and the robust methods. Note that traditional cross validation is no longer valid in the covariate shift setting, because under the covariate shift assumption the source marginal distribution of x differs from the target distribution [sugiyama2007covariate]. Though IWCV was originally designed for importance weighting methods, it is proven to be unbiased for any loss function. We apply it to perform model tuning for our robust methods, even though the error estimate variance could be large.
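The IWCV risk estimate can be sketched in a few lines. This is a simplified version with precomputed per-sample held-out losses (a full implementation refits the model on each fold's complement); the function names are our own:

```python
import numpy as np

def iwcv(heldout_losses, heldout_ratios, k=5, seed=0):
    """Importance weighted cross validation: average over k folds of the
    held-out losses reweighted by the density ratio P_trg(x)/P_src(x)
    [sugiyama2007covariate]."""
    idx = np.random.default_rng(seed).permutation(len(heldout_losses))
    folds = np.array_split(idx, k)
    return float(np.mean([np.mean(heldout_ratios[f] * heldout_losses[f])
                          for f in folds]))

# With uniform ratios, IWCV reduces to ordinary cross validation:
uniform_risk = iwcv(np.ones(100), np.ones(100))
```

The reweighting makes each fold's score an unbiased estimate of the target risk, which is why IWCV remains valid under covariate shift where ordinary cross validation is not.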
Logistic regression as density estimation
We use a discriminative density estimation method that leverages the logistic regression classifier for estimating the density ratios. According to Bayes rule, \frac{P_{trg}(x)}{P_{src}(x)} = \frac{P(trg|x)}{P(src|x)} \cdot \frac{P(src)}{P(trg)}, where the second ratio is estimated from the numbers of source and target examples, and the first is obtained by training a classifier with source data labeled as one class and target data as the other. Similar ideas also appear in recent literature [lopez2016revisiting]. The resulting density ratio is also closely controlled by the amount of regularization; we choose the regularization weight by cross validation.
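A minimal sketch of this estimator follows; our own gradient-descent logistic model stands in for whichever regularized classifier is actually used, and the Gaussian test distributions are illustrative only:

```python
import numpy as np

def fit_logistic(X, z, lam=1e-2, lr=0.5, steps=500):
    """L2-regularized logistic regression by gradient descent; z in {0,1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * (Xb.T @ (p - z) / len(z) + lam * w)
    return w

def density_ratio(X_query, X_src, X_trg, **kw):
    """Estimate P_trg(x)/P_src(x) via Bayes rule:
    ratio = [P(trg|x) / P(src|x)] * [P(src) / P(trg)],
    with the class prior ratio taken from the sample counts."""
    X = np.vstack([X_src, X_trg])
    z = np.r_[np.zeros(len(X_src)), np.ones(len(X_trg))]   # 1 marks target
    w = fit_logistic(X, z, **kw)
    Xb = np.hstack([X_query, np.ones((len(X_query), 1))])
    p_trg = 1.0 / (1.0 + np.exp(-Xb @ w))
    return (p_trg / (1.0 - p_trg)) * (len(X_src) / len(X_trg))

# Illustrative check: source N(0,1), target N(1,1); the true ratio
# exp(x - 1/2) increases in x, and the estimate should as well.
rng = np.random.default_rng(5)
Xs = rng.normal(0.0, 1.0, size=(500, 1))
Xt = rng.normal(1.0, 1.0, size=(500, 1))
r = density_ratio(np.array([[0.0], [1.5]]), Xs, Xt)
```

Stronger regularization in `fit_logistic` flattens the classifier's probabilities and hence pulls the estimated ratios toward a constant, which is the control knob mentioned above.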
Performance Evaluation
We compare the average logloss, -\frac{1}{n}\sum_{i=1}^{n} \log \hat{P}(y_i|x_i), for each method in Table 1. We perform a paired t-test between each pair of methods. We indicate the methods that have the best performance in bold, along with methods that are statistically indistinguishable from the best (paired t-test with 0.05 significance level). As shown in the table, the average logloss of the Kernel Robust method is significantly better than, or not significantly worse than, all of the alternatives on all of the datasets. Moreover, we observe the following.

First, the logloss of Kernel Robust and Robust is bounded by the uniform distribution baseline, while the LR and IW methods can be arbitrarily worse when the bias is large, as in Vehicle. This aligns with the properties of robust methods: when the bias is large, the density ratio becomes small and results in uniform predictions. This indicates that robust methods should be preferred when robustness or safety is a concern and the amount of covariate shift is large.
Second, Kernel Robust consistently improves performance over Robust, while kernelization may harm the LR and IW methods, as in Sat. The reason is that when the implicit assumption in LR and IW that (reweighted) source features generalize to the target distribution no longer holds, incorporating higher-dimensional features can make predictions worse. For Kernel Robust and Robust, even though overfitting could still be a concern, the density ratio adjusts the certainty of the prediction and functions like a regularizer based on the data's density under the training and testing distributions, so they suffer less from overfitting.
Finally, we find that the improvement of Kernel Robust over Robust is related to how far the source input distribution is from the target input distribution. The natural bias in Abalone comes from one feature variable and could be smaller than the bias in the synthetic data. This could be why the improvement in logloss on Abalone is smaller than on the other datasets.
Conclusion
Providing meaningful and robust predictions under covariate shift is challenging. Kernel methods are one avenue for considering large or infinite feature spaces without incurring a proportionate computational burden. We investigated the theoretical foundations for applying kernel methods to RBA by extending the generalized representer theorem, which makes it possible to represent the minimizer of the regularized expected loss with reweighted kernel expectations under the source distribution, and therefore to minimize the objective using gradient calculations that depend only on source samples. In addition, we showed that kernel RBA provides more restrictive feature matching constraints and tighter entropy bounds for the target loss, and demonstrated that kernel RBA is consistent both w.r.t. its own expected target loss and w.r.t. the 0-1 loss. We experimentally validated the advantages of kernelized RBA with synthetically subsampled benchmark data and naturally biased data.
References
Appendix A Supplementary Materials
Dataset Details
We show more detailed information about the datasets used in the experiments in the following tables. We expect the method to also work for higher-dimensional datasets when equipped with accurate density ratio estimation. Since the development and analysis of this paper focus on the Kernel RBA method itself and not on density estimation, we believe smaller datasets are more suitable for the evaluation. We leave the problem of being robust to possibly inaccurate density ratios in higher dimensions to future work.
Dataset  Features  Examples  Classes
Vehicle  18  846  4
Segment  19  2310  7
Sat  36  6435  7
Abalone  7  4177  3
MNIST3v8  784  5885  2
MNIST7v9  784  5959  2
Accuracy analysis
We investigate the accuracy (the complement of the misclassification error) of the predictions provided by each of the six approaches on both the synthetically biased datasets and the naturally biased datasets (in Table 3), where the significantly best performance under a paired t-test is shown in bold. The significance level here is 0.05. Despite the discrepancy between the logarithmic loss and the misclassification error, the Kernel Robust approach provides statistically better performance than the alternative methods, except on the Abalone dataset. The logarithmic loss is an upper bound of the 0-1 loss. However, the bound can be somewhat loose, so a lower logloss does not necessarily indicate a smaller classification error rate. This is a natural outcome of using the logarithmic loss for convenience of optimization. Since logloss is the natural loss measure for probabilistic prediction and is what all methods optimize (rather than accuracy), we validate our method by comparing to other methods using it. Accuracy and logloss do not correlate perfectly, so it is unsurprising that this small difference exists on a measure not being directly optimized.
Dataset  Kernel Robust  Kernel LR  Kernel IW  Robust  LR  IW
Vehicle  38%  37%  33%  36%  36%  28%
Segment  71%  70%  37%  67%  68%  36%
Sat  33%  30%  28%  10%  10%  16%
Abalone  46%  43%  42%  48%  47%  39%
MNIST3v8  88%  86%  86%  87%  75%  85%
MNIST7v9  87%  85%  86%  86%  71%  83%