In supervised learning, there is a (mostly implicit) assumption that the training data is an unbiased sample of the underlying distribution of interest. However, that may not be the case: in a variety of problems there is an unknown bias in the sampling procedure. Such biases arise due to environmental effects, such as temperature in different genome sequencing centers [1, 2, 3], or due to the use of particular measuring instruments, such as types of cameras in computer vision [4, 5]. This means the training dataset (source domain) and the test dataset (target domain) are effectively generated by different distributions, and generalization might no longer be possible. The challenge lies in using the labeled source data and the unlabeled target data to classify new target data; a problem setting often referred to as domain adaptation, transfer learning or sample selection bias [6, 7, 8, 9, 10]. Most research focuses on classifiers that incorporate information on the difference between the data in both domains, but unfortunately most of these approaches overlook the role of the regularization parameter.
Regularization is used to combat overfitting of complex models and is a vital component of most classifiers, ensuring they generalize to unseen data. It consists of a trade-off between how well the classifier discriminates the training samples and how complex it must become to do so. This balance is described by the regularization parameter, which is usually estimated by holding out a small subset of labeled data and evaluating the trained classifier on it (cross-validation). However, since no labeled target samples are available, it is not possible to construct a target validation set. If one alternatively constructs a validation set from source data, the estimator converges in distribution to the source risk, not the target risk.
In this paper, we study how the generalization performance of a classifier behaves as a function of the regularization parameter and the domain dissimilarity. Many factors influence the value of the optimal regularization parameter, such as the moments of the class-conditional distributions in each domain (differences in variance, skewness, etc.), concept drift (different class priors in each domain), the type of adapting classifier (some require less regularization than others) and high-dimensional distribution estimation errors, but in this paper we focus on differences in variance between domains. The first correction that comes to mind consists of scaling the source validation risk with importance weights, and although this remedies the problem somewhat, we show that the optimal regularization parameter for the target domain remains underestimated.
The paper is outlined as follows: Section II reviews the regularized empirical risk minimization framework and identifies the problem with selecting the optimal regularization parameter. Section III illustrates the covariate shift setting and how the problem might be resolved there. Section IV-A considers several importance weight estimators with diverse properties, while Sections IV-B and IV-C present experimental evaluations of their estimates.
II Estimation Problem
For a classification problem with a sample space $\mathcal{X}$ and an event space $\Sigma$, the domains are biased samplings resulting in probability spaces with different probability measures $p_{\mathcal{S}}$ and $p_{\mathcal{T}}$. Denote $x$ as the random variable associated with the source domain, $z$ as the random variable associated with the target domain and the classes as elements of $\mathcal{Y} = \{-1, +1\}$. Source data with labels consists of $n$ samples from $p_{\mathcal{S}}(x, y)$, denoted as $\{(x_i, y_i)\}_{i=1}^{n}$, and target data with labels consists of $m$ samples from $p_{\mathcal{T}}(z, u)$, denoted as $\{(z_j, u_j)\}_{j=1}^{m}$. A classifier is a function that takes data as input and outputs a class prediction, $h : \mathcal{X} \rightarrow \mathcal{Y}$.
II-B Regularized Risk Minimization
The risk minimization framework allows one to construct classifiers by searching a class of hypothesis functions $\mathcal{H}$ (e.g., linear) and selecting the one that minimizes the expected loss $\ell$. The source and target risk are defined respectively as:

$$R_{\mathcal{S}}(h) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} \ell\big(h(x), y\big)\, p_{\mathcal{S}}(x, y)\, \mathrm{d}x$$

$$R_{\mathcal{T}}(h) = \int_{\mathcal{X}} \sum_{u \in \mathcal{Y}} \ell\big(h(z), u\big)\, p_{\mathcal{T}}(z, u)\, \mathrm{d}z$$

Note that by virtue of the shared sample space of the source and target domains, the differentials $\mathrm{d}x$ and $\mathrm{d}z$ are interchangeable, and that, for any $h$, the risks differ only through the joint probabilities $p_{\mathcal{S}}(x, y)$ and $p_{\mathcal{T}}(z, u)$. The goal is to find a hypothesis $h \in \mathcal{H}$, based on a source sample, that minimizes the target risk.
Unfortunately, minimizing the empirical source risk with respect to $h$ often leads to a solution that does not generalize well to other samples (overfitting), let alone samples from another distribution. In order to restrict the classifier's ability to match the training sample as well as possible, a complexity term, in the form of the $\ell_p$-norm of the hypothesis, is added to the empirical risk during training:

$$\hat{R}_{\lambda}(h \mid \mathcal{I}) = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \ell\big(h(x_i), y_i\big) + \lambda\, \|h\|_{p}$$

where $\mathcal{I}$ denotes the set of indices used to select the training samples, $|\cdot|$ denotes cardinality and $\|\cdot\|_p$ denotes the $\ell_p$-norm. For the remainder of the paper, we work with the $\ell_2$-norm.
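For the quadratic loss and a linear hypothesis, the regularized empirical risk has a closed-form minimizer. The sketch below is our own illustration (labels encoded as $\pm 1$; folding the sample average into the normal equations is a convention, not something prescribed above):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n) * sum_i (x_i . theta - y_i)^2 + lam * ||theta||_2^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# toy source sample with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))

theta_free = ridge_fit(X, y, 0.0)   # unregularized least squares
theta_reg = ridge_fit(X, y, 10.0)   # heavily regularized
```

Increasing $\lambda$ shrinks the norm of the hypothesis, trading training fit against complexity.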
The regularization parameter $\lambda$ trades off the average loss and the $\ell_2$-norm. It is usually estimated by defining a set of values $\Lambda$, training a classifier for each $\lambda \in \Lambda$ and selecting the $\lambda$ that minimizes the empirical risk evaluated on a disjoint validation dataset. The set of regularized classifiers can be denoted as:

$$\{\, h_{\lambda} \mid \lambda \in \Lambda \,\}$$

where $h_{\lambda}$ refers to the classifier that is trained using $\hat{R}_{\lambda}$. The regularization parameter space $\Lambda$ is often taken to be an exponentially increasing set of nonnegative values. Note that regularization is added during training, but not during evaluation.
Evaluation of a classifier consists of computing its risk on a novel dataset. We will be studying two novel sets, the first being held-out source data, because that is usually the only validation data available, and the second being target data, which is the actual measure of interest but is usually not available due to the lack of target labels. Taking the quadratic loss, the resulting risk is also known as the Mean Squared Error (MSE). Plugging in the held-out source validation data, indexed by $\mathcal{V}$ disjoint from the training set $\mathcal{I}$, and the labeled target samples, the Mean Squared Errors are:

$$\widehat{\mathrm{MSE}}_{\mathcal{S}}(\lambda) = \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \big( h_{\lambda}(x_i) - y_i \big)^2$$

$$\widehat{\mathrm{MSE}}_{\mathcal{T}}(\lambda) = \frac{1}{m} \sum_{j=1}^{m} \big( h_{\lambda}(z_j) - u_j \big)^2$$
Cross-validation consists of holding out each source sample at least once, training a classifier on the remainder and evaluating it on the held-out validation set. One round of cross-validation is performed for each $\lambda \in \Lambda$, and the $\lambda$ that minimizes the validation Mean Squared Error is the estimated regularization parameter.
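The selection procedure can be sketched as a grid search over $\Lambda$; a single held-out split is shown here for brevity, and a linear least-squares classifier stands in for the generic regularized classifier:

```python
import numpy as np

def fit(X, y, lam):
    """Regularized least-squares classifier (closed form)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def select_lambda(X_tr, y_tr, X_val, y_val, lambdas):
    """Return the lambda from the grid with the lowest validation MSE."""
    mses = []
    for lam in lambdas:
        theta = fit(X_tr, y_tr, lam)
        mses.append(np.mean((X_val @ theta - y_val) ** 2))
    return lambdas[int(np.argmin(mses))]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=200))
grid = [0.0, 0.01, 0.1, 1.0, 10.0]
lam_hat = select_lambda(X[:100], y[:100], X[100:], y[100:], grid)
```

Since the validation split is drawn from the source domain, `lam_hat` estimates the source-optimal parameter, which is exactly the problem identified below.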
For any $h_{\lambda}$, the empirical source validation risk converges to the true source risk as validation sets are independently sampled infinitely many times:

$$\frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \big( h_{\lambda}(x_i) - y_i \big)^2 \;\longrightarrow\; R_{\mathcal{S}}(h_{\lambda})$$

which is unfortunately not equal to the true target risk $R_{\mathcal{T}}(h_{\lambda})$. Furthermore, the larger the difference between $p_{\mathcal{S}}$ and $p_{\mathcal{T}}$, the larger the difference between the minimizers of $R_{\mathcal{S}}$ and $R_{\mathcal{T}}$ with respect to $\lambda$, and the larger the error in estimating the optimal regularization parameter.
III Covariate Shift
A natural approach to designing a corrected cross-validation procedure would be to employ some functional relation between the source and target risks. Fortunately, such a relation exists for a subset of the class of domain adaptation problems: if one makes the covariate shift assumption that the class posterior distributions are equivalent in both domains, $p_{\mathcal{S}}(y \mid x) = p_{\mathcal{T}}(u \mid z)$, then the joint factorizes as $p_{\mathcal{T}}(x, y) = p_{\mathcal{S}}(y \mid x)\, p_{\mathcal{T}}(x)$, and the target risk can be rewritten into a weighted source risk:

$$R_{\mathcal{T}}(h) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} \ell\big(h(x), y\big)\, \frac{p_{\mathcal{T}}(x)}{p_{\mathcal{S}}(x)}\, p_{\mathcal{S}}(x, y)\, \mathrm{d}x$$
and the functional relation thus consists of weighting the source samples appropriately. It can be shown that, under the additional assumption of a small domain discrepancy, this problem setting is learnable.
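The rewriting above can be checked numerically: an expectation under the target marginal equals the importance-weighted expectation under the source marginal. A small Monte Carlo sketch with one-dimensional Gaussians (the choice $\sigma_{\mathcal{S}} = 1$, $\sigma_{\mathcal{T}} = 0.8$ is ours, purely for illustration):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=200_000)               # samples from p_S = N(0, 1)
w = gauss_pdf(x, 0.0, 0.8) / gauss_pdf(x, 0.0, 1.0)  # w(x) = p_T(x) / p_S(x)

# the weighted source estimate of E[x^2] should recover the
# target second moment sigma_T^2 = 0.8^2 = 0.64
weighted_second_moment = np.mean(w * x ** 2)
```

Note that this works smoothly here because the target is narrower than the source; the converse case produces exploding weights, as discussed in Section IV.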
III-A Generating a covariate shift setting
Since we are restricting the analysis to covariate shift settings, we need to generate such a problem. First, we choose a set of source class-conditional distributions $p_{\mathcal{S}}(x \mid y)$ and a set of priors $p_{\mathcal{S}}(y)$, and compute the class posterior distributions through Bayes' rule. Then, by choosing a different target marginal distribution $p_{\mathcal{T}}(z)$, multiplying by the derived class posterior distributions and inverting Bayes' rule, the class-conditional target distributions are obtained. Figure 1 (top) visualizes an example of this problem for Gaussian class-conditional distributions. We plotted the labeled source distributions in red and blue with the unlabeled target distributions in black. The class posteriors of this problem are plotted in Figure 1 (bottom), and are equivalent. An artificial dataset can be generated by sampling from these distributions, either through inverse transform sampling or rejection sampling.
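The generation procedure can be sketched as follows. The class means, priors and target marginal below are our own illustrative choices; because the target labels are drawn from the posterior derived from the source, $p_{\mathcal{S}}(y \mid x) = p_{\mathcal{T}}(u \mid z)$ holds by construction:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior_pos(x):
    """Shared class posterior p(y = +1 | x), derived via Bayes' rule from
    equal priors and source class-conditionals N(-1, 1) and N(+1, 1)."""
    num = 0.5 * gauss_pdf(x, +1.0, 1.0)
    return num / (num + 0.5 * gauss_pdf(x, -1.0, 1.0))

def sample_domains(n, m, sigma_T, rng):
    # source: draw a label first, then the feature from its class-conditional
    ys = np.where(rng.random(n) < 0.5, 1, -1)
    xs = rng.normal(ys.astype(float), 1.0)
    # target: draw the feature from the shifted marginal, then the label
    # from the SAME posterior, so the covariate shift assumption holds
    zt = rng.normal(0.0, sigma_T, size=m)
    ut = np.where(rng.random(m) < posterior_pos(zt), 1, -1)
    return xs, ys, zt, ut

rng = np.random.default_rng(3)
xs, ys, zt, ut = sample_domains(500, 2000, 2.0, rng)
```

Drawing the label after the feature avoids explicit rejection sampling in this one-dimensional case.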
If we fix the source class-conditional distributions to be Gaussians with unit variance, $\mathcal{N}(x \mid \mu_{-}, \sigma^2_{\mathcal{S}})$ for the blue class and $\mathcal{N}(x \mid \mu_{+}, \sigma^2_{\mathcal{S}})$ for the red class, then we can generate 5 problem settings by choosing 5 different target distributions. Figure 2 (top) shows 5 Gaussian target distributions with equal means but with different variances $\sigma^2_{\mathcal{T}}$. If we train a classifier based on the source class-conditional distributions and evaluate it using the target MSE, it becomes apparent that the difference between the minimizer of the source risk and that of the target risk increases as the difference between the distributions increases. Figure 2 (bottom) plots the MSE as a function of $\lambda$ for the 5 covariate shift problems, with the minimum of each curve marked with a black square. Note that for $\sigma^2_{\mathcal{T}} = \sigma^2_{\mathcal{S}}$ the distributions are equivalent and the target minimizer coincides with the minimizer of the source risk. The curves show a gradual increase in the minimizers as the target variance increases.
III-B Difference in MSE curves
If we minimize the MSE curves of both the source validation risk and the target risk with respect to the trained regularized classifier, written for linear classifiers as $h_{\lambda}(x) = x^{\top} \theta_{\lambda}$, we obtain:

$$\hat{\lambda}_{\mathcal{V}} = \operatorname*{arg\,min}_{\lambda \in \Lambda}\; \frac{1}{|\mathcal{V}|}\, \big\| X_{\mathcal{V}}\, \theta_{\lambda} - y_{\mathcal{V}} \big\|^2, \qquad \hat{\lambda}_{\mathcal{T}} = \operatorname*{arg\,min}_{\lambda \in \Lambda}\; \frac{1}{m}\, \big\| Z\, \theta_{\lambda} - u \big\|^2$$

where the subscripts $\mathcal{V}$ and $\mathcal{T}$ denote the optimal regularization parameter according to the source validation and the target risk, respectively. Studying these two forms, we see that these estimates of $\lambda$ differ mainly through their data inner products (i.e., the uncentered, unnormalized covariance matrices) $X_{\mathcal{V}}^{\top} X_{\mathcal{V}}$ and $Z^{\top} Z$. To illustrate this point, we can decompose the data through a singular value decomposition, $X_{\mathcal{V}} = U_{\mathcal{V}} D_{\mathcal{V}} V_{\mathcal{V}}^{\top}$ and $Z = U_{\mathcal{T}} D_{\mathcal{T}} V_{\mathcal{T}}^{\top}$, allowing us to express the minimizers as:

$$\hat{\lambda}_{\mathcal{V}} = \operatorname*{arg\,min}_{\lambda \in \Lambda}\; \frac{1}{|\mathcal{V}|}\, \big\| U_{\mathcal{V}} D_{\mathcal{V}} V_{\mathcal{V}}^{\top}\, \theta_{\lambda} - y_{\mathcal{V}} \big\|^2, \qquad \hat{\lambda}_{\mathcal{T}} = \operatorname*{arg\,min}_{\lambda \in \Lambda}\; \frac{1}{m}\, \big\| U_{\mathcal{T}} D_{\mathcal{T}} V_{\mathcal{T}}^{\top}\, \theta_{\lambda} - u \big\|^2$$

where the diagonal matrices $D_{\mathcal{V}}$ and $D_{\mathcal{T}}$ consist of the normalized singular values of $X_{\mathcal{V}}$ and $Z$. Apart from a change of basis from $V_{\mathcal{V}}$ to $V_{\mathcal{T}}$ and from $U_{\mathcal{V}}$ to $U_{\mathcal{T}}$, the difference lies in the scale of the eigenvalues.
If we apply a scaling operation to the validation risk, the difference between these curves can be minimized. Finding the optimal regularization parameter for the target domain is then equivalent to finding the optimal regularization parameter for the scaled validation risk.
III-C Importance Weighted Validation
Sugiyama et al. (2007) employ just such a scaling transformation in the form of importance weighting the validation risk, with the weights as estimates of the ratio of data marginals $p_{\mathcal{T}}(x) / p_{\mathcal{S}}(x)$. These weights scale the risk of each individual validation sample separately. This leads to an importance weighted source validation risk as follows:

$$\widehat{\mathrm{MSE}}_{\mathcal{W}}(\lambda) = \frac{1}{|\mathcal{V}|}\, \big( X_{\mathcal{V}}\, \theta_{\lambda} - y_{\mathcal{V}} \big)^{\top} W \big( X_{\mathcal{V}}\, \theta_{\lambda} - y_{\mathcal{V}} \big)$$

where $W$ is a matrix with the importance weights as its diagonal. This formulation has the following minimizer:

$$\hat{\lambda}_{\mathcal{W}} = \operatorname*{arg\,min}_{\lambda \in \Lambda}\; \frac{1}{|\mathcal{V}|}\, \big( X_{\mathcal{V}}\, \theta_{\lambda} - y_{\mathcal{V}} \big)^{\top} W \big( X_{\mathcal{V}}\, \theta_{\lambda} - y_{\mathcal{V}} \big)$$
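In code, importance weighting only changes the evaluation criterion of the grid search; a minimal sketch (the weights `w` would come from one of the estimators discussed in the next section):

```python
import numpy as np

def weighted_mse(theta, X_val, y_val, w):
    """Validation MSE with each held-out squared error scaled by w(x_i)."""
    residuals = X_val @ theta - y_val
    return np.mean(w * residuals ** 2)

# tiny worked example with hypothetical values
theta = np.array([1.0, -1.0])
X_val = np.array([[1.0, 0.0], [0.0, 1.0]])
y_val = np.array([1.0, 1.0])
mse_unweighted = weighted_mse(theta, X_val, y_val, np.ones(2))
mse_weighted = weighted_mse(theta, X_val, y_val, np.array([2.0, 0.5]))
```

With all weights equal to one this reduces to the unweighted source validation MSE; the grid search over $\Lambda$ stays the same.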
This ratio of probabilities can have a very large variance, depending on how likely it is to be evaluated for either extremely large target probabilities or extremely small source probabilities. Furthermore, in the small sample size setting, estimation errors increase the likelihood of encountering a numerical explosion, such as when 10 samples are drawn that lie so close together that the estimated target distribution resembles a Dirac distribution. Lastly, the cross-validation estimator has its own variance, which is now directly affected by the variance of the importance weight estimator. For a better understanding of the behavior of an importance weighted cross-validation estimator, we performed a number of experiments with a large diversity of weight estimators in the following section.
IV Experiments

We conducted an experiment on an artificial problem setting and one on a typical real-world domain adaptation problem where there is no knowledge of whether the covariate shift assumption holds. Our goal is to evaluate the ability of a number of both parametric and nonparametric importance weight estimators to correctly estimate the optimal regularization parameter in the target domain. These experiments illustrate that a large diversity of existing estimators tends to underestimate the optimal target parameter.
IV-A Importance weight estimators
We selected four importance weight estimators with a diverse set of behaviors.
IV-A1 Ratio of Gaussians (rG)
A baseline method of estimating the marginal data ratio through modeling each sample set with a separate Gaussian distribution:

$$\hat{w}(x) = \frac{\mathcal{N}\big(x \mid \hat{\mu}_{\mathcal{T}}, \hat{\sigma}^2_{\mathcal{T}}\big)}{\mathcal{N}\big(x \mid \hat{\mu}_{\mathcal{S}}, \hat{\sigma}^2_{\mathcal{S}}\big)}$$

where $\mathcal{N}$ denotes the Gaussian distribution function, the $\hat{\mu}$'s denote the estimated means of the subscripted sample sets and the $\hat{\sigma}^2$'s denote the estimated variances of the subscripted sample sets. Note that the data marginals in our problem are actually Gaussian, so this is the correctly specified model.
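A minimal univariate sketch of the rG estimator (plug-in means and variances; the function names are ours):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def rg_weights(x_eval, x_src, z_tgt):
    """Estimate w(x) = p_T(x) / p_S(x) by fitting one Gaussian per sample set."""
    mu_s, s_s = x_src.mean(), x_src.std()
    mu_t, s_t = z_tgt.mean(), z_tgt.std()
    return gauss_pdf(x_eval, mu_t, s_t) / gauss_pdf(x_eval, mu_s, s_s)

# when the fitted target density is wider than the source density,
# the weights blow up in the source tails
x_src = np.linspace(-1.0, 1.0, 50)
z_tgt = np.linspace(-3.0, 3.0, 50)
tail_weight = rg_weights(np.array([3.0]), x_src, z_tgt)
```

The division makes the tail behavior explicit: small source densities in the denominator inflate the weights, which is the instability discussed above.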
IV-A2 Kullback-Leibler Importance Estimation Procedure (kliep)
This popular method is based on minimizing the Kullback-Leibler divergence between the reweighted source samples and the target samples. Modeling the weights as a linear combination of basis functions, $\hat{w}(x) = \sum_{k} \alpha_k \phi_k(x)$, this amounts to:

$$\max_{\alpha \geq 0}\; \sum_{j=1}^{m} \log \sum_{k} \alpha_k \phi_k(z_j) \quad \text{s.t.}\quad \frac{1}{n} \sum_{i=1}^{n} \sum_{k} \alpha_k \phi_k(x_i) = 1$$

where the constraint avoids numerical explosions. For $\phi_k$ we chose a Gaussian kernel, with the kernel width estimated through a separate 3-fold cross-validation.
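The optimization can be sketched with a simple projected gradient ascent; this is our own fixed-step variant with Gaussian basis functions centred on the target samples and a hand-picked kernel width, not the original line-search procedure:

```python
import numpy as np

def kliep_weights(x_src, z_tgt, width, iters=200, step=1e-3):
    """Fit w(x) = sum_k alpha_k * k(x, c_k) by ascending the KLIEP
    log-likelihood sum_j log w(z_j), keeping (1/n) sum_i w(x_i) = 1."""
    def kernel(a, b):
        return np.exp(-0.5 * ((a[:, None] - b[None, :]) / width) ** 2)

    A = kernel(z_tgt, z_tgt)                  # basis evaluated at target samples
    b = kernel(x_src, z_tgt).mean(axis=0)     # constraint vector
    alpha = np.ones(len(z_tgt))
    alpha /= b @ alpha                        # start feasible
    for _ in range(iters):
        alpha += step * A.T @ (1.0 / (A @ alpha))   # gradient of log-likelihood
        alpha = np.maximum(alpha, 0.0)              # non-negativity
        alpha /= b @ alpha                          # re-impose the constraint
    return kernel(x_src, z_tgt) @ alpha             # weights at source points

rng = np.random.default_rng(4)
x_src = rng.normal(0.0, 1.0, 60)
z_tgt = rng.normal(0.0, 1.5, 60)
w = kliep_weights(x_src, z_tgt, width=1.0)
```

The final normalization makes the source-sample weights average to one exactly, which is the role of the constraint in the objective above.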
IV-A3 Kernel Mean Matching (kmm)
Another popular weight estimator is motivated by assigning weights that minimize the Maximum Mean Discrepancy (MMD) between the reweighted source and the target samples. The MMD is the distance between the means of two sets of samples under a worst-case transformation (one that pushes them as far apart as possible):

$$\min_{w}\; \Big\| \frac{1}{n} \sum_{i=1}^{n} w_i\, \Phi(x_i) - \frac{1}{m} \sum_{j=1}^{m} \Phi(z_j) \Big\|^2 \quad \text{s.t.}\quad 0 \leq w_i \leq B,\; \Big| \frac{1}{n} \sum_{i=1}^{n} w_i - 1 \Big| \leq \epsilon$$

where the constraints ensure that the weights are non-negative, bounded above and roughly sum to the sample set size. For the kernel, we selected a radial basis function with Silverman's rule of thumb for bandwidth selection. Huang et al. recommend setting $\epsilon = B / \sqrt{n}$, ensuring that the allowed deviation from the sample size depends on both the upper bound for each weight and the sample set size itself.
IV-A4 Nearest Neighbour (nn)
Lastly, we have a nonparametric estimator based on a Voronoi tessellation of the space. The procedure consists of assigning a weight to each source sample based on the number of target samples that are nearest neighbours of it; this count is proportional, up to the ratio of sample sizes, to the ratio of marginal distributions. It is expressed as:

$$\hat{w}_i = \big| \{\, z_j \mid z_j \in V_i \,\} \big|$$

where $V_i$ refers to the Voronoi cell of sample $x_i$. The tessellation can be smoothed by adding a value of 1 to each cell, a technique also known as Laplace smoothing.
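A one-dimensional sketch of the nn estimator; normalizing so the weights average to one is our choice, standing in for the scaling by the ratio of sample sizes:

```python
import numpy as np

def nn_weights(x_src, z_tgt, smooth=True):
    """Weight each source sample by the number of target samples whose
    nearest source neighbour it is (the target count in its Voronoi cell)."""
    dist = np.abs(z_tgt[:, None] - x_src[None, :])   # pairwise distances
    nearest = dist.argmin(axis=1)                    # Voronoi cell membership
    counts = np.bincount(nearest, minlength=len(x_src)).astype(float)
    if smooth:
        counts += 1.0                                # Laplace smoothing
    return counts * len(x_src) / counts.sum()        # normalize to mean 1

# target samples clustered near the first source sample get it a large weight
w = nn_weights(np.array([0.0, 10.0]), np.array([0.1, -0.1, 0.2]))
```

Because it only counts neighbours, this estimator avoids density estimation entirely, which is why it cannot numerically explode the way ratio-based estimators can.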
IV-B Artificial data
Our first experiment consists of an evaluation of the different importance weight estimators and their resulting minimizers of the weighted validation MSE. The set $\Lambda$ was constructed for linear least-squares classifiers, with a range for $\lambda$ from -100 to 500. For the source data, we drew 100 samples from two Gaussian class-conditional distributions with different means and unit variances. The target class-conditional distributions have the same means, but a different set of variances $\sigma^2_{\mathcal{T}}$. The ratio of the marginal distributions is sensitive in regions of low source probability; very small probabilities in the denominator cause the weight values to explode. Therefore, we expect the minimizers of the importance weight estimators to be close to the target minimizer for target variances smaller than the source variance, and we expect erratic behavior for target variances larger than the source variance. Table I displays the minimizers of $\lambda$ for the source validation risk, for the different importance weight estimators, for the actual ratio of marginals $p_{\mathcal{T}}(x)/p_{\mathcal{S}}(x)$, and for the empirical target risk. Shown are the means and standard errors over 100 repeats.
| $\hat{\lambda}$ | $\sigma^2_{\mathcal{T},1}$ | $\sigma^2_{\mathcal{T},2}$ | $\sigma^2_{\mathcal{T},3}$ | $\sigma^2_{\mathcal{T},4}$ | $\sigma^2_{\mathcal{T},5}$ | $\sigma^2_{\mathcal{T},6}$ |
|---|---|---|---|---|---|---|
| source validation | 4 (19) | 3 (20) | 5 (20) | 3 (20) | 4 (20) | 4 (19) |
| rG | -15 (20) | -10 (19) | 6 (17) | 30 (24) | 55 (41) | 58 (36) |
| kliep | -23 (33) | -2 (20) | 3 (20) | 0 (17) | 4 (24) | 17 (25) |
| kmm | 22 (24) | 17 (26) | 11 (25) | -1 (24) | -15 (21) | -14 (19) |
| nn | -18 (28) | -13 (21) | 4 (23) | 33 (24) | 53 (24) | 64 (26) |
| true ratio | -46 (47) | -33 (21) | 3 (18) | 46 (44) | 72 (50) | 77 (50) |
| target | 179 (137) | -24 (21) | 5 (21) | 102 (28) | 207 (38) | 296 (45) |
It seems that all importance weight estimators as well as the true ratio of marginals underestimate the target risk minimizer. Furthermore, kmm leads to increasingly smaller minimizers for an increasing target variance. Even though kliep is increasing, it still underestimates the true value the most. rG is the most accurate one, but that will probably not be the case if the marginal distributions are not Gaussian anymore (i.e., model misspecification). nn is the other most accurate one and lies closest to the true importance weights. Considering that it does not rely on an assumption of normality, it might be the preferred estimator in a more general setting.
IV-C Heart disease
The artificial data represents a case where we know exactly what the dissimilarity is between domains and whether our assumptions are valid. However, it is also interesting to evaluate on data where we do not have this knowledge. For this we selected a UCI dataset on medical data where the domain dissimilarity is caused by a geographically biased sampling of patients. The goal is to classify the presence of a heart disease based on symptoms. The four domains correspond to hospitals in Cleveland (C), Virginia (V), Hungary (H) and Switzerland (S), containing 303, 200, 294 and 123 samples, respectively. There are a total of 14 symptoms, but 2 contained so much missing data that they were removed from the set. All other missing data was imputed. Table II displays the minimizers found by the importance weight estimators compared with those found by the unweighted source validation risk and the empirical target risk, for all combinations of treating one hospital as the source domain and another as the target. Shown are the means and standard errors over 10 repetitions.
| Source | Target | source validation | rG | kliep | kmm | nn | target |
|---|---|---|---|---|---|---|---|
| C | V | 1 (5) | -1 (8) | 1 (5) | 9 (13) | 2 (5) | 500 (0) |
| C | H | 1 (4) | 4 (6) | 1 (6) | 2 (14) | 4 (5) | 500 (0) |
| C | S | 4 (6) | 7 (9) | 0 (5) | 9 (12) | -1 (9) | 500 (0) |
| V | H | 5 (5) | 9 (13) | 3 (6) | 2 (5) | 7 (9) | 417 (66) |
| V | S | 3 (4) | -1 (12) | 3 (6) | 2 (3) | 7 (8) | -60 (284) |
| H | S | 3 (6) | 34 (48) | 3 (8) | 44 (40) | 4 (6) | 500 (0) |
| V | C | 4 (5) | -1 (9) | 2 (4) | 2 (3) | 4 (4) | 500 (0) |
| H | C | 1 (5) | 0 (7) | 2 (7) | 31 (29) | 0 (6) | 500 (0) |
| S | C | 2 (4) | -1 (4) | 2 (4) | 1 (3) | 2 (4) | 488 (30) |
| H | V | 4 (4) | -15 (14) | 4 (7) | 25 (43) | 4 (9) | 500 (0) |
| S | V | 2 (4) | -1 (4) | 1 (7) | 1 (3) | 4 (4) | -95 (253) |
| S | H | 0 (4) | 4 (8) | 2 (6) | 0 (5) | 4 (4) | 289 (89) |
The results show that also for real datasets all importance weight estimators underestimate the optimal target regularization parameter. Note that the standard errors are 0 for all target minimizers with value 500, which is because 500 is the right boundary of the set $\Lambda$. Extending the range further would produce even larger values for the optimal target regularization parameter. It seems that kmm is the best performing estimator here. rG also produces reasonable results, but that would probably not be the case if we had not z-scored each feature first; the z-scoring ensures an overlap of the regions with high probability mass in each domain. The other estimators seem to find weight values close to 1, as their minimizers are not very different from those of the unweighted source validation risk.
Considering the significance of regularization to generalization, it would be interesting to further study factors that influence the difference between the risk minimizers in each domain. At the moment we assume that no concept drift has occurred (i.e., no difference between class priors in each domain), but if this assumption is violated then the difference in scale depends on the two dominant classes in each domain. The minimizers of the MSE would be dominated by the proportions of samples that belong to one class, which can get very complicated in the multi-class setting. Furthermore, it would be interesting to describe the minimizers in terms of general measures of domain dissimilarity, such as the discrepancy distance or the $\mathcal{H}\Delta\mathcal{H}$-divergence.
The main difficulty in estimating the appropriate weights lies in the fact that it is hard to estimate exactly how the two domains differ from each other. Most adaptation approaches are sensitive to only a particular type of relation between domains, or rely on assumptions that cannot be checked in advance. Furthermore, estimation errors tend to propagate: if the distributions of each domain's data marginals are poorly estimated, then the importance weights explode, leading to a more erroneous estimate of the optimal target regularization parameter. In domain adaptation settings with so many sources of uncertainty, it seems that simple methods work best.
We have shown an empirical analysis of regularization parameter estimation in the context of differing variances in covariate shift problems. It seems that the generalization performance of an unadapted source classifier can be improved by importance weighting the source validation risk. However, most popular weight estimators underestimate the optimal target regularization parameter.
This work was supported by the Netherlands Organization for Scientific Research (NWO; grant 612.001.301).
-  N. H. Shah, C. Jonquet, A. P. Chiang, A. J. Butte, R. Chen, and M. A. Musen, “Ontology-driven indexing of public datasets for translational bioinformatics,” BMC bioinformatics, vol. 10, no. 2, p. 1, 2009.
-  S. Mei, W. Fei, and S. Zhou, “Gene ontology based transfer learning for protein subcellular localization,” BMC bioinformatics, vol. 12, no. 1, p. 1, 2011.
-  Q. Xu and Q. Yang, “A survey of transfer and multitask learning in bioinformatics,” Journal of Computing Science and Engineering, vol. 5, no. 3, pp. 257–268, 2011.
-  K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in Computer Vision–ECCV 2010. Springer, 2010, pp. 213–226.
-  J. Hoffman, E. Rodner, J. Donahue, T. Darrell, and K. Saenko, “Efficient learning of domain-invariant image representations,” arXiv preprint arXiv:1301.3224, 2013.
-  C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh, “Sample selection bias correction theory,” in Algorithmic learning theory, 2008, pp. 38–53.
-  J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset shift in machine learning. The MIT Press, 2009.
-  S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, no. 1-2, pp. 151–175, 2010.
-  A. Margolis, “A literature review of domain adaptation with unlabeled data,” Rapport Technique, University of Washington, p. 35, 2011.
-  J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera, “A unifying view on dataset shift in classification,” Pattern Recognition, vol. 45, no. 1, pp. 521–530, 2012.
-  M. Sugiyama, M. Krauledat, and K.-R. Müller, “Covariate shift adaptation by importance weighted cross validation,” The Journal of Machine Learning Research, vol. 8, pp. 985–1005, 2007.
-  S. Ben-David, T. Lu, T. Luu, and D. Pál, “Impossibility theorems for domain adaptation,” in International Conference on Artificial Intelligence and Statistics, 2010, pp. 129–136.
-  M. Markatou, H. Tian, S. Biswas, and G. M. Hripcsak, “Analysis of variance of cross-validation estimators of the generalization error,” Journal of Machine Learning Research, vol. 6, pp. 1127–1168, 2005.
-  H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” Journal of statistical planning and inference, vol. 90, no. 2, pp. 227–244, 2000.
-  M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe, “Direct importance estimation with model selection and its application to covariate shift adaptation,” in Advances in neural information processing systems, 2008, pp. 1433–1440.
-  J. Huang, A. Gretton, K. M. Borgwardt, B. Schölkopf, and A. J. Smola, “Correcting sample selection bias by unlabeled data,” in Advances in neural information processing systems, 2006, pp. 601–608.
-  M. Loog, “Nearest neighbor-based importance weighting,” in Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on. IEEE, 2012, pp. 1–6.
-  M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
-  Y. Mansour, M. Mohri, and A. Rostamizadeh, “Domain adaptation: Learning bounds and algorithms,” arXiv preprint arXiv:0902.3430, 2009.