1 Introduction
When machine learning models are employed “in the wild”, the distribution of the data of interest (target distribution) can be significantly shifted compared to the distribution of the data on which the model was trained (source distribution). In many cases, the publicly available large-scale datasets with which the models are trained do not represent and reflect the statistics of a particular dataset of interest. This is for example relevant in managed services on cloud providers used by clients in different domains and regions, or medical diagnostic tools trained on data collected in a small number of hospitals and deployed on previously unobserved populations and time frames.
There are various ways to approach distribution shifts between a source data distribution $\mathbb{P}$ and a target data distribution $\mathbb{Q}$. If we denote input variables as $x$ and output variables as $y$, we consider the two following settings: (i) Covariate shift, which assumes that the conditional output distribution is invariant, $p(y|x) = q(y|x)$, between source and target distributions, while the input distribution $p(x)$ changes. (ii) Label shift, where the conditional input distribution is invariant, $p(x|y) = q(x|y)$, and the label distribution $p(y)$ changes from source to target. In the following, we assume that both input and output variables are observed in the source distribution whereas only input variables are available from the target distribution.
While covariate shift has been the focus of the literature on distribution shifts to date, label-shift scenarios appear in a variety of practical machine learning problems and warrant a separate discussion as well. In one setting, suppliers of machine-learning models such as cloud providers have large resources of diverse data sets (source set) to train the models, while during deployment, they have no control over the proportion of label categories.
In another setting of e.g. medical diagnostics, the disease distribution changes over locations and time. Consider the task of diagnosing a disease in a country with bad infrastructure and little data, based on reported symptoms. Can we use data from a different location with data abundance to diagnose the disease in the new target location in an efficient way? How many labeled source and unlabeled target data samples do we need to obtain good performance on the target data?
Apart from being relevant in practice, label shift is computationally more tractable than covariate shift and can therefore be mitigated more readily. The reason is that the outputs typically have a much lower dimension than the inputs. Labels are usually either categorical variables with a finite number of categories or have simple well-defined structures. Despite being an intuitively natural scenario in many real-world applications, even this simplified model has only been scarcely studied in the literature.
Zhang et al. (2013) proposed a kernel mean matching method for label shift which is not computationally feasible for large-scale data. The approach in Lipton et al. (2018) is based on importance weights that are estimated using the confusion matrix (also used in the procedures of Saerens et al. (2002); McLachlan (2004)) and demonstrates promising performance on large-scale data. Using a black-box classifier which can be biased, uncalibrated and inaccurate, they first estimate importance weights for the source samples and train a classifier on the weighted data. In the following we refer to the procedure as black box shift learning (BBSL), which the authors proved to be effective for large enough sample sizes.
However, there are three relevant questions which remain unanswered by their work: How can the importance weights be estimated in the low sample setting? What are the generalization guarantees for the final predictor which uses the weighted samples? How do we deal with the uncertainty of the weight estimation when only few samples are available? This paper aims to fill the gap in terms of both theoretical understanding and practical methods for the label shift setting and thereby move a step closer towards a more complete understanding of the general topic of supervised learning for distributionally shifted data. In particular, our goal is to find an efficient method which is applicable to large-scale data and to establish generalization guarantees.
Our contribution in this work is threefold. Firstly, we propose an efficient weight estimator for which we can obtain good statistical guarantees without a requirement on the problem-dependent minimum sample complexity as necessary for BBSL. In the BBSL case, the estimation error can become arbitrarily large for small sample sizes. Secondly, we propose a novel regularization method to compensate for the high estimation error of the importance weights in low target sample settings. It explicitly controls the influence of our weight estimates when the target sample size is low (in the following referred to as the low sample regime). Finally, we derive a dimension-independent generalization bound for the final Regularized Learning under Label Shift (RLLS) classifier based on our weight estimator. In particular, our method improves the weight estimation error and excess risk of the classifier on reweighted samples by a factor of , where is the number of classes, i.e. the cardinality of .
In order to demonstrate the benefit of the proposed method for practical situations, we empirically study the performance of RLLS and show weight estimation as well as prediction accuracy comparison for a variety of shifts, sample sizes and regularization parameters on the CIFAR10 and MNIST datasets. For large target sample sizes and large shifts, when applying the regularized weights fully, we achieve an order of magnitude smaller weight estimation error than baseline methods and enjoy up to 20% higher accuracy and F1 score in corresponding predictive tasks. For low target sample sizes, applying regularized weights partially also yields an accuracy improvement of at least 10% over fully weighted and unweighted methods.
2 Regularized learning of label shifts (RLLS)
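Formally, let us use the shorthand for the marginal probability mass functions of $Y$ on the finite set $\mathcal{Y}$ with respect to $\mathbb{P}, \mathbb{Q}$ as $p, q : [k] \to [0,1]$ with $p(i) = \mathbb{P}(Y = i)$ and $q(i) = \mathbb{Q}(Y = i)$ for all $i \in [k]$, representable by vectors in $\mathbb{R}^k_+$ which sum to one. In the label shift setting, we define the importance weight vector $w \in \mathbb{R}^k$ between these two domains as $w(i) = \frac{q(i)}{p(i)}$. We quantify the shift using the exponent of the infinite and second order Renyi divergence as follows
$$d_\infty(q\|p) := \sup_i \frac{q(i)}{p(i)}, \qquad d(q\|p) := \sum_{i=1}^{k} \frac{q(i)^2}{p(i)}.$$
Given a hypothesis class $\mathcal{H}$ and a loss function $\ell$, our aim is to find the hypothesis $h \in \mathcal{H}$ which minimizes
$$\mathcal{L}(h) = \mathbb{E}_{X,Y \sim \mathbb{Q}}\left[\ell(Y, h(X))\right] = \mathbb{E}_{X,Y \sim \mathbb{P}}\left[w(Y)\,\ell(Y, h(X))\right].$$
In the usual finite sample setting however, $\mathcal{L}$ is unknown and we observe samples $\{(x_j, y_j)\}_{j=1}^{n}$ from $\mathbb{P}$ instead. If we are given the vector of importance weights $w$ we could then minimize the empirical loss with importance weighted samples defined as
$$\mathcal{L}_n(h) = \frac{1}{n} \sum_{j=1}^{n} w(y_j)\,\ell(y_j, h(x_j)),$$
where $n$ is the number of available observations drawn from $\mathbb{P}$ used to learn the classifier $h$. As $w$ is unknown in practice, we have to find the minimizer of the empirical loss with estimated importance weights
$$\mathcal{L}_n(h; \hat{w}) = \frac{1}{n} \sum_{j=1}^{n} \hat{w}(y_j)\,\ell(y_j, h(x_j)) \tag{1}$$
where $\hat{w}(i)$ are estimates of $w(i)$. Given a set of samples from the source distribution $\mathbb{P}$, we first divide it into two sets, where we use the samples in the first set to compute the estimate $\hat{w}$ and the remaining $n$ samples in the second set to find the classifier which minimizes the loss (1), i.e. $\hat{h}_{\hat{w}} = \arg\min_{h \in \mathcal{H}} \mathcal{L}_n(h; \hat{w})$. In the following, we describe how to estimate the weights $\hat{w}$ and provide guarantees for the resulting estimator $\hat{h}_{\hat{w}}$.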
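As a concrete illustration of the importance-weighted empirical loss in (1), the following minimal sketch evaluates it for a toy binary problem. The 0/1 loss, the label distributions and the helper names (`zero_one_loss`, `weighted_empirical_loss`) are assumptions made for this example and are not taken from the paper.

```python
from typing import Callable, Sequence


def zero_one_loss(y_true: int, y_pred: int) -> float:
    """0/1 loss: 0 for a correct prediction, 1 otherwise."""
    return 0.0 if y_true == y_pred else 1.0


def weighted_empirical_loss(
    labels: Sequence[int],
    predictions: Sequence[int],
    weights: Sequence[float],
    loss: Callable[[int, int], float] = zero_one_loss,
) -> float:
    """Average of weights[y_j] * loss(y_j, h(x_j)) over the source samples."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    total = sum(weights[y] * loss(y, y_hat) for y, y_hat in zip(labels, predictions))
    return total / len(labels)


# Hypothetical label distributions: balanced source, skewed target.
p = [0.5, 0.5]
q = [0.2, 0.8]
w = [q_i / p_i for q_i, p_i in zip(q, p)]  # importance weights q(i) / p(i)

# Source sample drawn in proportion to p, scored by a classifier
# that always predicts class 1.
labels = [0, 0, 1, 1]
predictions = [1, 1, 1, 1]

unweighted = weighted_empirical_loss(labels, predictions, [1.0, 1.0])
weighted = weighted_empirical_loss(labels, predictions, w)
print(unweighted, weighted)
```

In this toy case the unweighted loss is 0.5, the error rate on the source, whereas the weighted loss is 0.2, which matches the target error $q(0)$ of a classifier that always predicts class 1.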
Plug-in weight estimation
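The following simple correlation between the label distributions was noted in Lipton et al. (2018): for a fixed hypothesis $h$, if for all $y \in \mathcal{Y}$ it holds that $q(y) > 0 \implies p(y) > 0$, we have
$$q_h(i) := \mathbb{Q}(h(X) = i) = \sum_{j=1}^{k} \mathbb{P}(h(X) = i \mid Y = j)\, q(j) = \sum_{j=1}^{k} \mathbb{P}(h(X) = i, Y = j)\, w(j)$$
for all $i \in [k]$. This can equivalently be written in matrix vector notation as
$$q_h = C_h w, \tag{2}$$
where $C_h$ is the confusion matrix with $[C_h]_{i,j} = \mathbb{P}(h(X) = i, Y = j)$ and $q_h$ is the vector which represents the probability mass function of $h(X)$ under distribution $\mathbb{Q}$. The requirement $q(y) > 0 \implies p(y) > 0$ is a reasonable condition since without any prior knowledge, there is no way to properly reason about a class in the target domain that is not represented in the source domain.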
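The linear relation (2) can be checked numerically on a small population-level example. The sketch below assumes that the confusion matrix collects the joint probabilities of predicted and true labels on the source, and that the conditional distribution of predictions given the true label is shared by source and target (the label shift assumption). All numbers and names are hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


# Hypothetical problem with k = 3 classes.
p = [0.5, 0.3, 0.2]  # source label distribution
q = [0.2, 0.3, 0.5]  # target label distribution
w = [q_j / p_j for q_j, p_j in zip(q, p)]  # importance weights

# cond[i][j] = Prob(h(X) = i | Y = j); each column sums to one.
cond = [
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
]

# Source confusion matrix of joint probabilities: C[i][j] = P(h(X) = i, Y = j).
C = [[cond[i][j] * p[j] for j in range(3)] for i in range(3)]

# Distribution of the predictions on the target.
q_h = matvec(cond, q)

# Relation (2): the same vector is obtained as C w.
C_w = matvec(C, w)
print(q_h, C_w)
```

Both vectors agree up to floating-point error, which is what makes it possible to recover $w$ from quantities that only require labeled source data and unlabeled target data.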
In reality, both $C_h$ and $q_h$ can only be estimated by the corresponding finite sample averages $\hat{C}_h$ and $\hat{q}_h$. Lipton et al. (2018) simply compute the inverse of the estimated confusion matrix in order to estimate the importance weight, i.e. $\hat{w} = \hat{C}_h^{-1} \hat{q}_h$. While $C_h^{-1} \hat{q}_h$ is a statistically efficient estimator, $\hat{w}$ with estimated $\hat{C}_h^{-1}$ can be arbitrarily bad since $\hat{C}_h$ can be arbitrarily close to a singular matrix, especially for small sample sizes and a small minimum singular value of the confusion matrix. Intuitively, when there are very few samples, the weight estimation will have high variance, in which case it might be better to avoid importance weighting altogether. Furthermore, even when the sample complexity in Lipton et al. (2018), unknown in practice, is met, the resulting error of this estimator is linear in $k$, which is problematic for large $k$.
We therefore aim to address these shortcomings by proposing the following two-step procedure to compute importance weights. In the case of no shift we have $w = \mathbf{1}$, so that we define the amount of weight shift as $\theta = w - \mathbf{1}$. Given a “decent” black box estimator which we denote by $h_0$, we make the final classifier less sensitive to the estimation performance of $C$ (i.e. regularize the weight estimate) by

(i) calculating the measurement error adjusted $\hat{\theta}$ (described in Section 2.1 for $h_0$) and

(ii) computing the regularized weight $\hat{w} = \mathbf{1} + \lambda \hat{\theta}$, where $\lambda$ depends on the sample size.
By “decent” we refer to a classifier $h_0$ which yields a full rank confusion matrix $C_{h_0}$. A trivial example for a non-“decent” classifier $h_0$ is one that always outputs a fixed class. As it does not capture any characteristics of the data, there is no hope to gain any statistical information without any prior information.
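The sensitivity of the plain inverse to a nearly singular confusion matrix can be seen in a two-class toy example. The sketch below is only an illustration under assumed numbers: a symmetric binary classifier on a balanced source, no true label shift, and a perturbation of 0.01 in the estimated prediction distribution standing in for finite-sample noise. The helper names are hypothetical.

```python
def solve_2x2(matrix, rhs):
    """Solve a 2x2 linear system with the closed-form inverse."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]


def confusion_matrix(accuracy, p=(0.5, 0.5)):
    """Joint probabilities P(h(X)=i, Y=j) of a symmetric binary classifier."""
    return [
        [accuracy * p[0], (1.0 - accuracy) * p[1]],
        [(1.0 - accuracy) * p[0], accuracy * p[1]],
    ]


# No label shift: the true weights are (1, 1) and the true q_h is (0.5, 0.5).
# Suppose finite-sample noise perturbs the estimate of q_h by 0.01.
q_h_hat = [0.51, 0.49]

w_good = solve_2x2(confusion_matrix(0.90), q_h_hat)  # well-conditioned
w_poor = solve_2x2(confusion_matrix(0.51), q_h_hat)  # nearly singular
print(w_good, w_poor)
```

With the 90% accurate classifier the estimated weights are roughly (1.025, 0.975), whereas with the near-chance classifier the same perturbation yields roughly (2, 0), i.e. one class would be dropped from the weighted loss entirely.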
2.1 Estimator correcting for finite sample errors
Both the confusion matrix $C_{h_0}$ and the label distribution $q_{h_0}$ on the target for the black box hypothesis $h_0$ are unknown and we are instead only given access to finite sample estimates $\hat{C}_{h_0}, \hat{q}_{h_0}$. In what follows all empirical and population confusion matrices, as well as label distributions, are defined with respect to the hypothesis $h = h_0$. For notational simplicity, we thus drop the subscript $h_0$ in what follows. The reparameterized linear model (2) with respect to $\theta$ then reads
$$b := q - C\mathbf{1} = C\theta$$
with corresponding finite sample quantity $\hat{b} = \hat{q} - \hat{C}\mathbf{1}$. When $\hat{C}$ is near singular, the estimation of $\theta$ becomes unstable. Furthermore, large values in the true shift $\theta$ result in large variances. We address this problem by adding a regularizing $\ell_2$ penalty term to the usual loss and thus push the amount of shift towards $0$, a method that has been proposed in (Pires & Szepesvári, 2012). In particular, we compute
$$\hat{\theta} = \arg\min_{\theta} \|\hat{C}\theta - \hat{b}\|_2 + \Delta_C \|\theta\|_2 \tag{3}$$
Here, $\Delta_C$ is a parameter which will eventually be a high probability upper bound for $\|\hat{C} - C\|_2$. Let $\Delta_b$ also denote the high probability upper bound for $\|\hat{b} - b\|_2$.
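To make the effect of the penalty tangible, the following sketch minimizes an objective of the form suggested by the text, an unsquared 2-norm residual plus an unsquared 2-norm penalty on the shift. This form is an assumption for illustration, and a plain subgradient method is used here rather than any particular solver. The inputs `C_hat`, `b_hat` and all function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def objective(C, b, delta_c, theta):
    """||C theta - b||_2 + delta_c * ||theta||_2 (both norms unsquared)."""
    residual = [r - b_i for r, b_i in zip(matvec(C, theta), b)]
    return norm(residual) + delta_c * norm(theta)


def subgradient(C, b, delta_c, theta):
    """One valid subgradient of the objective (zero is used at the kinks)."""
    residual = [r - b_i for r, b_i in zip(matvec(C, theta), b)]
    r_norm, t_norm = norm(residual), norm(theta)
    grad = [0.0] * len(theta)
    for j in range(len(theta)):
        if r_norm > 0:
            grad[j] += sum(C[i][j] * residual[i] for i in range(len(b))) / r_norm
        if t_norm > 0:
            grad[j] += delta_c * theta[j] / t_norm
    return grad


def estimate_shift(C, b, delta_c, steps=2000, step0=0.5):
    """Subgradient descent with step0 / sqrt(t) steps; returns the best iterate."""
    theta = [0.0] * len(C[0])
    best, best_value = list(theta), objective(C, b, delta_c, theta)
    for t in range(1, steps + 1):
        grad = subgradient(C, b, delta_c, theta)
        step = step0 / math.sqrt(t)
        theta = [th - step * g for th, g in zip(theta, grad)]
        value = objective(C, b, delta_c, theta)
        if value < best_value:
            best, best_value = list(theta), value
    return best


# Toy inputs: confusion matrix of a 90% accurate binary classifier on a balanced
# source, and b = C theta for the hypothetical true shift theta = (-0.6, 0.6).
C_hat = [[0.45, 0.05], [0.05, 0.45]]
b_hat = [-0.24, 0.24]

theta_small_penalty = estimate_shift(C_hat, b_hat, delta_c=0.01)
theta_large_penalty = estimate_shift(C_hat, b_hat, delta_c=10.0)
print(theta_small_penalty, theta_large_penalty)
```

With a small penalty the estimate stays close to the shift that generated `b_hat`. Once the penalty exceeds the largest singular value of the confusion matrix, the minimizer is the zero vector, so the procedure falls back to no reweighting.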
Lemma 2.1
For $\hat{\theta}$ as defined in equation (3), we have with probability at least that
where
(Footnote 1: Throughout the paper, hides universal constant factors. Furthermore, we use for short to denote .)
The proof of this lemma can be found in Appendix B.1. A couple of remarks are in order at this point. First of all, notice that the weight estimation procedure (3) does not require a minimum sample complexity which is in the order of to obtain the guarantees for BBSL. This is due to the fact that errors in the covariates are accounted for. In order to directly see the improvements in the upper bound of Lemma 2.1 compared to Theorem 3 in Lipton et al. (2018), first observe that in order to obtain their upper bound with a probability of at least , it is necessary that . As a consequence, the upper bound in Theorem 3 of Lipton et al. (2018) is bigger than . Thus Lemma 2.1 improves upon the previous upper bound by a factor of .
Furthermore, as in Lipton et al. (2018), this result holds for any black box estimator $h_0$, which enters the bound via $\sigma_{\min}$, the smallest singular value of the confusion matrix. We can directly see how a good choice of $h_0$ helps to decrease the upper bound in Lemma 2.1. In particular, if $h_0$ is an ideal estimator and the source set is balanced, $C$ is the unit matrix scaled by $1/k$, with $\sigma_{\min} = 1/k$. In contrast, when the model is uncertain, the singular value $\sigma_{\min}$ is close to zero.
Moreover, for least squares problems with Gaussian measurement errors in both input and target variables, it is standard to use regularized total least squares approaches, which require a singular value decomposition. Finally, our choice for the alternative estimator in Eq. (3) with norm instead of norm squared regularization is motivated by the cases with large shifts $\theta$, where using the squared norm may shrink the estimate $\hat{\theta}$ too much and away from the true $\theta$.
2.2 Regularized estimator and generalization bound
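When only a few samples from the target set are available or the label shift is mild, the estimated weights might be too uncertain to be applied. We therefore propose a regularized estimator defined as follows
$$\hat{w} = \mathbf{1} + \lambda \hat{\theta}. \tag{4}$$
Note that $\hat{w}$ implicitly depends on , and . By rewriting $\hat{w} = (1 - \lambda)\mathbf{1} + \lambda(\mathbf{1} + \hat{\theta})$, we see that intuitively, the closer $\lambda$ is to $1$, the more reason there is to believe that $\mathbf{1} + \hat{\theta}$ is in fact the true weight.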
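The interpolation between unweighted and fully weighted samples can be sketched in a few lines. The snippet assumes the regularized weight takes the form of the all-ones vector plus a fraction of the estimated shift, with the fraction restricted to the interval from 0 to 1. The function name and the example shift are hypothetical.

```python
def regularized_weights(theta_hat, lam):
    """Return 1 + lam * theta_hat, a convex combination of the all-ones
    vector (lam = 0) and the fully estimated weights 1 + theta_hat (lam = 1)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is expected to lie in [0, 1]")
    return [1.0 + lam * t for t in theta_hat]


theta_hat = [-0.6, 0.6]  # hypothetical estimated shift for two classes

unweighted = regularized_weights(theta_hat, 0.0)
partially_weighted = regularized_weights(theta_hat, 0.5)
fully_weighted = regularized_weights(theta_hat, 1.0)
print(unweighted, partially_weighted, fully_weighted)
```

An intermediate value keeps the direction of the estimated shift but halves its magnitude, which limits the damage if the estimate is noisy.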
Define the set and its Rademacher complexity measure
with as the Rademacher random variables (see e.g. Bartlett & Mendelson (2002)). We can now state a generalization bound for the classifier $\hat{h}_{\hat{w}}$ in a general hypothesis class $\mathcal{H}$, which is trained on source data with the estimated weights defined in equation (4).
Theorem 2.2 (Generalization bound for $\hat{h}_{\hat{w}}$)
Given samples from the source data set and samples from the target set, a hypothesis class $\mathcal{H}$ and loss function $\ell$, the following generalization bound holds with probability at least
(5) 
where
The proof can be found in Appendix B.4. Additionally, we derive the analysis also for finite hypothesis classes in Appendix B.6 to provide more insight into the proof of general hypothesis classes. The size of the Rademacher complexity is determined by the structure of the function class $\mathcal{H}$ and the loss $\ell$. For example, for the 0/1 loss, the VC dimension of $\mathcal{H}$ can be deployed to upper bound the Rademacher complexity.
The bound (5) in Theorem 2.2 holds for all choices of $\lambda$. In order to exploit the possibility of choosing $\lambda$ and to have an improved accuracy depending on the sample sizes, we first let the user define a set of shifts against which we want to be robust, i.e. all shifts with . For these shifts, we obtain the following upper bound
(6)
The bound in equation (6) suggests using Algorithm 1 as our ultimate label shift correction procedure, where for step 2 of the algorithm, we choose whenever (hereby neglecting the log factors and thus dependencies on ) and else. When using this rule, we obtain which is smaller than the unregularized bound for small . Notice that in practice, we do not know in advance so that in Algorithm 1 we need to use an estimate of , which could e.g. be the minimum eigenvalue of the empirical confusion matrix with an additional computational complexity of at most .
Figure 1 shows how the oracle thresholds vary with and when is kept fixed. When the parameters are above the curves for fixed , $\lambda$ should be chosen as $1$; otherwise the samples should be unweighted, i.e. $\lambda = 0$. This figure illustrates that when the confusion matrix has small singular values, the estimated weights should only be trusted for rather high and high believed shifts . Although the overall statistical rate of the excess risk of the classifier does not change as a function of the sample sizes, could be significantly smaller than when is very small and thus the accuracy in this regime could improve. Indeed we observe this to be the case empirically in Section 3.3.
In the case of slight deviation from the label shift setting, we expect Algorithm 1 to perform reasonably. For as the deviation from the label shift constraint, i.e., zero under the label shift assumption, we have:
Theorem (Drift in Label shift assumption)
In the presence of deviation from the label shift assumption, the true importance weights , the RLLS generalizes as:
with high probability. Proof in Appendix B.7.
3 Experiments
In this section we illustrate the theoretical analysis by running RLLS on a variety of artificially generated shifts on the MNIST (LeCun & Cortes, 2010) and CIFAR10 (Krizhevsky & Hinton, 2009) datasets. We first randomly separate the entire dataset into two sets (source and target pool) of the same size. Then we sample, unless specified otherwise, the same number of data points from each pool to form the source and target set respectively. We chose to have equal sample sizes to allow for fair comparisons across shifts.
There are various kinds of shifts which we consider in our experiments. In general we assume one of the source or target datasets to have uniform distribution over the labels. Within the non-uniform set, we consider three types of sampling strategies in the main text: the Tweak-One shift refers to the case where we set a class to have probability , while the distribution over the rest of the classes is uniform. The Minority-Class shift is a more general version of Tweak-One shift, where a fixed number of classes are set to have probability , while the distribution over the rest of the classes is uniform. For the Dirichlet shift, we draw a probability vector from the Dirichlet distribution with concentration parameter set to for all classes, before including sample points which correspond to the multinomial label variable according to . Results for the tweak-one shift strategy as in Lipton et al. (2018) can be found in Section A.0.1.
After artificially shifting the label distribution in one of the source and target sets, we then follow Algorithm 1, where we choose the black box predictor $h_0$ to be a two-layer fully connected neural network trained on the (shifted) source dataset. Note that any black box predictor could be employed here, though the higher the accuracy, the more likely weight estimation will be precise. Therefore, we use different shifted source data to get (corrupted) black box predictors across experiments. If not noted otherwise, $h_0$ is trained using uniform data.
In order to compute $\hat{\theta}$ in Eq. (3), we call a built-in solver to directly solve the low dimensional problem, where we empirically observe that times of the true yields a better estimator on various levels of label shift pre-computed beforehand. It is worth noting that makes the theoretical bound in Lemma 2.1 times bigger. We thus treat it as a hyperparameter that can be chosen using standard cross validation methods. Finally, we train a classifier on the source samples weighted by $\hat{w}$, where we use a two-layer fully connected neural network for MNIST and a ResNet-18 (He et al., 2016) for CIFAR10.
We sample 20 datasets with the label distributions for each shift parameter to evaluate the empirical mean square estimation error (MSE) and variance of the estimated weights and the predictive accuracy on the target set. We use these measures to compare our procedure with the black box shift learning method (BBSL) in Lipton et al. (2018). Notice that although KMM methods (Zhang et al., 2013) would be another standard baseline to compare with, they are not scalable to large sample size regimes for above as mentioned by Lipton et al. (2018).
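The three shift types described above can be sketched as generators of label distributions, followed by sampling labels from them. This is a minimal illustration using only the Python standard library. Which classes are tweaked, the parameter values in the usage lines and all function names are assumptions of this example, not the experimental code.

```python
import random


def tweak_one_shift(k, rho, tweaked_class=0):
    """One class gets probability rho; the other k - 1 classes share 1 - rho uniformly."""
    rest = (1.0 - rho) / (k - 1)
    return [rho if i == tweaked_class else rest for i in range(k)]


def minority_class_shift(k, num_minority, rho):
    """The first num_minority classes get probability rho each;
    the remaining classes share the rest of the mass uniformly."""
    if not 0 < num_minority < k or num_minority * rho >= 1.0:
        raise ValueError("invalid number of minority classes or probability")
    rest = (1.0 - num_minority * rho) / (k - num_minority)
    return [rho if i < num_minority else rest for i in range(k)]


def dirichlet_shift(k, alpha, rng):
    """Draw a label distribution from a symmetric Dirichlet(alpha)
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_labels(distribution, n, rng):
    """Draw n labels i.i.d. from the given label distribution."""
    return rng.choices(range(len(distribution)), weights=distribution, k=n)


rng = random.Random(0)
tweak = tweak_one_shift(10, 0.5)
minority = minority_class_shift(10, 5, 0.02)
dirichlet = dirichlet_shift(10, 0.5, rng)
labels = sample_labels(minority, 1000, rng)
print(tweak, minority, dirichlet, labels[:10])
```

To build a shifted dataset one would then pick, for each sampled label, a data point of that class from the corresponding pool.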
3.1 Weight Estimation and predictive performance for source shift
In this set of experiments on the CIFAR10 dataset, we illustrate our weight estimation and prediction performance for Tweak-One source shifts and compare it with BBSL. For this set of experiments, we set the number of data points in both source and target set to and sample from the two pools without replacement.
Figure 2 illustrates the weight estimation alongside final classification performance for Minority-Class source shift of CIFAR10. We created shifts with . We use a fixed black-box classifier that is trained on biased source data, with tweak-one . Observe that the MSE in weight estimation is relatively large and RLLS outperforms BBSL as the number of minority classes increases. As the shift increases the performance for all methods deteriorates. Furthermore, Figure 2 (b) illustrates how the advantage of RLLS over the unweighted classifier increases as the shift increases. Across all shifts, the RLLS based classifier yields higher accuracy than the one based on BBSL. Results for MNIST can be found in Section A.1.
3.2 Weight estimation and predictive performance for target shift
In this section, we compare the predictive performances between a classifier trained on unweighted source data and the classifiers trained on weighted loss obtained by the RLLS and BBSL procedure on CIFAR10. The target set is shifted using the Dirichlet shift with parameters . The number of data points in both source and target set is .
In the case of target shifts, larger shifts actually make the predictive task easier, such that even a constant majority class vote would give high accuracy. However, it would have zero accuracy on all but one class. Therefore, in order to allow for a more comprehensive performance comparison between the methods, we also compute the macro-averaged F1 score by averaging the per-class quantity $2(\text{precision} \cdot \text{recall})/(\text{precision} + \text{recall})$ over all classes. For a class $i$, precision is the percentage of correct predictions among all samples predicted to have label $i$, while recall is the proportion of correctly predicted labels over the number of samples with true label $i$. This measure gives higher weight to the accuracies of minority classes which have no effect on the total accuracy.
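The contrast between accuracy and macro-averaged F1 under a constant majority vote can be reproduced with a short sketch. The convention of scoring a class as 0 when its precision or recall is undefined is an assumption of this example, as are the function name and the toy labels.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean over classes of 2 * precision * recall / (precision + recall).
    Undefined precision or recall (empty denominators) is counted as 0."""
    scores = []
    for c in range(num_classes):
        true_positive = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        actual = sum(1 for t in y_true if t == c)
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denominator = precision + recall
        scores.append(2 * precision * recall / denominator if denominator else 0.0)
    return sum(scores) / num_classes


# Heavily shifted target: 9 samples of class 0, 1 sample of class 1,
# scored by a constant majority vote for class 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
score = macro_f1(y_true, y_pred, num_classes=2)
print(accuracy, score)
```

Here the accuracy is 0.9 while the macro-averaged F1 score is only about 0.47, since the minority class contributes a per-class score of zero.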
Figure 3 depicts the MSE of the weight estimation (a), the corresponding performance comparison on accuracy (b) and F1 score (c). Recall that the accuracy performance for low shifts is not comparable with standard CIFAR10 benchmark results because of the overall lower sample size chosen for the comparability between shifts. We can see that in the large target shift case for , the F1 score for BBSL and the unweighted classifier is rather low compared to RLLS while the accuracy is high. As mentioned before, the reason for this observation and why in Figure 3 (b) the accuracy is higher when the shift is larger, is that the predictive task actually becomes easier with higher shift.
3.3 Regularized weights in the low sample regime for source shift
In the following, we present the average accuracy of RLLS in Figure 4 as a function of the number of target samples for different values of for small . Here we fix the sample size in the source set to and investigate a Minority-Class source shift with fixed and five minority classes.
A motivation to use intermediate is discussed in Section 2.2, as in equation (4) may be chosen according to . In practice, since is just an upper bound on the true amount of shift , in some cases should in fact ideally be when . Thus for target sample sizes that are a little bit above the threshold (depending on the certainty of the belief how close to the norm of the shift is believed to be), it could be sensible to use an intermediate value .
Figure 4 suggests that unweighted samples (red) yield the best classifier for very few samples , while for an intermediate (purple) has the highest accuracy and for , the weight estimation is certain enough for the fully weighted classifier (yellow) to have the best performance (see also the corresponding data points in Figure 2). The unweighted BBSL classifier is also shown for completeness. We can conclude that regularizing the influence of the estimated weights allows us to adjust to the uncertainty on importance weights and generalize well for a wide range of target sample sizes.
Furthermore, the different plots in Figure 4 correspond to black-box predictors for weight estimation which are trained on more or less corrupted data, i.e. have a better or worse conditioned confusion matrix. The fully weighted methods with achieve the best performance faster with a better trained black-box classifier (a), while it takes longer for it to improve with a corrupted one (c). Furthermore, this reflects the relation between the eigenvalue of the confusion matrix and the target sample size in Theorem 2.2. In other words, we need more samples from the target data to compensate for a bad predictor in weight estimation. So the generalization error decreases faster with an increasing number of samples for good predictors.
In summary, our RLLS method outperforms BBSL in all settings for the common image datasets MNIST and CIFAR10 to varying degrees. In general, significant improvements compared to BBSL can be observed for large shifts and the low sample regime. A note of caution is in order: comparison between the two methods alone might not always be meaningful. In particular, there are cases when the estimator trained on unweighted samples outperforms both RLLS and BBSL. Our extensive experiments for many different shifts, black box classifiers and sample sizes do not allow for a final conclusive statement about how weighting samples using our estimator affects predictive results for real-world data in general, as it usually does not fulfill the label-shift assumptions.
4 Related Work
The covariate and label shift assumptions follow naturally when viewing the data generating process as a causal or anticausal model (Schölkopf et al., 2012): With label shift, the label $y$ causes the input $x$ (that is, $x$ is not a causal parent of $y$, hence "anticausal") and the causal mechanism that generates $x$ from $y$ is independent of the distribution of $y$. A long line of work has addressed the reverse causal setting where $x$ causes $y$ and the conditional distribution of $y$ given $x$ is assumed to be constant. This assumption is sensible when there is reason to believe that there is a true optimal mapping from $x$ to $y$ which does not change if the distribution of $x$ changes. Mathematically this scenario corresponds to the covariate shift assumption.
Among the various methods to correct for covariate shift, the majority uses the concept of importance weights (Zadrozny, 2004; Cortes et al., 2010; Cortes & Mohri, 2014; Shimodaira, 2000), which are unknown but can be estimated for example via kernel embeddings (Huang et al., 2007; Gretton et al., 2009, 2012; Zhang et al., 2013; Zaremba et al., 2013) or by learning a binary discriminative classifier between source and target (Lopez-Paz & Oquab, 2016; Liu et al., 2017). A minimax approach that aims to be robust to the worst-case shared conditional label distribution between source and target has also been investigated (Liu & Ziebart, 2014; Chen et al., 2016). Sanderson & Scott (2014); Ramaswamy et al. (2016) formulate the label shift problem as a mixture of the class conditional covariate distributions with unknown mixture weights. Under the pairwise mutual irreducibility (Scott et al., 2013) assumption on the class conditional covariate distributions, they deploy the Neyman-Pearson criterion (Blanchard et al., 2010) to estimate the class distribution, which is also investigated in the maximum mean discrepancy framework (Iyer et al., 2014).
A common issue shared by these methods is that they either result in a massive computational burden for large sample size problems or cannot be deployed for neural networks. Furthermore, importance weighting methods such as (Shimodaira, 2000) estimate the density (ratio) beforehand, which is a difficult task on its own when the data is high-dimensional. The resulting generalization bounds based on importance weighting methods require the second order moments of the density ratio to be bounded, which means the bounds are extremely loose in most cases (Cortes et al., 2010).
Despite the wide applicability of label shift, approaches with global guarantees in high dimensional data regimes remain underexplored. The correction of label shift mainly requires estimating the importance weights over the labels, which typically live in a very low-dimensional space. Bayesian and probabilistic approaches are studied when a prior over the marginal label distribution is assumed (Storkey, 2009; Chan & Ng, 2005). These methods often need to explicitly compute the posterior distribution of $y$ and suffer from the curse of dimensionality. Recent advances as in Lipton et al. (2018) have proposed solutions applicable to large-scale data. This approach is related to Buck et al. (1966); Forman (2008); Saerens et al. (2002) in the low dimensional setting but lacks guarantees for the excess risk.
Existing generalization bounds have historically been mainly developed for the case when $\mathbb{P} = \mathbb{Q}$ (see e.g. Vapnik (1999); Bartlett & Mendelson (2002); Kakade et al. (2009); Wainwright (2019)). Ben-David et al. (2010) provide theoretical analysis and generalization guarantees for distribution shifts when the $\mathcal{H}$-divergence between joint distributions is considered, whereas Crammer et al. (2008) prove generalization bounds for learning from multiple sources. For the covariate shift setting, Cortes et al. (2010) provide a generalization bound when is known which however does not apply in practice. To the best of our knowledge our work is the first to give generalization bounds for the label shift scenario.
5 Discussion
In this work, we establish the first generalization guarantee for the label shift setting and propose an importance weighting procedure for which no prior knowledge of is required. Although RLLS is inspired by BBSL, it leads to a more robust importance weight estimator as well as generalization guarantees in particular for the small sample regime, which BBSL does not allow for. RLLS is also equipped with a sample-size-dependent regularization technique and further improves the classifier in both regimes.
We consider this work a necessary step in the direction of solving shifts of this type, although the label shift assumption itself might be too simplified in the real world. In future work, we plan to also study the setting when it is slightly violated. For instance, in practice $x$ cannot be solely explained by the wanted label $y$, but may also depend on attributes $z$ which might not be observable. In the disease prediction task for example, the symptoms might not only depend on the disease but also on the city and living conditions of its population. In such a case, the label shift assumption only holds in a slightly modified sense, i.e. $p(x|y,z) = q(x|y,z)$. If the attributes $z$ are observed, then our framework can readily be used to perform importance weighting.
Furthermore, it is not clear whether the final predictor is in fact “better” or more robust to shifts just because it achieves a better target accuracy than a vanilla unweighted estimator. In fact, there is a reason to believe that under certain shift scenarios, the predictor might learn to use spurious correlations to boost accuracy. Finding a procedure which can both learn a robust model and achieve high accuracies on new target sets remains an ongoing challenge. Moreover, the current choice of regularization depends on the number of samples rather than on the data itself, although a data-driven regularization would be more desirable.
An important direction towards active learning for the same disease-symptoms scenario is when we also have an expert for diagnosing a limited number of patients in the target location. Now the question is which patients would be most "useful" to diagnose to obtain a high accuracy on the entire target set? Furthermore, in the case of high risk, we might be able to choose some of the patients for further medical diagnosis or treatment, up to some varying cost. We plan to extend the current framework to the active learning setting where we actively query the label of certain $x$’s (Beygelzimer et al., 2009) as well as the cost-sensitive setting where we also consider the cost of querying labels (Krishnamurthy et al., 2017).
Consider a realizable and overparameterized setting, where there exists a deterministic mapping from $x$ to $y$, and also suppose a perfect interpolation of the source data with a minimum proper norm is desired. In this case, weighting the samples in the empirical loss might not alter the trained classifier (Belkin et al., 2018). Therefore, our results might not directly help the design of better classifiers in this particular regime. However, for the general overparameterized settings, it remains an open problem how importance weighting can improve the generalization. We leave this study for future work.
6 Acknowledgement
K. Azizzadenesheli is supported in part by NSF Career Award CCF1254106 and Air Force FA95501510221. This research has been conducted when the first author was a visiting researcher at Caltech. Anqi Liu is supported in part by DOLCIT Postdoctoral Fellowship at Caltech and Caltech’s Center for Autonomous Systems and Technologies. Fan Yang is supported by the Institute for Theoretical Studies ETH Zurich and the Dr. Max Rössler and the Walter Haefner Foundation. A. Anandkumar is supported in part by Microsoft Faculty Fellowship, Google faculty award, Adobe grant, NSF Career Award CCF 1254106, and AFOSR YIP FA95501510221.
References

 Anandkumar et al. (2012) Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, pp. 33–1, 2012.
 Azizzadenesheli et al. (2016) Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Reinforcement learning of POMDPs using spectral methods. arXiv preprint arXiv:1602.07764, 2016.
 Bartlett & Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
 Belkin et al. (2018) Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
 Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
 Beygelzimer et al. (2009) Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the 26th annual international conference on machine learning, pp. 49–56. ACM, 2009.

 Blanchard et al. (2010) Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.
 Buck et al. (1966) AA Buck, JJ Gart, et al. Comparison of a screening test and a reference test in epidemiologic studies. ii. a probabilistic model for the comparison of diagnostic tests. American Journal of Epidemiology, 83(3):593–602, 1966.
 Chan & Ng (2005) Yee Seng Chan and Hwee Tou Ng. Word sense disambiguation with distribution estimation. In IJCAI, volume 5, pp. 1010–5, 2005.
 Chen et al. (2016) Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Artificial Intelligence and Statistics, pp. 1270–1279, 2016.
 Cortes & Mohri (2014) Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
 Cortes et al. (2010) Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in neural information processing systems, pp. 442–450, 2010.
 Crammer et al. (2008) Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.
 Forman (2008) George Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164–206, 2008.
 Freedman (1975) David A Freedman. On tail probabilities for martingales. the Annals of Probability, pp. 100–118, 1975.
 Gretton et al. (2009) Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. 2009.
 Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Hsu et al. (2012) Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
 Huang et al. (2007) Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pp. 601–608, 2007.
 Iyer et al. (2014) Arun Iyer, Saketha Nath, and Sunita Sarawagi. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In International Conference on Machine Learning, pp. 530–538, 2014.
 Kakade et al. (2009) Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pp. 793–800, 2009.
 Krishnamurthy et al. (2017) Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daume III, and John Langford. Active learning for cost-sensitive classification. arXiv preprint arXiv:1703.01014, 2017.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
 Lipton et al. (2018) Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018.
 Liu & Ziebart (2014) Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in neural information processing systems, pp. 37–45, 2014.
 Liu et al. (2017) Song Liu, Akiko Takeda, Taiji Suzuki, and Kenji Fukumizu. Trimmed density ratio estimation. In Advances in Neural Information Processing Systems, pp. 4518–4528, 2017.
 Lopez-Paz & Oquab (2016) David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
 McLachlan (2004) Geoffrey McLachlan. Discriminant analysis and statistical pattern recognition, volume 544. John Wiley & Sons, 2004.
 Pires & Szepesvári (2012) Bernardo Avila Pires and Csaba Szepesvári. Statistical linear estimation with penalized estimators: an application to reinforcement learning. arXiv preprint arXiv:1206.6444, 2012.
 Ramaswamy et al. (2016) Harish Ramaswamy, Clayton Scott, and Ambuj Tewari. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, pp. 2052–2060, 2016.
 Saerens et al. (2002) Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.
 Sanderson & Scott (2014) Tyler Sanderson and Clayton Scott. Class proportion estimation with application to multiclass anomaly rejection. In Artificial Intelligence and Statistics, pp. 850–858, 2014.
 Schölkopf et al. (2012) Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.
 Scott et al. (2013) Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In Conference On Learning Theory, pp. 489–511, 2013.
 Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
 Storkey (2009) Amos Storkey. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, pp. 3–28, 2009.
 Tropp (2012) Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
 Vapnik (1999) Vladimir Naumovich Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999, 1999.
 Wainwright (2019) M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
 Ying (2004) Yiming Ying. McDiarmid’s inequalities of Bernstein and Bennett forms. City University of Hong Kong, 2004.
 Zadrozny (2004) Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning, pp. 114. ACM, 2004.
 Zaremba et al. (2013) Wojciech Zaremba, Arthur Gretton, and Matthew Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In Advances in neural information processing systems, pp. 755–763, 2013.
 Zhang et al. (2013) Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pp. 819–827, 2013.
Appendix A More experimental results
This section contains additional experiments that provide more insight into the settings in which the advantage of RLLS over BBSL is more or less pronounced.
a.0.1 CIFAR10 Experiments under tweak-one shift and Dirichlet shift
Here we compare the weight estimation performance of RLLS and BBSL for different types of shifts, including the tweak-one shift, for which we randomly choose one class and set its probability to a fixed value while all other classes are distributed evenly. Figure 5 depicts the weight estimation performance of RLLS compared to BBSL for a variety of values of the shift parameters. Note that larger shifts correspond to a smaller Dirichlet parameter and a larger tweak-one probability. In general, one observes that our RLLS estimator has smaller MSE and that, as the shift increases, the error of both methods increases. For the tweak-one shift we can additionally see that, as the shift increases, RLLS increasingly outperforms BBSL in terms of both bias and variance.
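As an illustration of the two shift types compared above, the following is a minimal sketch of how such label distributions could be generated and used to resample a labeled data set. The function names, the symmetric Dirichlet parameterization, and the example values (10 classes, tweaked probability 0.5, concentration 0.1) are assumptions made for this sketch and are not taken from the experimental code.

```python
import numpy as np

def tweak_one_distribution(num_classes, tweaked_class, prob):
    """Put mass `prob` on one class and spread the rest evenly (assumed form)."""
    q = np.full(num_classes, (1.0 - prob) / (num_classes - 1))
    q[tweaked_class] = prob
    return q

def dirichlet_distribution(num_classes, alpha, rng):
    """Draw a label distribution from a symmetric Dirichlet with concentration alpha."""
    return rng.dirichlet(np.full(num_classes, alpha))

def resample_by_label(labels, label_dist, size, rng):
    """Sample indices so that the resampled labels follow `label_dist`."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=len(label_dist))
    # Each sample receives its class mass divided evenly within the class.
    p = label_dist[labels] / counts[labels]
    return rng.choice(len(labels), size=size, replace=True, p=p / p.sum())

rng = np.random.default_rng(0)
q_tweak = tweak_one_distribution(10, tweaked_class=3, prob=0.5)
q_dir = dirichlet_distribution(10, alpha=0.1, rng=rng)  # small alpha: skewed labels
labels = rng.integers(0, 10, size=5000)                 # synthetic stand-in labels
idx = resample_by_label(labels, q_tweak, size=2000, rng=rng)
```

In this sketch a smaller `alpha` tends to concentrate the Dirichlet draw on a few classes, and a larger `prob` moves the tweak-one distribution further from uniform.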
a.1 MNIST Experiments under Minority-Class source shifts for different values of p
In order to show weight estimation and classification performance under different levels of label shift, we include several additional sets of experiments here in the appendix. Figure 6 shows the weight estimation error and accuracy comparison under a minority-class shift with p = 0.001. The training and testing sample size is 10000 examples in this case. We can see that whenever the weight estimation of RLLS is better, the accuracy is also better, except in the four-class case, where both methods estimate the weights poorly.
Figure 7 shows another minority-class shift setting with a different value of p. The black-box classifier is the same two-layer neural network trained on a biased source data set with a tweak-one shift. We observe that when the number of minority classes is small, such as 1 or 2, the two methods perform similarly in both weight estimation and classification accuracy. But when the shift gets larger, the weight estimates become worse and the accuracy decreases, even falling below that of the unweighted classifier.
Figure 8 illustrates the weight estimation alongside final classification performance for the Minority-Class source shift of MNIST. We use training and testing data. We created large shifts with three or more minority classes. We use a fixed black-box classifier that is trained on biased source data with a tweak-one shift. Observe that the MSE in weight estimation is relatively large and that RLLS outperforms BBSL as the number of minority classes increases. As the shift increases, the performance of all methods deteriorates. Furthermore, Figure 8 (b) illustrates how the advantage of RLLS over the unweighted classifier increases as the shift increases. Across all shifts, the RLLS-based classifier yields higher accuracy than the one based on BBSL.
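To make the scale of a minority-class source shift concrete, the sketch below builds a source label distribution in which a few classes have probability p = 0.001 and computes the resulting true importance weights against a uniform target. The even split of the remaining mass, the uniform target, and all function names are assumptions of this sketch.

```python
import numpy as np

def minority_class_distribution(num_classes, minority_classes, p):
    """Give each minority class mass p; split the remainder evenly (assumed form)."""
    num_minority = len(minority_classes)
    rest = (1.0 - num_minority * p) / (num_classes - num_minority)
    q = np.full(num_classes, rest)
    q[list(minority_classes)] = p
    return q

def true_importance_weights(source_dist, target_dist):
    """Class-wise ratio of target to source label probabilities."""
    return np.asarray(target_dist) / np.asarray(source_dist)

def weight_mse(w_hat, w_true):
    """Mean squared error between estimated and true weight vectors."""
    return float(np.mean((np.asarray(w_hat) - np.asarray(w_true)) ** 2))

num_classes = 10
source = minority_class_distribution(num_classes, minority_classes=[0, 1, 2], p=0.001)
target = np.full(num_classes, 1.0 / num_classes)  # assumed uniform target
weights = true_importance_weights(source, target)
```

With these assumed values the minority classes receive weight 100 while every other class receives a weight below 1, which suggests why small estimation errors on minority classes can dominate the MSE as the number of minority classes grows.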
a.2 CIFAR10 Experiment under Dirichlet source shifts
Figure 9 illustrates the weight estimation alongside final classification performance for the Dirichlet source shift of the CIFAR10 dataset. We use training and testing data in this experiment, following the way we generate shift on source data. We train with tweak-one shifted source data. The results show that importance weighting in general does not help classification in this relatively large shift case, because the weighted methods, with either true or estimated weights, achieve accuracy similar to the unweighted method.
a.3 MNIST Experiment under Dirichlet Shift with low target sample size
We show the performance of the classifier with different regularization under a Dirichlet shift in Figure 10. The training set has 5000 examples in this case. We can see that in this low target sample case, the fully weighted version only takes over after several hundred examples, while some regularization value between 0 and 1 outperforms it at the beginning. As in the main paper, we use black-box classifiers that are corrupted to different degrees to show the relation between the quality of the black-box predictor and the necessary target sample size. We use biased source data with a tweak-one shift to train the black-box classifier. We see that, for a more corrupted black-box classifier, more target samples are needed for the fully weighted version to take over.
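The trade-off between the unweighted and the fully weighted classifier can be sketched as an interpolation of the class weights. The form used here, with a parameter `lam` in [0, 1] shrinking the estimated weights towards 1, is an assumption of this sketch, as are the function names and example numbers.

```python
import numpy as np

def shrink_weights(w_hat, lam):
    """Interpolate between no weighting (lam=0) and full weighting (lam=1)."""
    return 1.0 + lam * (np.asarray(w_hat, dtype=float) - 1.0)

def weighted_loss(per_sample_loss, labels, class_weights):
    """Importance-weighted empirical loss using the weight of each sample's label."""
    per_sample_loss = np.asarray(per_sample_loss, dtype=float)
    return float(np.mean(class_weights[np.asarray(labels)] * per_sample_loss))

w_hat = np.array([2.0, 0.5, 0.5])        # hypothetical estimated class weights
losses = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical per-sample source losses
labels = np.array([0, 1, 2, 0])
half = shrink_weights(w_hat, lam=0.5)
loss_half = weighted_loss(losses, labels, half)
```

When few target samples are available the estimated weights are noisy, so an intermediate `lam` keeps the loss closer to the unweighted one, which is in line with the behaviour described for Figure 10.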
Appendix B Proofs
b.1 Proof of Lemma 2.1
From Thm. 3.4 in (Pires & Szepesvári, 2012) we know that for as defined in equation (3), if with probability at least , and hold simultaneously, then
(7) 
where we use the shorthand .
Since the right-hand side of (7) is an infimum, we can upper bound it by simply choosing a feasible . We then have and hence
as a consequence,
Since by definition of the minimum singular value, we thus have
Let us first notice that
The mathematical definitions of the finite sample estimates (in matrix and vector representation) with respect to some hypothesis are as follows
where and is the indicator function. can equivalently be expressed with the population over for and over for respectively. We now use the following concentration Lemmas to bound the estimation errors of where we drop the subscript for ease of notation.
Lemma (Concentration of measurement matrix )
For finite sample estimate we have
with probability at least .
Lemma (Concentration of label measurements)
For the finite sample estimate with respect to any hypothesis it holds that
with probability at least .
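The finite sample estimates appearing in the two lemmas can be sketched in code. The sketch assumes the standard black-box setup: the matrix estimate is the empirical joint frequency of predicted and true labels on source data, and the vector estimate is the target prediction frequency minus the source prediction frequency. A squared-norm ridge solve is swapped in for the estimator of equation (3) purely for illustration; all names are hypothetical.

```python
import numpy as np

def confusion_estimate(source_preds, source_labels, k):
    """Empirical joint frequency: C[i, j] = mean of 1[h(x) = i, y = j] on source."""
    C = np.zeros((k, k))
    np.add.at(C, (np.asarray(source_preds), np.asarray(source_labels)), 1.0)
    return C / len(source_labels)

def prediction_frequency(preds, k):
    """Empirical frequency of each predicted class: mean of 1[h(x) = i]."""
    return np.bincount(np.asarray(preds), minlength=k) / len(preds)

def estimate_shift(C_hat, b_hat, ridge):
    """Ridge least-squares solve for theta in C theta = b (illustrative substitute)."""
    k = C_hat.shape[0]
    return np.linalg.solve(C_hat.T @ C_hat + ridge * np.eye(k), C_hat.T @ b_hat)

k = 3
source_labels = np.array([0, 0, 1, 1, 2, 2])
source_preds = source_labels.copy()      # idealized perfect black-box predictor
target_preds = np.array([0, 0, 1, 2])    # target prediction frequencies (0.5, 0.25, 0.25)
C_hat = confusion_estimate(source_preds, source_labels, k)
b_hat = prediction_frequency(target_preds, k) - prediction_frequency(source_preds, k)
theta_hat = estimate_shift(C_hat, b_hat, ridge=1e-12)
w_hat = 1.0 + theta_hat                  # assumed relation between theta and the weights
```

In this idealized example the recovered weights equal the ratio of target to source label frequencies; with a noisy predictor and finite samples, the errors in `C_hat` and `b_hat` are what the two concentration lemmas control.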
b.2 Proof of Lemma b.1
We prove this lemma using Theorem 1.4 (Matrix Bernstein) and the dilation technique from Tropp (2012). We can rewrite where
is the one-hot encoding of index
. Consider a finite sequence of independent random matrices with dimension . By dilation, let us construct another sequence of self-adjoint random matrices of dimension , such that for all
therefore,
(8) 
which results in . The dilation technique translates the initial sequence of random matrices into a sequence of random self-adjoint matrices, to which we can apply the Matrix Bernstein theorem. It states that, for a finite sequence of i.i.d. self-adjoint matrices , such that almost surely and , then for all ,
with probability at least where which is also due to Eq. 8. Therefore, thanks to the dilation trick and Theorem 1.4 (Matrix Bernstein) in Tropp (2012),
with probability at least .
Now, by plugging in , we have . Together with as well as and setting , we have
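The dilation step used in this proof can be checked numerically. The sketch below builds the self-adjoint dilation of a rectangular matrix and confirms that it preserves the spectral norm, which is the property that lets a bound for self-adjoint matrices transfer to the original sequence. The function name and the random test matrix are illustrative only.

```python
import numpy as np

def hermitian_dilation(A):
    """Return the self-adjoint block matrix [[0, A], [A^*, 0]]."""
    d1, d2 = A.shape
    return np.block([[np.zeros((d1, d1)), A],
                     [A.conj().T, np.zeros((d2, d2))]])

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))          # arbitrary rectangular matrix
H = hermitian_dilation(A)
norm_A = np.linalg.norm(A, 2)            # largest singular value of A
norm_H = np.linalg.norm(H, 2)            # largest singular value of the dilation
```

The square of the dilation is block diagonal with blocks `A @ A.T` and `A.T @ A`, which is why the two spectral norms coincide.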
b.3 Proof of Lemma b.1
The proof of this lemma is mainly based on a special case of and appeared in Proposition 6 in Azizzadenesheli et al. (2016), Lemma F.1 in Anandkumar et al. (2012) and Proposition 19 of Hsu et al. (2012).
Analogous to the previous section we can rewrite where is the one-hot encoding of index . Note that (dropping the subscript ) we have
We now bound both estimates of probability vectors separately.
Consider a fixed multinomial distribution characterized by a probability vector of where is a dimensional simplex. Further, consider realizations of this multinomial distribution where is the one-hot encoding of the ’th sample. Consider the empirical estimate of the mean of this distribution through the empirical average of the samples; , then
with probability at least .
By plugging in , with and finally and equivalently for we obtain
with probability at least , therefore
resulting in the statement of Lemma B.1.
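The concentration of an empirical probability vector around its mean can be illustrated by simulation. The sketch below averages one-hot encoded samples and compares the Euclidean error with a deviation bound of the form (1 + sqrt(log(1/δ))) / sqrt(n); that this is the form intended in the statement above is an assumption of the sketch, and the distribution, sample size, and δ are arbitrary example values.

```python
import numpy as np

def empirical_label_vector(samples, k):
    """Average of the one-hot encodings of the samples."""
    one_hot = np.eye(k)[np.asarray(samples)]
    return one_hot.mean(axis=0)

def l2_deviation_bound(n, delta):
    """Assumed bound on the l2 error that holds with probability at least 1 - delta."""
    return (1.0 + np.sqrt(np.log(1.0 / delta))) / np.sqrt(n)

rng = np.random.default_rng(0)
p = np.array([0.5, 0.2, 0.2, 0.1])       # example probability vector
n, delta = 10_000, 0.05
samples = rng.choice(len(p), size=n, p=p)
p_hat = empirical_label_vector(samples, len(p))
error = float(np.linalg.norm(p_hat - p))
bound = float(l2_deviation_bound(n, delta))
```

Note that the assumed bound does not depend on the number of classes, only on the sample size and the confidence level.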
b.4 Proof of Theorem 2.2
We want to ultimately bound . By addition and subtraction we have
(9) 
where and we used optimality of . Here (a) is the weight estimation error and (b) is the finite sample error.
Uniform law for bounding (b)
We bound (b) using standard results for uniform laws for uniformly bounded functions, which hold since and . Since , by applying McDiarmid’s inequality we then obtain that
where and the Rademacher complexity is defined as
of the hypothesis class (see, for example, Percy Liang's notes on Statistical Learning Theory and Chapter 4 in Wainwright (2019)).
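For intuition about the Rademacher complexity term, the following sketch estimates the empirical Rademacher complexity of a finite hypothesis class by Monte Carlo and compares it with Massart's finite-class bound. The finite class, the random loss matrix standing in for per-sample (possibly weighted) losses in [0, 1], and the function name are assumptions of the sketch; the class considered in the theorem need not be finite.

```python
import numpy as np

def empirical_rademacher(loss_matrix, num_draws, rng):
    """Monte Carlo estimate of E_sigma[ max_h (1/n) * sum_i sigma_i * loss_h(i) ]."""
    num_hyp, n = loss_matrix.shape
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))  # Rademacher signs
    correlations = sigma @ loss_matrix.T / n              # shape (num_draws, num_hyp)
    return float(correlations.max(axis=1).mean())

rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 1.0, size=(5, 200))  # 5 hypotheses, 200 samples
rad = empirical_rademacher(losses, num_draws=2000, rng=rng)
# Massart's lemma for losses in [0, 1]: at most sqrt(2 * log(num_hyp) / n).
massart = float(np.sqrt(2.0 * np.log(losses.shape[0]) / losses.shape[1]))
```

The estimate shrinks as the number of samples grows and increases slowly with the number of hypotheses, which mirrors the role this term plays in the finite sample error (b).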
Bounding term (a)
Recall that is the cardinality of the finite domain of , or the number of classes. Let us define with . Notice that by definition