## 1 Introduction

Machine learning models deployed in the real world
typically encounter examples
from previously unseen distributions.
While the IID assumption enables us to
evaluate models using held-out data
from the *source* distribution
(from which training data is sampled),
this estimate is no longer valid
in the presence of a distribution shift.
Moreover, under such shifts,
model accuracy tends to degrade
(szegedy2013intriguing; recht2019imagenet; wilds2021).
Commonly, the only data
available to the practitioner
are a labeled training set (source)
and unlabeled deployment-time (target) data, which makes the problem more difficult.
In this setting, detecting shifts
in the distribution of covariates
is known to be possible (but difficult)
in theory (ramdas2015decreasing),
and in practice (rabanser2018failing).
However, producing an optimal predictor
using only labeled source and unlabeled target data
is well-known to be impossible absent further assumptions (ben2010impossibility; lipton2018detecting).

Two vital questions remain: (i) under precisely what conditions can we estimate a classifier's target-domain accuracy; and (ii) which methods are most useful in practice? To begin, the straightforward way to assess the performance of a model under distribution shift would be to collect labeled (target domain) examples and then to evaluate the model on that data. However, collecting fresh labeled data from the target distribution is prohibitively expensive and time-consuming, especially if the target distribution is non-stationary. Hence, instead of using labeled data, we aim to use unlabeled data from the target distribution, which is comparatively abundant, to predict model performance. Note that in this work, our focus is *not* to improve performance on the target but, rather, to estimate the accuracy on the target for a given classifier.

Recently, numerous methods have been proposed for this purpose (deng2021labels; chen2021mandoline; jiang2021assessing; deng2021does; guillory2021predicting). These methods either require calibration on the target domain to yield consistent estimates (jiang2021assessing; guillory2021predicting), or additional labeled data from several target domains to learn a linear regression function, on top of a distributional distance, that predicts model performance (deng2021does; deng2021labels; guillory2021predicting). However, methods that require calibration on the target domain typically yield poor estimates, since deep models trained and calibrated on source data are not, in general, calibrated on a (previously unseen) target domain (ovadia2019can). Moreover, methods that leverage labeled data from target domains rely on unseen target domains exhibiting a strong linear correlation with the seen target domains on the underlying distance measure and, hence, can be rendered ineffective when such labeled target domains are unavailable (in sec:exp_results we demonstrate such a failure on a real-world distribution shift problem). Therefore, throughout the paper, we assume access to labeled source data and only unlabeled data from the target domain(s).

In this work, we first show that
absent assumptions
on the source classifier or the nature of the shift,
no method of estimating accuracy will work generally
(even in non-contrived settings).
To estimate accuracy on the target domain *perfectly*,
we highlight that even given perfect knowledge
of the labeled source distribution (i.e., $p_S(x, y)$)
and the unlabeled target distribution (i.e., $p_T(x)$),
we need restrictions on the nature of the shift
such that we can uniquely identify
the target conditional $p_T(y \mid x)$.
Thus, in general, identifying the accuracy of the classifier
is as hard as identifying the optimal predictor.

Second, motivated by the superiority of methods that use maximum softmax probability (or logit) of a model for Out-Of-Distribution (OOD) detection

(hendrycks2016baseline; hendrycks2019scaling), we propose a simple method that leverages softmax probability to predict model performance. Our method, Average Thresholded Confidence (ATC), learns a threshold on a score (e.g., maximum confidence or negative entropy) of model confidence on validation source data and predicts target domain accuracy as the fraction of unlabeled target points that receive a score above that threshold. ATC selects a threshold on validation source data such that the fraction of source examples that receive the score above the threshold match the accuracy of those examples. Our primary contribution in ATC is the proposal of obtaining the threshold and observing its efficacy on (practical) accuracy estimation. Importantly, our work takes a step forward in positively answering the question raised in deng2021labels; deng2021does about a practical strategy to select a threshold that enables accuracy prediction with thresholded model confidence.ATC is simple to implement with existing frameworks, compatible with arbitrary model classes, and dominates other contemporary methods. Across several model architectures on a range of benchmark vision and language datasets, we verify that ATC outperforms prior methods by at least – in predicting target accuracy on a variety of distribution shifts. In particular, we consider shifts due to common corruptions (e.g., ImageNet-C), natural distribution shifts due to dataset reproduction (e.g., ImageNet-v2, ImageNet-R), shifts due to novel subpopulations (e.g., Breeds), and distribution shifts faced in the wild (e.g., Wilds).

As a starting point for theory development, we investigate ATC in a simple toy model that captures distribution shift through varying proportions of a subpopulation carrying a spurious feature, as in nagarajan2020understanding. Finally, we note that although ATC achieves superior performance in our empirical evaluation, like all methods, it must fail (returning inconsistent estimates) on certain types of distribution shifts, per our impossibility result.

## 2 Prior Work

Out-of-distribution detection. The main goal of OOD detection is to identify previously unseen examples, i.e., samples outside the support of the training distribution. To accomplish this, modern methods utilize the confidences or features learned by a deep network trained on some source data. hendrycks2016baseline; geifman2017selective used the confidence score of an (already) trained deep model to identify OOD points. lakshminarayanan2016simple use the entropy of an ensemble model to evaluate prediction uncertainty on OOD points. To improve OOD detection with model confidence, liang2017enhancing propose using temperature scaling and input perturbations. jiang2018trust propose scores based on the relative distance of the predicted class to the second class. Recently, residual flow-based methods were used to obtain a density model for OOD detection (zhang2020hybrid). ji2021predicting proposed a method based on subfunction error bounds to compute per-sample unreliability. Refer to ovadia2019can; ji2021predicting for an overview and comparison of methods for prediction uncertainty on OOD data.

Predicting model generalization. Understanding generalization capabilities of overparameterized models on in-distribution data using conventional machine learning tools has been a focus of a long line of work; representative research includes neyshabur2015norm; neyshabur2017exploring; neyshabur2017implicit; neyshabur2018role; dziugaite2017computing; bartlett2017spectrally; zhou2018non; long2019generalization; nagarajan2019deterministic. At a high level, this line of research bounds the generalization gap directly with complexity measures calculated on the trained model. However, these bounds typically remain numerically loose relative to the true generalization error (zhang2016understanding; nagarajan2019uniform). On the other hand, another line of research departs from complexity-based approaches to use unseen unlabeled data to predict in-distribution generalization (platanios2016estimating; platanios2017estimating; garg2021ratt; jiang2021assessing).

Relevant to our work are methods for predicting the error of a classifier on OOD data based on unlabeled data from the target (OOD) domain. These methods can be characterized into two broad categories: (i) Methods which explicitly predict correctness of the model on individual unlabeled points (deng2021labels; jiang2021assessing; deng2021does); and (ii) Methods which directly obtain an estimate of error with unlabeled OOD data without making a point-wise prediction (chen2021mandoline; guillory2021predicting; chuang2020estimating).

To achieve a consistent estimate of the target accuracy,
jiang2021assessing; guillory2021predicting require
calibration on the target domain.
However, these methods typically yield poor estimates
as deep models trained and calibrated on some source data
are seldom calibrated on previously
unseen domains (ovadia2019can).
Additionally, deng2021labels; guillory2021predicting
derive model-based distribution statistics
on unlabeled target set that correlate
with the target accuracy and propose
to use a subset of *labeled* target
domains to learn a (linear) regression
function that predicts model performance.
However, there are two drawbacks to this approach:
(i) the correlation of these distribution
statistics with accuracy can vary substantially
across shifts of a different nature
(refer to sec:exp_results,
where we empirically demonstrate this failure);
(ii) even if there exists a (hypothetical)
statistic with strong correlations,
obtaining labeled target domains (even simulated ones)
with strong correlations would require significant
*a priori* knowledge about the nature of shift
that, in general, might not be available
before models are deployed in the wild.
In contrast, in our work, we only assume access
to labeled data from the source domain, presuming
no access to labeled target domains or
information about how to simulate them.

Moreover, unlike the parallel work of deng2021does, we do not focus on methods that alter the training on source data to aid accuracy prediction on the target data. chen2021mandoline propose an importance re-weighting based approach that leverages (additional) information about the axis along which the distribution is shifting, in the form of "slicing functions". In our work, we compare against the importance re-weighting baseline from chen2021mandoline, since we do not assume any additional information about the axis along which the distribution is shifting.

## 3 Problem Setup

Notation. By $\|\cdot\|_2$ and $\langle \cdot\,, \cdot \rangle$, we denote the Euclidean norm and inner product, respectively. For a vector $v \in \mathbb{R}^d$, we use $v_j$ to denote its $j$-th entry, and for an event $E$ we let $\mathbb{1}[E]$ denote the binary indicator of the event.

Suppose we have a multi-class classification problem with input domain $\mathcal{X} \subseteq \mathbb{R}^d$ and label space $\mathcal{Y} = \{1, 2, \ldots, k\}$. For binary classification, we use $\mathcal{Y} = \{0, 1\}$. By $\mathcal{D}^S$ and $\mathcal{D}^T$, we denote the source and target distributions over $\mathcal{X} \times \mathcal{Y}$. For the distributions $\mathcal{D}^S$ and $\mathcal{D}^T$, we define $p_S$ and $p_T$ as the corresponding probability density (or mass) functions. A dataset $S = \{(x_i, y_i)\}_{i=1}^{n}$ contains points sampled i.i.d. from $\mathcal{D}^S$. Let $\mathcal{F}$ be a class of hypotheses mapping $\mathcal{X}$ to $\Delta^{k-1}$, where $\Delta^{k-1}$ is the simplex over $k$ classes. Given a classifier $f \in \mathcal{F}$ and a datum $(x, y)$, we denote the 0-1 error (i.e., classification error) on that point by $\mathcal{E}(f(x), y) = \mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]$. Given a model $f$, our goal in this work is to understand the performance of $f$ on $\mathcal{D}^T$ without access to labeled data from $\mathcal{D}^T$. Note that our goal is not to adapt the model to the target data. Concretely, we aim to predict the accuracy of $f$ on $\mathcal{D}^T$. Throughout this paper, we assume we have access to the following: (i) the model $f$; (ii) previously-unseen (validation) data from $\mathcal{D}^S$; and (iii) unlabeled data from the target distribution $\mathcal{D}^T$.

### 3.1 Accuracy Estimation: Possibility and Impossibility Results

First, we investigate the question of when it is possible to estimate the target accuracy of an arbitrary classifier, even given knowledge of the full source distribution $p_S(x, y)$ and the target marginal $p_T(x)$. Absent assumptions on the nature of the shift, estimating target accuracy is impossible. Even given access to $p_S(x, y)$ and $p_T(x)$, the problem is fundamentally unidentifiable because $p_T(y \mid x)$ can shift arbitrarily. In the following proposition, we show that absent assumptions on the classifier (i.e., when $f$ can be any classifier in the space of all classifiers on $\mathcal{X}$), we can estimate accuracy on the target data iff assumptions on the nature of the shift, together with $p_S(x, y)$ and $p_T(x)$, uniquely identify the (unknown) target conditional $p_T(y \mid x)$. We relegate proofs from this section to app:proof_setup.

Absent further assumptions, the accuracy on the target is identifiable iff $p_T(y \mid x)$ is uniquely identified given $p_S(x, y)$ and $p_T(x)$.

prop:characterization states that we need enough constraints on the nature of the shift such that $p_S(x, y)$ and $p_T(x)$ identify a unique $p_T(y \mid x)$. It also states that under some assumptions on the nature of the shift, we can hope to estimate the model's accuracy on target data. We illustrate this with two common assumptions made in the domain adaptation literature: (i) covariate shift (heckman1977sample; shimodaira2000improving) and (ii) label shift (saerens2002adjusting; zhang2013domain; lipton2018detecting). Under the covariate shift assumption, the target marginal support is a subset of the source marginal support and the conditional distribution of labels given inputs does not change within that support, i.e., $p_T(y \mid x) = p_S(y \mid x)$, which, trivially, identifies a unique target conditional. Under label shift, the reverse holds: the class-conditional distribution does not change ($p_T(x \mid y) = p_S(x \mid y)$) and, again, the information in $p_T(x)$ uniquely determines the target conditional $p_T(y \mid x)$ (lipton2018detecting; garg2020unified). In these settings, one can estimate an arbitrary classifier's accuracy on the target domain by importance re-weighting: with the ratio $p_T(x)/p_S(x)$ in the case of covariate shift, or with the ratio $p_T(y)/p_S(y)$ in the case of label shift. While the importance ratios in the former case can be obtained directly when $p_S(x)$ and $p_T(x)$ are known, the importance ratios in the latter case can be obtained using techniques from saerens2002adjusting; lipton2018detecting; azizzadenesheli2019regularized; alexandari2019adapting. In app:estimate_label_covariate, we explore accuracy estimation in the setting of these shifts and present extensions to generalized notions of label shift (tachet2020domain) and covariate shift (rojas2018invariant).
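To make the label shift case concrete, below is a minimal sketch of importance-weighted accuracy estimation. It assumes the target label marginal is known (in practice it would be estimated, e.g., with the techniques cited above); the function name and the toy construction are ours, for illustration only.

```python
import numpy as np

def label_shift_accuracy(preds, labels, p_s, p_t):
    """Estimate a fixed classifier's target accuracy under label shift.

    Under label shift p(x|y) is unchanged, so target accuracy equals the
    source accuracy re-weighted per class by w(y) = p_T(y) / p_S(y).

    preds, labels: predictions and labels on held-out *source* data.
    p_s, p_t: source and target label marginals (assumed known here).
    """
    w = p_t / p_s                                  # per-class importance ratios
    correct = (preds == labels).astype(float)
    return float(np.mean(w[labels] * correct))

# Toy check: a classifier that is 90% accurate on class 0 and 60% on class 1.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=100_000)          # uniform source labels
acc_per_class = np.where(labels == 0, 0.9, 0.6)
keep = rng.random(100_000) < acc_per_class
preds = np.where(keep, labels, 1 - labels)

# Target marginal shifts to 30/70; true target accuracy = 0.3*0.9 + 0.7*0.6 = 0.69.
est = label_shift_accuracy(preds, labels,
                           p_s=np.array([0.5, 0.5]),
                           p_t=np.array([0.3, 0.7]))
print(est)
```

Under covariate shift, the same template applies with per-example weights $p_T(x)/p_S(x)$ in place of the per-class ratios.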

As a corollary of prop:characterization, we now present a simple impossibility result, demonstrating that no single method can work for all families of distribution shift.

Absent assumptions on the classifier $f$, no method of estimating accuracy will work in all scenarios, i.e., across distribution shifts of a different nature. Intuitively, this result states that every method of estimating accuracy on target data is tied to some assumption on the nature of the shift and may not be useful under a different assumption on the nature of the shift. For illustration, consider a setting where we have access to $p_S(x, y)$ and $p_T(x)$. Additionally, assume that the distribution can shift only due to covariate shift or label shift, without any knowledge of which one. Then corollary:impossible says that it is impossible to have a single method that works simultaneously for both label shift and covariate shift, as in the following example (we spell out the details in app:proof_setup):

Example 1. Consider a binary classification problem in which the source distribution $p_S(x, y)$ and the target marginal $p_T(x)$ are fixed (the precise construction appears in app:proof_setup). The error of a classifier $f$ on the target data takes one value under covariate shift and a different value under label shift; in app:proof_setup, we show that these two values differ for every choice in the construction. Thus, given access to $p_S(x, y)$ and $p_T(x)$ alone, any method that consistently estimates the error of a classifier under covariate shift will give an incorrect estimate of the error under label shift, and vice-versa. The reason is that the same $p_S(x, y)$ and $p_T(x)$ can correspond to one error under covariate shift and another under label shift, and determining which scenario one faces requires further assumptions on the nature of the shift.

## 4 Predicting accuracy with Average Thresholded Confidence

In this section, we present our method, ATC, which leverages a black-box classifier $f$ and (labeled) validation source data to predict accuracy on the target domain, given access to unlabeled target data. Throughout the discussion, we assume that the classifier $f$ is fixed.

Before presenting our method, we introduce some terminology. Define a score function $s: \Delta^{k-1} \to \mathbb{R}$ that takes in the softmax prediction of the function $f$ and outputs a scalar. We want a score function such that if it takes a high value at a datum $x$, then $f$ is likely to be correct at $x$. In this work, we explore two such score functions: (i) Maximum confidence, i.e., $s(f(x)) = \max_{j \in \mathcal{Y}} f_j(x)$; and (ii) Negative entropy, i.e., $s(f(x)) = \sum_{j \in \mathcal{Y}} f_j(x) \log f_j(x)$. Our method identifies a threshold $t$ on source data such that the expected fraction of points that obtain a score less than $t$ matches the error of $f$ on $\mathcal{D}^S$, i.e.,

$$\mathbb{E}_{x \sim \mathcal{D}^S}\left[\mathbb{1}\left[s(f(x)) < t\right]\right] = \mathbb{E}_{(x, y) \sim \mathcal{D}^S}\left[\mathbb{1}\left[\arg\max_{j \in \mathcal{Y}} f_j(x) \ne y\right]\right], \tag{1}$$

and then our error estimate on the target domain is given by the expected fraction of target points that obtain a score less than $t$, i.e.,

$$\widehat{\mathrm{Err}}_{\mathcal{D}^T}(f) = \mathbb{E}_{x \sim \mathcal{D}^T}\left[\mathbb{1}\left[s(f(x)) < t\right]\right]. \tag{2}$$

In short, in (1), ATC selects a threshold $t$ on the score function such that the error on the source domain matches the expected fraction of points that receive a score below $t$, and in (2), ATC predicts the error on the target domain as the fraction of unlabeled points that obtain a score below that threshold $t$. Note that, in principle, there exists a different threshold on the target distribution such that (1) is satisfied on $\mathcal{D}^T$. However, in our experiments, the same threshold performs remarkably well. The main empirical contribution of our work is to show that the threshold obtained with (1) may be used effectively in conjunction with modern deep networks in a wide range of settings to estimate error on the target data. In practice, to obtain the threshold with ATC, we minimize the difference between the expressions on the two sides of (1) using finite samples. In the next section, we show that ATC precisely predicts accuracy on OOD data, recovering the desired line $y = x$. In app:interpretation, we discuss an alternate interpretation of the method and make connections with OOD detection methods.
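Concretely, the two steps in (1) and (2) take only a few lines. The NumPy sketch below is our own illustration (the finite-sample threshold is found as the error-quantile of source scores), not the authors' released code:

```python
import numpy as np

def _score(probs, score):
    """Score each softmax row: max confidence, or negative entropy."""
    if score == "max_conf":
        return probs.max(axis=1)
    return np.sum(probs * np.log(probs + 1e-12), axis=1)   # sum_j p_j log p_j

def atc_threshold(source_probs, source_labels, score="neg_entropy"):
    """Pick t so that the fraction of source scores below t equals the
    source error (the finite-sample version of Eq. 1)."""
    s = _score(source_probs, score)
    err = np.mean(np.argmax(source_probs, axis=1) != source_labels)
    # The fraction of scores below the q-quantile is q, so take the
    # err-quantile of the source scores.
    return float(np.quantile(s, err))

def atc_estimate_error(target_probs, t, score="neg_entropy"):
    """Predicted target error: fraction of target scores below t (Eq. 2)."""
    return float(np.mean(_score(target_probs, score) < t))
```

In practice one would feed in temperature-scaled softmax outputs, matching the calibrated variants evaluated later in the paper.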

## 5 Experiments

Figure: *Scatter plot of predicted accuracy versus (true) OOD accuracy.* Each point denotes a different OOD dataset, all evaluated with the same DenseNet121 model. We plot only the best three methods. With ATC (ours), we refer to ATC-NE. ATC significantly outperforms other methods, and with ATC a robust linear fit recovers the desired $y = x$ line. Aggregated estimation error appears in table:error_estimation; plots for other datasets and architectures appear in app:results.

We now empirically evaluate ATC and compare it with existing methods. In each of our main experiments, keeping the underlying model fixed, we vary the target dataset and predict the target accuracy with various methods given access only to unlabeled data from the target. Unless noted otherwise, all models are trained only on samples from the source distribution, with the main exception of pre-training on a different distribution. We use labeled examples from the target distribution only to compute the true error.

Datasets. First, we consider synthetic shifts induced by different visual corruptions (e.g., shot noise, motion blur) in ImageNet-C (hendrycks2019benchmarking). Next, we consider natural shifts due to differences in the data collection process of ImageNet (russakovsky2015imagenet), e.g., ImageNet-v2 (recht2019imagenet). We also consider images with artistic renditions of object classes, i.e., ImageNet-R (hendrycks2021many), and ImageNet-Sketch (wang2019learning). Note that the renditions dataset contains only a subset of ImageNet classes. To include it in our testbed, we report results on ImageNet restricted to these classes (which we call ImageNet-200) along with the full ImageNet.

Second, we consider Breeds (santurkar2020breeds) to assess robustness to subpopulation shifts, in particular, to understand how accuracy estimation methods behave when novel subpopulations not observed during training are introduced. Breeds leverages the class hierarchy in ImageNet to create four datasets: Entity-13, Entity-30, Living-17, and Non-living-26. We focus on natural and synthetic shifts, as in ImageNet, on the same and different subpopulations in Breeds. Third, from the Wilds (wilds2021) benchmark, we consider FMoW-Wilds (christie2018functional), RxRx1-Wilds (taylor2019rxrx1), Amazon-Wilds (ni2019justifying), and CivilComments-Wilds (borkan2019nuanced) to cover distribution shifts faced in the wild.

Finally, similar to ImageNet, we consider (i) synthetic shifts (CIFAR-10-C) due to common corruptions; and (ii) a natural shift (i.e., CIFARv2 (recht2018cifar)) on CIFAR-10 (krizhevsky2009learning). On CIFAR-100, we only have synthetic shifts due to common corruptions. For completeness, we also consider natural shifts on MNIST (lecun1998mnist) as in prior work (deng2021labels), using three real shifted datasets: USPS (hull1994database), SVHN (netzer2011reading), and QMNIST (qmnist-2019). We give a detailed overview of our setup in app:dataset.

Architectures and Evaluation. For the ImageNet, Breeds, CIFAR, FMoW-Wilds, and RxRx1-Wilds datasets, we use DenseNet121 (huang2017densely) and ResNet50 (he2016deep) architectures. For Amazon-Wilds and CivilComments-Wilds, we fine-tune a DistilBERT-base-uncased (Sanh2019DistilBERTAD) model.

For MNIST, we train a fully connected multilayer perceptron. We use standard training with benchmarked hyperparameters. To compare methods, we report the average absolute difference between the true accuracy on the target data and the estimated accuracy on the same unlabeled examples. We refer to this metric as Mean Absolute estimation Error (MAE). Along with MAE, we also show scatter plots to visualize performance on individual target sets. Refer to app:exp_setup for additional details on the setup.
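For concreteness, the MAE metric reduces to one line (the accuracy values below are made-up numbers, for illustration only):

```python
import numpy as np

def mean_absolute_estimation_error(true_acc, pred_acc):
    """MAE between true and estimated accuracy across target datasets."""
    true_acc, pred_acc = np.asarray(true_acc), np.asarray(pred_acc)
    return float(np.mean(np.abs(true_acc - pred_acc)))

# Hypothetical example: three target datasets.
mae = mean_absolute_estimation_error([0.76, 0.61, 0.55], [0.74, 0.66, 0.52])
print(mae)
```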

Methods. With ATC-NE, we denote ATC with the negative entropy score function, and with ATC-MC, we denote ATC with the maximum confidence score function. For all methods, we implement *post-hoc* calibration on validation source data with Temperature Scaling (TS; guo2017calibration). Below we briefly discuss the baseline methods compared in our work and relegate details to app:baselines.
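Post-hoc temperature scaling can be fit with a small search over the temperature on validation source data. The sketch below is our illustration of the TS recipe from guo2017calibration (a grid search stands in for the usual LBFGS optimization), not code from this paper:

```python
import numpy as np

def fit_temperature(logits, labels):
    """Fit a scalar temperature T on held-out source validation data by
    minimizing the NLL of softmax(logits / T) over a log-spaced grid."""
    def nll(T):
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)             # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    grid = np.exp(np.linspace(-3.0, 3.0, 601))           # T in [e^-3, e^3]
    return float(grid[np.argmin([nll(T) for T in grid])])
```

Dividing logits by the fitted $T$ before the softmax leaves the argmax prediction unchanged while recalibrating the confidences that ATC and the baselines consume.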

*Average Confidence (AC). * Error is estimated as the expected value of one minus the maximum softmax confidence on the target data, i.e., $\mathbb{E}_{x \sim \mathcal{D}^T}\left[1 - \max_{j \in \mathcal{Y}} f_j(x)\right]$.

*Difference Of Confidence (DOC). * We estimate the error on the target by subtracting the difference of confidences on source and target (as a surrogate for distributional distance; guillory2021predicting) from the error on the source distribution, i.e., $\mathbb{E}_{(x, y) \sim \mathcal{D}^S}\left[\mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]\right] + \mathbb{E}_{x \sim \mathcal{D}^S}\left[\max_j f_j(x)\right] - \mathbb{E}_{x \sim \mathcal{D}^T}\left[\max_j f_j(x)\right]$. This is referred to as DOC-Feat in (guillory2021predicting).
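As we read them, AC and DOC each reduce to a few lines. This sketch uses our own function names and assumes softmax outputs stored as NumPy arrays:

```python
import numpy as np

def ac_error(target_probs):
    """AC: estimated error = 1 - average max softmax confidence on target."""
    return float(1 - target_probs.max(axis=1).mean())

def doc_error(source_probs, source_labels, target_probs):
    """DOC-Feat: source error plus the drop in average max confidence
    from source to target."""
    src_err = np.mean(source_probs.argmax(axis=1) != source_labels)
    return float(src_err + source_probs.max(axis=1).mean()
                 - target_probs.max(axis=1).mean())
```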

*Importance re-weighting (IM). * We estimate the error of the classifier with importance re-weighting of the 0-1 error in the pushforward space of the classifier. This corresponds to Mandoline with a single slice based on the underlying classifier confidence (chen2021mandoline).
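A minimal version of this single-slice re-weighting, with histogram bins over the classifier's max confidence, can look as follows (this is our simplification for illustration, not Mandoline's actual implementation):

```python
import numpy as np

def im_error(source_conf, source_correct, target_conf, n_bins=10):
    """Importance re-weighting in the classifier's confidence (pushforward)
    space: weight source 0-1 errors by the target/source bin-mass ratio."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    src_bin = np.clip(np.digitize(source_conf, bins) - 1, 0, n_bins - 1)
    tgt_bin = np.clip(np.digitize(target_conf, bins) - 1, 0, n_bins - 1)
    src_mass = np.bincount(src_bin, minlength=n_bins) / len(source_conf)
    tgt_mass = np.bincount(tgt_bin, minlength=n_bins) / len(target_conf)
    w = np.where(src_mass > 0, tgt_mass / np.maximum(src_mass, 1e-12), 0.0)
    return float(np.mean(w[src_bin] * (1 - source_correct)))
```

As noted in Section 6, estimates from this style of method can be sensitive to the number of bins and the binning scheme.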

*Generalized Disagreement Equality (GDE). *
Error is estimated as the expected disagreement of two models (trained on the same training set but with different randomization) on target data (jiang2021assessing), i.e., $\mathbb{E}_{x \sim \mathcal{D}^T}\left[\mathbb{1}\left[\arg\max_j f_j(x) \ne \arg\max_j f'_j(x)\right]\right]$, where $f$ and $f'$ are the two models. Note that GDE requires two models trained independently, doubling the computational overhead of training.

### 5.1 Results

In table:error_estimation, we report MAE results aggregated by the nature of the shift in our testbed. In fig:scatter_plot and fig:intro(right), we show scatter plots of predicted accuracy versus OOD accuracy on several datasets. We include scatter plots for all datasets and parallel results with other architectures in app:results. In app:cifar_result, we also perform ablations on CIFAR using a pre-trained model and observe that pre-training does not change the efficacy of ATC.

We predict accuracy on the target data before and after calibration with TS. First, we observe that both ATC-NE and ATC-MC (even without TS) obtain significantly lower MAE than other methods (even with TS). Note that with TS we observe substantial improvements in MAE for all methods. Overall, ATC-NE (with TS) typically achieves the smallest MAE, improving over GDE (the next best alternative to ATC) on both CIFAR and ImageNet. Alongside, we also observe that a linear fit with robust regression (siegel1982robust) on the scatter plot recovers a line close to $y = x$ for ATC-NE with TS, while the line is far from $y = x$ for other methods (fig:scatter_plot and fig:intro(right)). Remarkably, MAE with ATC remains small on CIFAR, ImageNet, MNIST, and Wilds. However, MAE is much higher on the Breeds benchmark with novel subpopulations. While we observe a small MAE (i.e., comparable to our observations on other datasets) on Breeds with natural and synthetic shifts from the same subpopulation, MAE on shifts with novel subpopulations is significantly higher for all methods. Note that even on novel subpopulations, ATC continues to dominate all other methods across all datasets in Breeds.

Additionally, for shifts to novel subpopulations in the Breeds setup, we observe a poor linear correlation between the estimated and the actual performance, as shown in fig:ablation (left) (we notice a similar gap in the linear fit for all other methods). Hence, in such a setting, we would expect methods that fine-tune a regression model on labeled target examples from shifts with one subpopulation to perform poorly on shifts with different subpopulations. Corroborating this intuition, we next show that even after fitting a regression model for DOC on natural and synthetic shifts with source subpopulations, ATC without a regression model continues to outperform DOC with a regression model on shifts with novel subpopulations.

Fitting a regression model on Breeds with DOC. Using labeled target data from natural and synthetic shifts for the same subpopulation (same as source), we fit a robust linear regression model (siegel1982robust) to fine-tune DOC as in guillory2021predicting. We then evaluate the fine-tuned DOC (i.e., DOC with the linear model) on natural and synthetic shifts from novel subpopulations in the Breeds benchmark. Although we observe significant improvements in the performance of fine-tuned DOC when compared with DOC (without any fine-tuning), ATC without any regression model continues to perform better than (or similar to) fine-tuned DOC on novel subpopulations (fig:ablation (middle)). Refer to app:breeeds_ablation for details and table:breeds_regression for MAE on Breeds with the regression model.

## 6 Investigating ATC on Toy Model

In this section, we propose
and analyze a simple theoretical
model that distills empirical phenomena
from the previous section
and highlights the efficacy of ATC.
Here, our aim is not to obtain a general
model that captures the complicated real
distributions on high-dimensional input spaces,
such as the images in ImageNet. Instead, to further
our understanding, we focus on an
*easy-to-learn* binary classification task from
nagarajan2020understanding
with linear classifiers
that is rich enough to
exhibit some of the same phenomena
as deep networks on real data distributions.

Consider an easy-to-learn binary classification problem with two features $x = [x_{\text{inv}}, x_{\text{sp}}]$, where $x_{\text{inv}}$ is a fully predictive invariant feature with a margin $\gamma > 0$ and $x_{\text{sp}}$ is a spurious feature (i.e., a feature that is correlated with, but not predictive of, the true label). Conditional on $y \in \{-1, 1\}$, the invariant feature is distributed as $x_{\text{inv}} \cdot y \sim U[\gamma, c]$, where $c$ is a fixed constant greater than $\gamma$. For simplicity, we assume that the label distribution on the source is uniform on $\{-1, 1\}$. The spurious feature $x_{\text{sp}} \in \{-1, 1\}$ is distributed such that $P(x_{\text{sp}} = y) = p_{\text{sp}}$, where $p_{\text{sp}}$ controls the degree of spurious correlation. To model distribution shift, we simulate target data with a different degree of spurious correlation, i.e., $p_{\text{sp}}^T \ne p_{\text{sp}}^S$ in the target distribution. Note that here we do not consider shifts in the label distribution, but our result extends to arbitrary shifts in the label distribution as well.
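This setup is cheap to simulate. The sketch below samples the toy distribution under the parameterization just described (the exact feature parameterization and classifier weights are our illustrative assumptions) and shows that a classifier leaning on the spurious feature loses accuracy as the spurious correlation weakens:

```python
import numpy as np

def sample_toy(n, p_sp, gamma=0.1, c=1.0, seed=0):
    """Sample the toy distribution: y uniform on {-1, +1}; invariant feature
    with x_inv * y ~ U[gamma, c]; binary spurious feature x_sp that equals y
    with probability p_sp."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)
    x_inv = y * rng.uniform(gamma, c, size=n)
    x_sp = np.where(rng.random(n) < p_sp, y, -y)
    return np.stack([x_inv, x_sp], axis=1), y

def accuracy(w, X, y):
    """Accuracy of the linear classifier sign(w . x)."""
    return float(np.mean(np.sign(X @ w) == y))

# A classifier with a positive spurious weight degrades as p_sp drops.
w = np.array([1.0, 0.5])                       # [w_inv, w_sp], chosen by hand
for p_sp in (0.9, 0.7, 0.5):
    X, y = sample_toy(200_000, p_sp=p_sp)
    print(p_sp, accuracy(w, X, y))
```

For these parameters the accuracy works out analytically to $p_{\text{sp}} + (1 - p_{\text{sp}}) \cdot \tfrac{c - 0.5}{c - \gamma}$, which the simulation matches closely.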

In this setup, we examine linear sigmoid classifiers of the form $f(x) = \left[\sigma(w^\top x),\, 1 - \sigma(w^\top x)\right]$, where $w = [w_{\text{inv}}, w_{\text{sp}}]$. While there exists a linear classifier with $w_{\text{sp}} = 0$ that correctly classifies all the points with a margin $\gamma$, nagarajan2020understanding demonstrated that a linear classifier will typically have a dependency on the spurious feature, i.e., $w_{\text{sp}} \ne 0$. They show that due to geometric skews, despite having a positive dependency on the invariant feature, a max-margin classifier trained on finite samples relies on the spurious feature. Refer to app:toy_model for more details on these skews. In our work, we show that given a linear classifier that relies on the spurious feature and achieves non-trivial (better-than-chance) performance on the source, ATC with the maximum confidence score function *consistently* estimates the accuracy on the target distribution.

[Informal] Given any classifier with $w_{\text{inv}} > 0$ in the above setting, the threshold obtained via (1), together with the ATC estimate in (2) with the maximum confidence score function, yields a consistent estimate of the target accuracy.

Consider a classifier that depends positively on the spurious feature (i.e., $w_{\text{sp}} > 0$). Then, as the spurious correlation decreases in the target data, the classifier's accuracy on the target will drop, and vice-versa if the spurious correlation increases. thm:toy_theory shows that the threshold identified with ATC as in (1) remains invariant as the distribution shifts, and hence ATC as in (2) will correctly estimate the accuracy under shifting distributions. Next, we illustrate thm:toy_theory by simulating the setup empirically. First, we pick an arbitrary classifier (which can also be obtained by training on source samples), tune the threshold on held-out source examples, and predict accuracy with different methods as we shift the distribution by varying the degree of spurious correlation.

Empirical validation and comparison with other methods. fig:ablation (right) shows that as the degree of spurious correlation varies, our method accurately estimates the target performance, whereas all other methods fail to do so. Understandably, AC, DOC, and GDE fail due to the poor calibration of the sigmoid linear classifier. While in principle IM can perfectly estimate the accuracy on the target in this case, we observe that it is highly sensitive to the number of bins and the choice of histogram binning (i.e., uniform-mass or equal-width binning). We elaborate on this in app:toy_model.

Biased estimation with ATC. We now discuss changes to the above setup under which ATC yields inconsistent estimates. We assumed that, conditional on the label, $x_{\text{inv}} \cdot y$ is uniform on $[\gamma, c]$ in both the source and the target. Shifting the support of the target class-conditional distribution may introduce a bias in ATC estimates; e.g., shrinking the support of $x_{\text{inv}}$ in the target (while maintaining a uniform distribution) can lead to an over-estimation of the target performance with ATC. In app:general_result, we elaborate on this failure and present a general (but less interpretable) classifier-dependent distribution shift condition under which ATC is guaranteed to yield consistent estimates.

## 7 Conclusion and future work

In this work, we proposed ATC, a simple method for estimating target domain accuracy based on unlabeled target data (and labeled source data). ATC achieves remarkably low estimation error on several synthetic and natural shift benchmarks in our experiments. Notably, our work draws inspiration from recent state-of-the-art methods that threshold softmax confidences for OOD detection (hendrycks2016baseline; hendrycks2019scaling) and takes a step forward in answering questions raised in deng2021labels about the practicality of threshold-based methods.

Our distribution shift toy model justifies ATC on an easy-to-learn binary classification task. In our experiments, we also observe that calibration significantly improves estimation with ATC. Since, in binary classification, post hoc calibration with TS does not change the effective threshold, in future work we hope to extend our theoretical model to multi-class classification to understand the efficacy of calibration. Our theory establishes that a classifier's accuracy is not, in general, identified from labeled source and unlabeled target data alone, absent considerable additional constraints on the target conditional $p_T(y \mid x)$. In light of this finding, we also hope to extend our understanding beyond the simple theoretical toy model to characterize broader sets of conditions under which ATC is guaranteed to obtain consistent estimates. Finally, we note that while ATC outperforms previous approaches, it still suffers from large estimation error on datasets with novel subpopulations, e.g., Breeds. We hope that our findings lay the groundwork for future work on improving accuracy estimation on such datasets.

#### Reproducibility Statement

We have taken care to ensure that our results are reproducible: we have stored all models and logged all hyperparameters and seeds. Note that throughout our work, we do not perform any hyperparameter tuning; instead, we use benchmarked hyperparameters and training procedures, which makes our results easy to reproduce. While we have not yet released code, the appendix provides all the details necessary to replicate our experiments and results, and we plan to release the code with a revised version of the manuscript.

## References

## Appendix

## Appendix A Proofs from sec:setup

Before proving the results from sec:setup, we introduce some notation.
Define the prediction $\hat{y}(x) = \arg\max_{j} f_j(x)$.
We express the *population error* of a classifier $f$ on a distribution $\mathcal{D}$ as
$\mathcal{E}_{\mathcal{D}}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{1}\left[\hat{y}(x) \ne y\right]\right]$.

###### Proof of prop:characterization.

Consider a binary classification problem. Let $\mathcal{P}$ denote the set of target conditional distributions $p_t(y \mid x)$ consistent with the given source distribution $\mathcal{D}_S$ and target marginal $p_t(x)$.

The forward direction is simple. If $\mathcal{P}$ is a singleton given $\mathcal{D}_S$ and $p_t(x)$, then the error of any classifier $f$ on the target domain is identified and is given by

$\mathcal{E}_{\mathcal{D}_T}(f) = \mathbb{E}_{x \sim p_t}\left[\textstyle\sum_{y} p_t(y \mid x)\, \mathbb{1}\left[\hat{y}(x) \ne y\right]\right]. \qquad (3)$

For the reverse direction, assume that given $\mathcal{D}_S$ and $p_t(x)$ we have two possible conditionals $p$ and $p'$ with $p \ne p'$, i.e., on some set of positive probability under $p_t$ we have $p(y \mid x) \ne p'(y \mid x)$. Let $\mathcal{X}'$ be the set of all input covariates where the two conditionals differ. We now choose a classifier $f$ whose error differs under the two conditionals: on the subset of $\mathcal{X}'$ where $p(y = 1 \mid x) > p'(y = 1 \mid x)$, let $f$ predict $1$, and on the subset where $p(y = 1 \mid x) < p'(y = 1 \mid x)$, let $f$ predict $0$. We show that the error of $f$ under $p'$ is strictly greater than its error under $p$. Formally,

$\mathcal{E}_{p'}(f) - \mathcal{E}_{p}(f) = \mathbb{E}_{x \sim p_t}\left[\mathbb{1}\left[x \in \mathcal{X}'\right]\left|p(y = 1 \mid x) - p'(y = 1 \mid x)\right|\right] > 0,$

where the last step follows from the construction of $f$ on $\mathcal{X}'$ and the fact that $\mathcal{X}'$ has positive probability under $p_t$. Since the two errors differ, given only $\mathcal{D}_S$ and $p_t(x)$ it is impossible to distinguish the two values of the error of $f$. Thus, we obtain a contradiction with the assumption that the error is identified. Hence, we must impose restrictions on the nature of the shift such that $\mathcal{P}$ is a singleton in order to identify accuracy on the target. ∎

###### Proof of corollary:impossible.

The corollary follows directly from prop:characterization. Since two different target conditional distributions can lead to different error estimates without assumptions on the classifier, no method can estimate two different quantities from the same given information. We illustrate this in Example 1 next. ∎

## Appendix B Estimating accuracy in covariate shift or label shift

**Accuracy estimation under the covariate shift assumption.** Under the assumption that $p_t(y \mid x) = p_s(y \mid x)$, accuracy on the target domain can be estimated as follows:

$\mathbb{E}_{(x,y)\sim\mathcal{D}_T}\left[\mathbb{1}\left[\hat{y}(x) = y\right]\right] = \mathbb{E}_{x \sim p_t}\left[\textstyle\sum_{y} p_s(y \mid x)\, \mathbb{1}\left[\hat{y}(x) = y\right]\right] \qquad (4)$

$= \mathbb{E}_{x \sim p_s}\left[\frac{p_t(x)}{p_s(x)} \textstyle\sum_{y} p_s(y \mid x)\, \mathbb{1}\left[\hat{y}(x) = y\right]\right]. \qquad (5)$

Given access to the densities $p_t(x)$ and $p_s(x)$ (and to $p_s(y \mid x)$, which is estimable from labeled source data), one can directly estimate the expression in (5).
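As a concrete illustration, the following is a minimal numpy sketch of the importance-weighted estimate in (5) on a hypothetical one-dimensional problem; the Gaussian source/target densities, the noise level, and the sign classifier are all illustrative assumptions, not part of the setup above:

```python
import numpy as np

def covariate_shift_accuracy(preds, labels, p_s, p_t):
    """Importance-weighted estimate of target accuracy from *source* samples.

    preds, labels : model predictions and true labels on source samples
    p_s, p_t      : source / target density values at those same samples
    """
    w = p_t / p_s                          # importance ratios p_t(x) / p_s(x)
    return np.mean(w * (preds == labels))  # estimates E_t[1{f(x) = y}]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative 1-d problem: source N(0,1), target N(1,1), classifier sign(x),
# labels generated by thresholding x plus Gaussian noise (so p(y|x) is shared).
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 200_000)
ys = (xs + rng.normal(0.0, 0.5, xs.size) > 0).astype(int)
preds = (xs > 0).astype(int)
est = covariate_shift_accuracy(preds, ys, normal_pdf(xs, 0, 1), normal_pdf(xs, 1, 1))
```

Because $p_t(y \mid x) = p_s(y \mid x)$ holds here by construction, `est` matches the accuracy one would measure with fresh labeled target samples, up to Monte Carlo error.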

**Accuracy estimation under the label shift assumption.** Under the assumption that $p_t(x \mid y) = p_s(x \mid y)$, accuracy on the target domain can be estimated as follows:

$\mathbb{E}_{(x,y)\sim\mathcal{D}_T}\left[\mathbb{1}\left[\hat{y}(x) = y\right]\right] = \mathbb{E}_{y \sim p_t}\left[\mathbb{E}_{x \sim p_s(x \mid y)}\left[\mathbb{1}\left[\hat{y}(x) = y\right]\right]\right] \qquad (6)$

$= \mathbb{E}_{(x,y)\sim\mathcal{D}_S}\left[\frac{p_t(y)}{p_s(y)}\, \mathbb{1}\left[\hat{y}(x) = y\right]\right]. \qquad (7)$

Estimating the importance ratios is straightforward under the covariate shift assumption when the densities $p_s(x)$ and $p_t(x)$ are known. For label shift, one can leverage the moment-matching approach BBSE (lipton2018detecting) or the likelihood-minimization approach MLLS (garg2020unified). The MLLS objective is

$w^{*} = \arg\max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim p_t}\left[\log \textstyle\sum_{y} f_y(x)\, w_y\right], \qquad (8)$

where $\mathcal{W} = \{w : w_y \ge 0 \text{ and } \textstyle\sum_y w_y\, p_s(y) = 1\}$. The MLLS objective is guaranteed to yield consistent estimates of the importance ratios under the following condition. [Theorem 1 (garg2020unified)] If the distributions $\{p(x \mid y)\}_y$ are strictly linearly independent, then $w^{*}$ is the unique maximizer of the MLLS objective (8). We refer the interested reader to garg2020unified for details.
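To make the label-shift route concrete, the following is a minimal numpy sketch in the spirit of BBSE; the function names, simulated data, and the least-squares solve are our illustrative choices rather than the exact procedure of lipton2018detecting:

```python
import numpy as np

def bbse_weights(src_preds, src_labels, tgt_preds, k):
    """Estimate label-shift importance ratios w_j = p_t(y=j) / p_s(y=j).

    Under label shift p(yhat | y) is invariant, so the source joint confusion
    matrix maps the weights onto the target prediction marginal: C w = mu.
    """
    n = src_labels.size
    C = np.zeros((k, k))                   # C[i, j] = p_s(yhat = i, y = j)
    for i, j in zip(src_preds, src_labels):
        C[i, j] += 1.0 / n
    mu = np.bincount(tgt_preds, minlength=k) / tgt_preds.size  # p_t(yhat = i)
    w, *_ = np.linalg.lstsq(C, mu, rcond=None)  # solve C w = mu
    return np.clip(w, 0.0, None)

def label_shift_accuracy(src_preds, src_labels, w):
    """Importance-weighted source accuracy as in (7): E_s[w(y) 1{f(x) = y}]."""
    return np.mean(w[src_labels] * (src_preds == src_labels))
```

Plugging the estimated weights into the importance-weighted source accuracy then yields the target-accuracy estimate of (7) without any target labels.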

The above results on accuracy estimation under label shift and covariate shift
can be extended to generalized label shift and generalized covariate shift settings.
Assume a function $\phi$ such that $y$ is independent of $x$ given $\phi(x)$; in other words, $\phi(x)$ contains all the information needed to predict the label $y$. With the help of $\phi$, we can extend estimation to the following settings: (i) *generalized covariate shift*, i.e., $p_t(y \mid \phi(x)) = p_s(y \mid \phi(x))$ with the support of $p_t(\phi(x))$ contained in that of $p_s(\phi(x))$; (ii) *generalized label shift*, i.e., $p_t(\phi(x) \mid y) = p_s(\phi(x) \mid y)$ with the support of $p_t(y)$ contained in that of $p_s(y)$. By simply replacing $x$
with $\phi(x)$ in (5) and (8), we obtain consistent error estimates under these generalized conditions.

###### Proof of Example 1.

Then the target conditional distribution is given by

If , then , and if , then . Since this holds for arbitrary shifts, given access to the source distribution and the target marginal, any method that consistently estimates the error under covariate shift will give an incorrect estimate under label shift, and vice versa: the same observed information can correspond to one error under covariate shift and a different error under label shift, and the two are not discernible absent further assumptions on the nature of the shift. ∎

## Appendix C Alternate interpretation of ATC

Consider the following framework: given a datum, define a binary classification problem of whether the model's prediction was correct or incorrect. In particular, if the model's prediction matches the true label, we assign label 1 (positive); conversely, if it does not match, we assign label 0 (negative).

Our method can be interpreted as identifying examples with correct and incorrect predictions based on the value of the score function $s$: if the score is greater than or equal to the threshold $t$, our method predicts that the classifier classified the datum correctly, and vice versa if the score is less than $t$. A method that solves this task perfectly would perfectly estimate the target performance. Such an expectation is unrealistic; instead, ATC expects that *most* examples with scores above the threshold are predicted correctly and most examples with scores below the threshold are predicted incorrectly. More importantly, ATC selects a threshold such that the number of falsely identified correct predictions matches the number of falsely identified incorrect predictions on the source distribution, thereby balancing the two kinds of mistakes. We expect useful accuracy estimates from ATC when the threshold transfers to the target, i.e., when the number of falsely identified correct predictions also matches the number of falsely identified incorrect predictions on the target. This interpretation relates our method to the OOD detection literature, where hendrycks2016baseline; hendrycks2019scaling highlight that classifiers tend to assign higher confidence to in-distribution examples and leverage maximum softmax confidence (or logit) to perform OOD detection.
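Under this interpretation, the whole procedure reduces to a few lines. Below is a minimal numpy sketch, where `scores` stands for the chosen score function (e.g., maximum softmax confidence) and the function names are ours:

```python
import numpy as np

def atc_threshold(src_scores, src_correct):
    """Pick t so that the fraction of source scores below t matches the source
    error rate, balancing falsely identified correct and incorrect predictions."""
    err = 1.0 - np.mean(src_correct)      # observed source error rate
    return np.quantile(src_scores, err)   # ~err fraction of scores fall below t

def atc_accuracy(tgt_scores, t):
    """Predicted target accuracy: fraction of target scores at or above t."""
    return np.mean(tgt_scores >= t)
```

If the threshold transfers, the mass of target scores above `t` tracks the true target accuracy; when it does not transfer (e.g., under the shifts discussed in app:general_result), the estimate is biased.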

## Appendix D Details on the Toy Model

**Skews observed in this toy model.** In fig:toy_model, we illustrate the toy model used in our empirical experiment.
In the same setup, we empirically observe that the margin on the population with less density is large when the number of observed samples is small (fig:toy_model (d)). Building on this observation, nagarajan2020understanding showed that in cases where this margin decreases with the number of samples, a max-margin classifier trained on finite samples is bound to depend on the spurious features. They referred to this skew as the *geometric skew*.

Moreover, even when the number of samples is large enough that we do not observe geometric skews, nagarajan2020understanding showed that when trained for a finite number of epochs, a linear classifier will retain a non-zero dependency on the spurious feature. They referred to this skew as the *statistical skew*. Due to both of these skews, a linear classifier obtained by training for finitely many steps on finitely many samples will have a non-zero dependency on the spurious feature. We refer the interested reader to nagarajan2020understanding for more details.

**Proof of thm:toy_theory.** Recall that we consider an easy-to-learn binary classification problem with two features $x = (x_{\mathrm{inv}}, x_{\mathrm{sp}})$, where $x_{\mathrm{inv}}$ is a fully predictive invariant feature with margin $\gamma$ and $x_{\mathrm{sp}}$ is a spurious feature (i.e., a feature that is correlated with but not predictive of the true label). Conditional on $y$, the distribution over $x$ is given as follows:

(9)

where $c$ is a fixed constant greater than $1$. For simplicity, we assume that the label distribution on the source is uniform on $\{-1, +1\}$. The spurious feature $x_{\mathrm{sp}}$ is distributed such that it agrees with the label $y$ with a probability that controls the degree of spurious correlation. To model distribution shift, we simulate target data with a different degree of spurious correlation in the target distribution. Note that here we do not consider shifts in the label distribution, but our result extends to arbitrary shifts in the label distribution as well.
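To make the setup tangible, here is one plausible instantiation of the toy distribution as numpy code. The uniform form of the invariant feature and the default values of `gamma` and `c` are hypothetical stand-ins, since the exact conditional distribution is the one specified in (9):

```python
import numpy as np

def sample_toy(n, p_spurious, gamma=1.0, c=2.0, rng=None):
    """Sample n points from a plausible instantiation of the toy model.

    y is uniform on {-1, +1}; the invariant feature x_inv satisfies
    y * x_inv in [gamma/2, c*gamma/2] (fully predictive, margin gamma);
    the spurious feature x_sp in {-1, +1} agrees with y w.p. p_spurious.
    """
    if rng is None:
        rng = np.random.default_rng()
    y = rng.choice([-1, 1], size=n)                      # uniform labels
    x_inv = y * rng.uniform(gamma / 2, c * gamma / 2, n)  # margin gamma/2
    agree = np.where(rng.random(n) < p_spurious, 1, -1)
    x_sp = y * agree                                     # spurious feature
    return np.stack([x_inv, x_sp], axis=1), y
```

Varying `p_spurious` between source and target reproduces the kind of shift considered above: the group proportions change while the within-group distributions stay fixed.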

In this setup, we examine linear sigmoid classifiers of the form $f(x) = \sigma(w_{\mathrm{inv}} x_{\mathrm{inv}} + w_{\mathrm{sp}} x_{\mathrm{sp}})$, where $\sigma$ denotes the sigmoid function. We show that given a linear classifier that relies on the spurious feature and achieves non-trivial performance on the source, ATC with the maximum-confidence score function *consistently* estimates the accuracy on the target distribution. Define the majority group as the set of examples whose spurious feature agrees with the label, and the minority group as the set of examples whose spurious feature disagrees with the label.
Notice that in the target distributions we change only the fractions of examples in the majority and minority groups; we do not change the distribution of examples within each group. Given any classifier in the above setting, assume that the threshold $t$ is obtained via the finite-sample approximation of (1), i.e., $t$ is selected such that (this is possible because a linear classifier with a sigmoid activation assigns a unique score to each point in the source distribution)

$\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[s(x_i) < t\right] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[\hat{y}(x_i) \ne y_i\right], \qquad (10)$

where $\{(x_i, y_i)\}_{i=1}^{n}$ are samples from the source distribution. Fix $\delta > 0$. Assuming $n$ is sufficiently large, the estimate of accuracy by ATC as in (2) satisfies the following with probability at least $1 - \delta$:

(11)

where is any target distribution considered in our setting and if and otherwise.

###### Proof.

First, we consider the case in which the classifier errs only on the minority group. The proof follows in two simple steps. First, we note that the classifier makes errors only on some points in the minority group, and the threshold is selected such that the fraction of points with maximum confidence less than the threshold matches the error of the classifier; such a classifier classifies all points in the majority group correctly.
Second, since the distribution of points does not change within the majority and minority groups, the same threshold continues to work under arbitrary shifts in the fraction of examples across the two groups.

Note that in this case the classifier makes no error on points in the majority group and errs only on a subset of the minority group. Consider the set of points that obtain a score less than or equal to the threshold $t$.
We now show that ATC chooses a threshold such that all correctly classified points obtain a score above $t$. First, note that the scores of points close to the true separator on either side match: by symmetry, the score at a point just on one side of the separator equals the score at its mirror image on the other side, i.e.,

(12)

Hence, if the threshold were set any higher, the fraction of points scoring below it would exceed the source error, contradicting the definition of $t$ in (10). Thus all correctly classified points score above $t$.

We now relate the left- and right-hand sides of (10) to their expectations using Hoeffding's inequality and the DKW inequality to conclude (11). Using Hoeffding's bound, with probability at least $1 - \delta/2$ we have

$\left|\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[\hat{y}(x_i) \ne y_i\right] - \mathcal{E}_{\mathcal{D}_S}(f)\right| \le \sqrt{\frac{\log(4/\delta)}{2n}}. \qquad (13)$

By the DKW inequality, with probability at least $1 - \delta/2$ we have

$\sup_{t'}\left|\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[s(x_i) < t'\right] - \mathbb{P}_{x \sim \mathcal{D}_S}\left(s(x) < t'\right)\right| \le \sqrt{\frac{\log(4/\delta)}{2n}} \qquad (14)$

for all thresholds $t'$. Combining (13) and (14) at $t' = t$ with the definition (10), with probability at least $1 - \delta$ we have

$\left|\mathbb{P}_{x \sim \mathcal{D}_S}\left(s(x) < t\right) - \mathcal{E}_{\mathcal{D}_S}(f)\right| \le 2\sqrt{\frac{\log(4/\delta)}{2n}}. \qquad (15)$

Now, for the other case, we can use the same argument with the roles of the two groups exchanged: since all errors now fall in the other group and the classifier makes no error on the first, the threshold is again selected such that the fraction of points with maximum confidence below the threshold matches the classifier's error. As before, since the distribution of points does not change within each group, the same threshold continues to work for arbitrary shifts in the fraction of examples across the groups. Thus, by the same arguments, we obtain (11) in this case as well. ∎
