Leveraging Unlabeled Data to Predict Out-of-Distribution Performance

01/11/2022
by Saurabh Garg, et al.

Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance 2-4× more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works.


1 Introduction

Machine learning models deployed in the real world typically encounter examples from previously unseen distributions. While the IID assumption enables us to evaluate models using held-out data from the source distribution (from which training data is sampled), this estimate is no longer valid in the presence of a distribution shift. Moreover, under such shifts, model accuracy tends to degrade (szegedy2013intriguing; recht2019imagenet; wilds2021). Commonly, the only data available to the practitioner are a labeled training set (source) and unlabeled deployment-time data, which makes the problem more difficult. In this setting, detecting shifts in the distribution of covariates is known to be possible (but difficult) in theory (ramdas2015decreasing), and in practice (rabanser2018failing). However, producing an optimal predictor using only labeled source and unlabeled target data is well-known to be impossible absent further assumptions (ben2010impossibility; lipton2018detecting).

Two vital questions that remain are: (i) the precise conditions under which we can estimate a classifier's target-domain accuracy; and (ii) which methods are most practically useful. To begin, the straightforward way to assess the performance of a model under distribution shift would be to collect labeled (target domain) examples and then to evaluate the model on that data. However, collecting fresh labeled data from the target distribution is prohibitively expensive and time-consuming, especially if the target distribution is non-stationary. Hence, instead of using labeled data, we aim to use unlabeled data from the target distribution, which is comparatively abundant, to predict model performance. Note that in this work, our focus is not to improve performance on the target but, rather, to estimate the accuracy on the target for a given classifier.

Recently, numerous methods have been proposed for this purpose (deng2021labels; chen2021mandoline; jiang2021assessing; deng2021does; guillory2021predicting). These methods either require calibration on the target domain to yield consistent estimates (jiang2021assessing; guillory2021predicting) or additional labeled data from several target domains to learn a linear regression function on a distributional distance that then predicts model performance (deng2021does; deng2021labels; guillory2021predicting). However, methods that require calibration on the target domain typically yield poor estimates since deep models trained and calibrated on source data are not, in general, calibrated on a (previously unseen) target domain (ovadia2019can). Besides, methods that leverage labeled data from target domains rely on the fact that unseen target domains exhibit strong linear correlation with seen target domains on the underlying distance measure and, hence, can be rendered ineffective when such target domains with labeled data are unavailable (in sec:exp_results we demonstrate such a failure on a real-world distribution shift problem). Therefore, throughout the paper, we assume access to labeled source data and only unlabeled data from the target domain(s).

Figure 1: Illustration of our proposed method ATC. Left: using source domain validation data, we identify a threshold on a score (e.g., negative entropy) computed on model confidence such that the fraction of examples above the threshold matches the validation set accuracy. ATC estimates accuracy on unlabeled target data as the fraction of examples with the score above the threshold. Interestingly, this threshold yields accurate estimates on a wide set of target distributions resulting from natural and synthetic shifts. Right: efficacy of ATC over previously proposed approaches on our testbed with a post-hoc calibrated model. To obtain errors on the same scale, we rescale all errors by the Average Confidence (AC) error. Lower estimation error is better. See table:error_estimation for exact numbers and comparisons on various types of distribution shift. See sec:exp for details on our testbed.

In this work, we first show that absent assumptions on the source classifier or the nature of the shift, no method of estimating accuracy will work generally (even in non-contrived settings). To estimate accuracy on the target domain perfectly, we highlight that even given perfect knowledge of the labeled source distribution (i.e., $p_S(x, y)$) and the unlabeled target distribution (i.e., $p_T(x)$), we need restrictions on the nature of the shift such that we can uniquely identify the target conditional $p_T(y \mid x)$. Thus, in general, identifying the accuracy of the classifier is as hard as identifying the optimal predictor.

Second, motivated by the superiority of methods that use the maximum softmax probability (or logit) of a model for Out-Of-Distribution (OOD) detection (hendrycks2016baseline; hendrycks2019scaling), we propose a simple method that leverages softmax probability to predict model performance. Our method, Average Thresholded Confidence (ATC), learns a threshold on a score (e.g., maximum confidence or negative entropy) of model confidence on validation source data and predicts target domain accuracy as the fraction of unlabeled target points that receive a score above that threshold. ATC selects a threshold on validation source data such that the fraction of source examples that receive a score above the threshold matches the accuracy of those examples. Our primary contribution in ATC is the proposal of this threshold-selection rule and the observation of its efficacy for (practical) accuracy estimation. Importantly, our work takes a step forward in positively answering the question raised in deng2021labels; deng2021does about a practical strategy for selecting a threshold that enables accuracy prediction with thresholded model confidence.

ATC is simple to implement with existing frameworks, compatible with arbitrary model classes, and dominates other contemporary methods. Across several model architectures on a range of benchmark vision and language datasets, we verify that ATC consistently outperforms prior methods in predicting target accuracy on a variety of distribution shifts. In particular, we consider shifts due to common corruptions (e.g., ImageNet-C), natural distribution shifts due to dataset reproduction (e.g., ImageNet-v2, ImageNet-R), shifts due to novel subpopulations (e.g., Breeds), and distribution shifts faced in the wild (e.g., Wilds).

As a starting point for theory development, we investigate ATC on a simple toy model that captures distribution shift through varying proportions of a subpopulation with spurious features, as in nagarajan2020understanding. Finally, we note that although ATC achieves superior performance in our empirical evaluation, like all methods, it must fail (returning inconsistent estimates) on certain types of distribution shifts, per our impossibility result.

2 Prior Work

Out-of-distribution detection. The main goal of OOD detection is to identify previously unseen examples, i.e., samples outside the support of the training distribution. To accomplish this, modern methods utilize confidence or features learned by a deep network trained on some source data. hendrycks2016baseline; geifman2017selective used the confidence score of an (already) trained deep model to identify OOD points. lakshminarayanan2016simple use the entropy of an ensemble model to evaluate prediction uncertainty on OOD points. To improve OOD detection with model confidence, liang2017enhancing propose to use temperature scaling and input perturbations. jiang2018trust propose to use scores based on the relative distance of the predicted class to the second class. Recently, residual flow-based methods were used to obtain a density model for OOD detection (zhang2020hybrid). ji2021predicting proposed a method based on subfunction error bounds to compute per-sample unreliability. Refer to ovadia2019can; ji2021predicting for an overview and comparison of methods for predictive uncertainty on OOD data.

Predicting model generalization. Understanding generalization capabilities of overparameterized models on in-distribution data using conventional machine learning tools has been a focus of a long line of work; representative research includes neyshabur2015norm; neyshabur2017exploring; neyshabur2017implicit; neyshabur2018role; dziugaite2017computing; bartlett2017spectrally; zhou2018non; long2019generalization; nagarajan2019deterministic. At a high level, this line of research bounds the generalization gap directly with complexity measures calculated on the trained model. However, these bounds typically remain numerically loose relative to the true generalization error (zhang2016understanding; nagarajan2019uniform). On the other hand, another line of research departs from complexity-based approaches to use unseen unlabeled data to predict in-distribution generalization (platanios2016estimating; platanios2017estimating; garg2021ratt; jiang2021assessing).

Relevant to our work are methods for predicting the error of a classifier on OOD data based on unlabeled data from the target (OOD) domain. These methods can be characterized into two broad categories: (i) Methods which explicitly predict correctness of the model on individual unlabeled points (deng2021labels; jiang2021assessing; deng2021does); and (ii) Methods which directly obtain an estimate of error with unlabeled OOD data without making a point-wise prediction (chen2021mandoline; guillory2021predicting; chuang2020estimating).

To achieve a consistent estimate of the target accuracy, jiang2021assessing; guillory2021predicting require calibration on the target domain. However, these methods typically yield poor estimates as deep models trained and calibrated on some source data are seldom calibrated on previously unseen domains (ovadia2019can). Additionally, deng2021labels; guillory2021predicting derive model-based distribution statistics on the unlabeled target set that correlate with the target accuracy and propose to use a subset of labeled target domains to learn a (linear) regression function that predicts model performance. However, there are two drawbacks with this approach: (i) the correlation of these distribution statistics with accuracy can vary substantially across shifts of different natures (refer to sec:exp_results, where we empirically demonstrate this failure); (ii) even if there exists a (hypothetical) statistic with strong correlations, obtaining labeled target domains (even simulated ones) with strong correlations would require significant a priori knowledge about the nature of the shift that, in general, might not be available before models are deployed in the wild. In contrast, in our work, we only assume access to labeled data from the source domain, presuming no access to labeled target domains or information about how to simulate them.

Moreover, unlike the parallel work of deng2021does, we do not focus on methods that alter the training on source data to aid accuracy prediction on the target data. chen2021mandoline propose an importance re-weighting based approach that leverages (additional) information about the axis along which the distribution is shifting in the form of "slicing functions". In our work, we compare with the importance re-weighting baseline from chen2021mandoline, as we do not have any additional information about the axis along which the distribution is shifting.

3 Problem Setup

Notation. By $\|\cdot\|_2$ and $\langle \cdot, \cdot \rangle$ we denote the Euclidean norm and inner product, respectively. For a vector $v \in \mathbb{R}^d$, we use $v_j$ to denote its $j$-th entry, and for an event $E$ we let $\mathbb{1}\left[E\right]$ denote the binary indicator of the event.

Suppose we have a multi-class classification problem with input domain $\mathcal{X} \subseteq \mathbb{R}^d$ and label space $\mathcal{Y} = \{1, 2, \ldots, k\}$. For binary classification, we use $\mathcal{Y} = \{-1, 1\}$. By $\mathcal{D}_S$ and $\mathcal{D}_T$, we denote the source and target distributions over $\mathcal{X} \times \mathcal{Y}$. For the distributions $\mathcal{D}_S$ and $\mathcal{D}_T$, we define $p_S$ and $p_T$ as the corresponding probability density (or mass) functions. A dataset $\{(x_i, y_i)\}_{i=1}^{n}$ contains $n$ points sampled i.i.d. from $\mathcal{D}_S$. Let $\mathcal{F}$ be a class of hypotheses mapping $\mathcal{X}$ to $\Delta^{k-1}$, where $\Delta^{k-1}$ is the simplex in $k$ dimensions. Given a classifier $f \in \mathcal{F}$ and a datum $(x, y)$, we denote the 0-1 error (i.e., classification error) on that point by $\mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]$. Given a model $f$, our goal in this work is to understand the performance of $f$ on $\mathcal{D}_T$ without access to labeled data from $\mathcal{D}_T$. Note that our goal is not to adapt the model to the target data. Concretely, we aim to predict the accuracy of $f$ on $\mathcal{D}_T$. Throughout this paper, we assume we have access to the following: (i) the model $f$; (ii) previously-unseen (validation) data from $\mathcal{D}_S$; and (iii) unlabeled data from the target distribution $\mathcal{D}_T$.

3.1 Accuracy Estimation: Possibility and Impossibility Results

First, we investigate the question of when it is possible to estimate the target accuracy of an arbitrary classifier, even given knowledge of the full source distribution $p_S(x, y)$ and the target marginal $p_T(x)$. Absent assumptions on the nature of the shift, estimating target accuracy is impossible. Even given access to $p_S(x, y)$ and $p_T(x)$, the problem is fundamentally unidentifiable because $p_T(y \mid x)$ can shift arbitrarily. In the following proposition, we show that absent assumptions on the classifier (i.e., when $f$ can be any classifier in the space of all classifiers on $\mathcal{X}$), we can estimate accuracy on the target data iff assumptions on the nature of the shift, together with $p_S(x, y)$ and $p_T(x)$, uniquely identify the (unknown) target conditional $p_T(y \mid x)$. We relegate proofs from this section to app:proof_setup.

Proposition (prop:characterization). Absent further assumptions, accuracy on the target is identifiable iff $p_T(y \mid x)$ is uniquely identified given $p_S(x, y)$ and $p_T(x)$.

prop:characterization states that we need enough constraints on the nature of the shift such that $p_S(x, y)$ and $p_T(x)$ identify a unique $p_T(y \mid x)$. It also states that under some assumptions on the nature of the shift, we can hope to estimate the model's accuracy on target data. We illustrate this with two common assumptions made in the domain adaptation literature: (i) covariate shift (heckman1977sample; shimodaira2000improving) and (ii) label shift (saerens2002adjusting; zhang2013domain; lipton2018detecting). Under the covariate shift assumption, the target marginal support is a subset of the source marginal support and the conditional distribution of labels given inputs does not change within the support, i.e., $p_T(y \mid x) = p_S(y \mid x)$, which, trivially, identifies a unique target conditional. Under label shift, the reverse holds, i.e., the class-conditional distribution does not change ($p_T(x \mid y) = p_S(x \mid y)$) and, again, this information together with $p_T(x)$ uniquely determines the target conditional $p_T(y \mid x)$ (lipton2018detecting; garg2020unified). In these settings, one can estimate an arbitrary classifier's accuracy on the target domain by importance re-weighting, either with the ratio $p_T(x)/p_S(x)$ in the case of covariate shift or with the ratio $p_T(y)/p_S(y)$ in the case of label shift. While the importance ratios in the former case can be obtained directly when $p_S(x)$ and $p_T(x)$ are known, the importance ratios in the latter case can be obtained by using techniques from saerens2002adjusting; lipton2018detecting; azizzadenesheli2019regularized; alexandari2019adapting. In app:estimate_label_covariate, we explore accuracy estimation in the setting of these shifts and present extensions to generalized notions of label shift (tachet2020domain) and covariate shift (rojas2018invariant).

As a corollary of prop:characterization, we now present a simple impossibility result, demonstrating that no single method can work for all families of distribution shift.

Corollary (corollary:impossible). Absent assumptions on the classifier $f$, no method of estimating accuracy will work in all scenarios, i.e., under distribution shifts of different natures.

Intuitively, this result states that every method of estimating accuracy on target data is tied to some assumption on the nature of the shift and might not be useful for estimating accuracy under a different assumption on the nature of the shift. For illustration, consider a setting where we have access to the distributions $p_S(x, y)$ and $p_T(x)$. Additionally, assume that the distribution can shift only due to covariate shift or label shift, without any knowledge of which one. Then corollary:impossible says that it is impossible to have a single method that works simultaneously for both label shift and covariate shift, as in the following example (we spell out the details in app:proof_setup):

Example 1. Assume binary classification with , , , and where , , and . The error of a classifier on target data is given by under covariate shift and by under label shift. In app:proof_setup, we show that for all . Thus, given access to , and , any method that consistently estimates the error of a classifier under covariate shift will give an incorrect estimate of the error under label shift and vice-versa. The reason is that the same and can correspond to one error value (under covariate shift) or another (under label shift), and determining which scenario one faces requires further assumptions on the nature of the shift.

4 Predicting accuracy with Average Thresholded Confidence

In this section, we present our method, ATC, which leverages a black-box classifier $f$ and (labeled) validation source data to predict accuracy on the target domain, given access to unlabeled target data. Throughout the discussion, we assume that the classifier $f$ is fixed.

Before presenting our method, we introduce some terminology. Define a score function $s: \Delta^{k-1} \to \mathbb{R}$ that takes in the softmax prediction of the function $f$ and outputs a scalar. We want a score function such that if it takes a high value at a datum $x$, then $f$ is likely to be correct on $x$. In this work, we explore two such score functions: (i) Maximum confidence, i.e., $s(f(x)) = \max_j f_j(x)$; and (ii) Negative Entropy, i.e., $s(f(x)) = \sum_j f_j(x) \log f_j(x)$. Our method identifies a threshold $t$ on source data such that the expected fraction of points that obtain a score less than $t$ matches the error of $f$ on $\mathcal{D}_S$, i.e.,

$\mathbb{E}_{x \sim \mathcal{D}_S}\left[\mathbb{1}\left[s(f(x)) < t\right]\right] \;=\; \mathbb{E}_{(x, y) \sim \mathcal{D}_S}\left[\mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]\right], \qquad (1)$

and then our error estimate on the target domain is given by the expected fraction of target points that obtain a score less than $t$, i.e.,

$\widehat{\mathcal{E}}_{\mathcal{D}_T}(f) \;=\; \mathbb{E}_{x \sim \mathcal{D}_T}\left[\mathbb{1}\left[s(f(x)) < t\right]\right]. \qquad (2)$

In short, in (1), ATC selects a threshold $t$ on the score function such that the fraction of source points receiving a score below $t$ matches the error on the source domain, and in (2), ATC predicts the error on the target domain as the fraction of unlabeled target points that obtain a score below that threshold $t$. Note that, in principle, there exists a different threshold on the target distribution such that the analogue of (1) is satisfied on $\mathcal{D}_T$. However, in our experiments, the same threshold performs remarkably well. The main empirical contribution of our work is to show that the threshold obtained with (1) can be used effectively in conjunction with modern deep networks in a wide range of settings to estimate error on the target data. In practice, to obtain the threshold with ATC, we minimize the difference between the two sides of (1) using finite samples. In the next section, we show that ATC precisely predicts accuracy on OOD data along the desired $x = y$ line. In app:interpretation, we discuss an alternate interpretation of the method and make connections with OOD detection methods.
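
To make the procedure concrete, the following is a minimal NumPy sketch of the two steps in (1) and (2), assuming softmax outputs are available as arrays; the function and variable names are ours for illustration and do not correspond to a released implementation.

```python
import numpy as np

def score(probs, kind="neg_entropy"):
    """Score each softmax row: maximum confidence or negative entropy."""
    if kind == "max_conf":
        return probs.max(axis=1)
    return np.sum(probs * np.log(probs + 1e-12), axis=1)  # negative entropy

def atc_threshold(val_probs, val_labels, kind="neg_entropy"):
    """Pick t so that the fraction of validation scores below t equals the
    validation error: a finite-sample version of Eq. (1)."""
    s = np.sort(score(val_probs, kind))
    err = np.mean(val_probs.argmax(axis=1) != val_labels)
    idx = min(int(np.floor(err * len(s))), len(s) - 1)
    return s[idx]

def atc_accuracy(target_probs, t, kind="neg_entropy"):
    """Predicted target accuracy: fraction of target scores above t, as in Eq. (2)."""
    return float(np.mean(score(target_probs, kind) >= t))
```

In practice, the softmax outputs fed to these functions would first be calibrated on validation source data, as described in sec:exp.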

5 Experiments

Figure 2: Scatter plot of predicted accuracy versus (true) OOD accuracy. Each point denotes a different OOD dataset, all evaluated with the same DenseNet121 model. We only plot the best three methods. With ATC (ours), we refer to ATC-NE. We observe that ATC significantly outperforms other methods, and with ATC we recover the desired $x = y$ line with a robust linear fit. Aggregated estimation error is in table:error_estimation, and plots for other datasets and architectures are in app:results.

We now empirically evaluate ATC and compare it with existing methods. In each of our main experiments, keeping the underlying model fixed, we vary the target dataset and predict the target accuracy with various methods given access to only unlabeled data from the target. Unless noted otherwise, all models are trained only on samples from the source distribution, with the main exception of pre-training on a different distribution. We use labeled examples from the target distribution only to obtain true error estimates.

Datasets. First, we consider synthetic shifts induced by different visual corruptions (e.g., shot noise, motion blur, etc.) in ImageNet-C (hendrycks2019benchmarking). Next, we consider natural shifts due to differences in the data collection process of ImageNet (russakovsky2015imagenet), e.g., ImageNet-v2 (recht2019imagenet). We also consider images with artistic renditions of object classes, i.e., ImageNet-R (hendrycks2021many), and ImageNet-Sketch (wang2019learning). Note that the renditions dataset only contains a subset of ImageNet classes. To include the renditions dataset in our testbed, we include results on ImageNet restricted to these classes (which we call ImageNet-200) along with the full ImageNet.

Second, we consider Breeds (santurkar2020breeds) to assess robustness to subpopulation shifts, in particular, to understand how accuracy estimation methods behave when novel subpopulations not observed during training are introduced. Breeds leverages the class hierarchy in ImageNet to create four datasets: Entity-13, Entity-30, Living-17, and Non-living-26. We focus on natural and synthetic shifts (as in ImageNet) on the same and different subpopulations in Breeds. Third, from the Wilds (wilds2021) benchmark, we consider FMoW-Wilds (christie2018functional), RxRx1-Wilds (taylor2019rxrx1), Amazon-Wilds (ni2019justifying), and CivilComments-Wilds (borkan2019nuanced) to cover distribution shifts faced in the wild.

Finally, similar to ImageNet, we consider (i) synthetic shifts (CIFAR-10-C) due to common corruptions; and (ii) a natural shift (i.e., CIFARv2 (recht2018cifar)) on CIFAR-10 (krizhevsky2009learning). On CIFAR-100, we only have synthetic shifts due to common corruptions. For completeness, we also consider natural shifts on MNIST (lecun1998mnist) as in prior work (deng2021labels). We use three real shifted datasets, i.e., USPS (hull1994database), SVHN (netzer2011reading), and QMNIST (qmnist-2019). We give a detailed overview of our setup in app:dataset.

Architectures and Evaluation. For ImageNet, Breeds, CIFAR, FMoW-Wilds, RxRx1-Wilds datasets, we use DenseNet121 (huang2017densely) and ResNet50 (he2016deep) architectures. For Amazon-Wilds and CivilComments-Wilds, we fine-tune a DistilBERT-base-uncased (Sanh2019DistilBERTAD) model.

For MNIST, we train a fully connected multilayer perceptron. We use standard training with benchmarked hyperparameters. To compare methods, we report the average absolute difference between the true accuracy on the target data and the estimated accuracy on the same unlabeled examples. We refer to this metric as Mean Absolute estimation Error (MAE). Along with MAE, we also show scatter plots to visualize performance on individual target sets. Refer to app:exp_setup for additional details on the setup.
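
For concreteness, a small sketch of the MAE computation we report; the numbers below are hypothetical and purely for illustration.

```python
import numpy as np

def mean_absolute_estimation_error(true_acc, est_acc):
    """MAE: mean absolute gap between true and estimated accuracy over target sets."""
    return float(np.mean(np.abs(np.asarray(true_acc) - np.asarray(est_acc))))

# Three hypothetical target sets (illustrative numbers only).
print(mean_absolute_estimation_error([0.62, 0.71, 0.55], [0.65, 0.70, 0.50]))
```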

Table 1 (layout): rows list the datasets grouped by the nature of shift (CIFAR10: natural, synthetic; CIFAR100: synthetic; ImageNet200: natural, synthetic; ImageNet: natural, synthetic; FMoW-wilds: natural; RxRx1-wilds: natural; Amazon-wilds: natural; CivilCom.-wilds: natural; MNIST: natural; Entity-13, Entity-30, Nonliving-26, Living-17: same and novel subpopulations); columns report MAE for IM, AC, DOC, GDE, ATC-MC (Ours), and ATC-NE (Ours), each before (Pre T) and after (Post T) temperature scaling, with GDE reported Post T only.

Table 1: Mean Absolute estimation Error (MAE) results for different datasets in our setup, grouped by the nature of the shift. 'Same' refers to same-subpopulation shifts and 'Novel' refers to novel-subpopulation shifts. We include details about the target sets considered in each shift in table:dataset. 'Post T' denotes use of TS calibration on source. Across all datasets, we observe that ATC achieves superior performance (lower MAE is better). For language datasets, we use DistilBERT-base-uncased; for vision datasets, we report results with the DenseNet model, with the exception of MNIST, where we use an FCN. We include results on other architectures in app:results. For GDE, Post T and Pre T estimates match since TS doesn't alter the argmax prediction. Results are reported by aggregating MAE numbers over different seeds. We include results with standard deviation values in table:error_estimation_std.

Methods. With ATC-NE, we denote ATC with the negative entropy score function, and with ATC-MC, we denote ATC with the maximum confidence score function. Below we briefly discuss the baseline methods compared in our work and relegate details to app:baselines. For all methods, we implement post-hoc calibration on validation source data with Temperature Scaling (TS; guo2017calibration).
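
For reference, a minimal sketch of this temperature scaling step, fitting a single temperature on validation-source logits by minimizing negative log-likelihood; helper names are ours, and this is only one of several equivalent ways to implement TS.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels):
    """Find T > 0 minimizing the NLL of softmax(logits / T) on validation source data."""
    def nll(T):
        probs = softmax(val_logits / T)
        return -np.mean(np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12))
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

def calibrate(logits, T):
    """Temperature-scaled probabilities, used before computing scores or confidences."""
    return softmax(logits / T)
```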

Average Confidence (AC). Error is estimated as the expected value of one minus the maximum softmax confidence on the target data, i.e., $\widehat{\mathcal{E}}_{\mathrm{AC}} = \mathbb{E}_{x \sim \mathcal{D}_T}\left[1 - \max_j f_j(x)\right]$.

Difference Of Confidence (DOC). We estimate error on the target by subtracting the difference of confidences on source and target (as a surrogate for distributional distance; guillory2021predicting) from the error on the source distribution, i.e., $\widehat{\mathcal{E}}_{\mathrm{DOC}} = \mathbb{E}_{(x, y) \sim \mathcal{D}_S}\left[\mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]\right] + \mathbb{E}_{x \sim \mathcal{D}_S}\left[\max_j f_j(x)\right] - \mathbb{E}_{x \sim \mathcal{D}_T}\left[\max_j f_j(x)\right]$. This is referred to as DOC-Feat in guillory2021predicting.

Importance re-weighting (IM). We estimate the error of the classifier with importance re-weighting of the 0-1 error in the pushforward space of the classifier. This corresponds to Mandoline (chen2021mandoline) using a single slice based on the underlying classifier confidence.

Generalized Disagreement Equality (GDE). Error is estimated as the expected disagreement of two models (trained on the same training set but with different randomization) on target data (jiang2021assessing), i.e., $\widehat{\mathcal{E}}_{\mathrm{GDE}} = \mathbb{E}_{x \sim \mathcal{D}_T}\left[\mathbb{1}\left[\arg\max_j f_j(x) \ne \arg\max_j f^\prime_j(x)\right]\right]$, where $f$ and $f^\prime$ are the two models. Note that GDE requires two models trained independently, doubling the computational overhead of training.
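
The sketches below restate AC, DOC, and GDE in terms of (calibrated) softmax outputs, following the formulas above; IM is omitted since it additionally requires a binning scheme. Names are ours for illustration.

```python
import numpy as np

def ac_error(target_probs):
    """AC: estimated error is one minus the average maximum softmax confidence."""
    return float(np.mean(1.0 - target_probs.max(axis=1)))

def doc_error(source_probs, source_labels, target_probs):
    """DOC: source error plus the drop in average confidence from source to target."""
    source_err = np.mean(source_probs.argmax(axis=1) != source_labels)
    conf_gap = np.mean(source_probs.max(axis=1)) - np.mean(target_probs.max(axis=1))
    return float(source_err + conf_gap)

def gde_error(target_probs_a, target_probs_b):
    """GDE: expected disagreement between two independently trained models."""
    return float(np.mean(target_probs_a.argmax(axis=1) != target_probs_b.argmax(axis=1)))
```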

5.1 Results

Figure 3: Left: predicted accuracy with DOC on the Living17 Breeds dataset. We observe a substantial gap in the linear fits for the same and different subpopulations, highlighting poor correlation. Middle: after fitting a robust linear model for DOC on the same subpopulation, we show predicted accuracy on different subpopulations with fine-tuned DOC (i.e., DOC (w/ fit)) and compare with ATC without any regression model, i.e., ATC (w/o fit). While we observe substantial improvements in MAE from DOC (w/o fit) to DOC (w/ fit), ATC (w/o fit) continues to outperform even DOC (w/ fit). We show parallel results with other Breeds datasets in app:breeeds_ablation. Right: empirical validation of our toy model. We show that ATC perfectly estimates target performance as we vary the degree of spurious correlation in the target. A separate marker indicates accuracy on the source.

In table:error_estimation, we report MAE results aggregated by the nature of the shift in our testbed. In fig:scatter_plot and fig:intro(right), we show scatter plots for predicted accuracy versus OOD accuracy on several datasets. We include scatter plots for all datasets and parallel results with other architectures in app:results. In app:cifar_result, we also perform ablations on CIFAR using a pre-trained model and observe that pre-training doesn’t change the efficacy of ATC.

We predict accuracy on the target data before and after calibration with TS. First, we observe that both ATC-NE and ATC-MC (even without TS) obtain significantly lower MAE when compared with other methods (even with TS). Note that with TS we observe substantial improvements in MAE for all methods. Overall, ATC-NE (with TS) typically achieves the smallest MAE, improving over GDE (the next best alternative to ATC) on both CIFAR and ImageNet. Alongside, we also observe that a linear fit with robust regression (siegel1982robust) on the scatter plot recovers a line close to $x = y$ for ATC-NE with TS, while the line is far from $x = y$ for other methods (fig:scatter_plot and fig:intro(right)). Remarkably, MAE with ATC remains small for CIFAR, ImageNet, MNIST, and Wilds. However, MAE is much higher on the Breeds benchmark with novel subpopulations. While we observe a small MAE (i.e., comparable to our observations on other datasets) on Breeds with natural and synthetic shifts from the same subpopulation, MAE on shifts with novel subpopulations is significantly higher for all methods. Note that even on novel subpopulations, ATC continues to dominate all other methods across all datasets in Breeds.

Additionally, for different subpopulations in the Breeds setup, we observe a poor linear correlation of the estimated performance with the actual performance, as shown in fig:ablation (left) (we notice a similar gap in the linear fit for all other methods). Hence, in such a setting, we would expect methods that fine-tune a regression model on labeled target examples from shifts with one subpopulation to perform poorly on shifts with different subpopulations. Corroborating this intuition, next, we show that even after fitting a regression model for DOC on natural and synthetic shifts with source subpopulations, ATC without a regression model continues to outperform DOC with a regression model on shifts with novel subpopulations.

Fitting a regression model on Breeds with DOC. Using labeled target data from natural and synthetic shifts for the same subpopulation (same as source), we fit a robust linear regression model (siegel1982robust) to fine-tune DOC as in guillory2021predicting. We then evaluate the fine-tuned DOC (i.e., DOC with a linear model) on natural and synthetic shifts from novel subpopulations in the Breeds benchmark. Although we observe significant improvements in the performance of fine-tuned DOC when compared with DOC (without any fine-tuning), ATC without any regression model continues to perform better than (or similar to) fine-tuned DOC on novel subpopulations (fig:ablation (middle)). Refer to app:breeeds_ablation for details and table:breeds_regression for MAE on Breeds with the regression model.
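
As a rough sketch of this fitting step, one can use Siegel repeated-medians regression (scipy.stats.siegelslopes, in the spirit of siegel1982robust) to map raw DOC accuracy estimates on same-subpopulation shifts to true accuracy and then apply that fit on novel subpopulations; the names and interfaces here are ours for illustration.

```python
import numpy as np
from scipy.stats import siegelslopes

def fit_doc_correction(doc_estimates_same, true_acc_same):
    """Robust linear fit mapping DOC estimates to true accuracy on same-subpopulation shifts."""
    slope, intercept = siegelslopes(true_acc_same, doc_estimates_same)
    return slope, intercept

def apply_doc_correction(doc_estimates_novel, slope, intercept):
    """Apply the fitted linear model to DOC estimates on novel subpopulations."""
    return slope * np.asarray(doc_estimates_novel) + intercept
```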

6 Investigating ATC on Toy Model

In this section, we propose and analyze a simple theoretical model that distills empirical phenomena from the previous section and highlights the efficacy of ATC. Here, our aim is not to obtain a general model that captures complicated real distributions on high-dimensional input spaces such as ImageNet images. Instead, to further our understanding, we focus on an easy-to-learn binary classification task from nagarajan2020understanding with linear classifiers that is rich enough to exhibit some of the same phenomena as deep networks on real data distributions.

Consider an easy-to-learn binary classification problem with two features $x = (x_{\mathrm{inv}}, x_{\mathrm{sp}})$, where $x_{\mathrm{inv}}$ is a fully predictive invariant feature with a margin $\gamma$ and $x_{\mathrm{sp}}$ is a spurious feature (i.e., a feature that is correlated but not predictive of the true label). Conditional on $y$, the distribution over $x_{\mathrm{inv}}$ is given as follows: $x_{\mathrm{inv}} \mid (y = 1) \sim \mathrm{Unif}[\gamma, c]$ and $x_{\mathrm{inv}} \mid (y = -1) \sim \mathrm{Unif}[-c, -\gamma]$, where $c$ is a fixed constant greater than $\gamma$. For simplicity, we assume that the label distribution on the source is uniform on $\{-1, 1\}$. $x_{\mathrm{sp}}$ is distributed such that $P(x_{\mathrm{sp}} = y \mid y) = p_{\mathrm{sp}}$, where $p_{\mathrm{sp}}$ controls the degree of spurious correlation. To model distribution shift, we simulate target data with a different degree of spurious correlation, i.e., $p^\prime_{\mathrm{sp}} \ne p_{\mathrm{sp}}$ in the target distribution. Note that here we do not consider shifts in the label distribution, but our result extends to arbitrary shifts in the label distribution as well.

In this setup, we examine linear sigmoid classifiers of the form $f(x) = \left(\sigma(w^\top x),\, 1 - \sigma(w^\top x)\right)$, where $\sigma$ is the sigmoid function and $w = (w_{\mathrm{inv}}, w_{\mathrm{sp}})$. While there exists a linear classifier with $w_{\mathrm{sp}} = 0$ that correctly classifies all the points with a margin $\gamma$, nagarajan2020understanding demonstrated that a linear classifier trained on finite data will typically have a dependency on the spurious feature, i.e., $w_{\mathrm{sp}} > 0$. They show that due to geometric skews, despite having a positive dependency on the invariant feature, a max-margin classifier trained on finite samples relies on the spurious feature. Refer to app:toy_model for more details on these skews. In our work, we show that given a linear classifier that relies on the spurious feature and achieves non-trivial performance on the source (i.e., $w_{\mathrm{inv}} > 0$), ATC with the maximum confidence score function consistently estimates the accuracy on the target distribution.

[Informal] Given any classifier with $w_{\mathrm{inv}} > 0$ in the above setting, the threshold obtained in (1) together with ATC as in (2) with the maximum confidence score function obtains a consistent estimate of the target accuracy.

Consider a classifier that depends positively on the spurious feature (i.e., $w_{\mathrm{sp}} > 0$). Then, as the spurious correlation decreases in the target data, the classifier's accuracy on the target will drop, and vice-versa if the spurious correlation increases in the target data. thm:toy_theory shows that the threshold identified with ATC as in (1) remains invariant as the distribution shifts and hence ATC as in (2) will correctly estimate the accuracy under the shifting distribution. Next, we illustrate thm:toy_theory by simulating the setup empirically. First we pick an arbitrary classifier (which can also be obtained by training on source samples), tune the threshold on held-out source examples, and predict accuracy with different methods as we shift the distribution by varying the degree of spurious correlation.

Empirical validation and comparison with other methods. fig:ablation(right) shows that as the degree of spurious correlation varies, our method accurately estimates the target performance, whereas all other methods fail to do so. Understandably, due to poor calibration of the sigmoid linear classifier, AC, DOC, and GDE fail. While in principle IM can perfectly estimate the accuracy on the target in this case, we observe that it is highly sensitive to the number of bins and the choice of histogram binning (i.e., uniform mass or equal width binning). We elaborate more on this in app:toy_model.
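
A self-contained sketch of this simulation under our assumed parameterization; the values of $\gamma$, $c$, $p_{\mathrm{sp}}$, and the classifier weights below are illustrative choices, not the exact ones used for fig:ablation(right).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, p_sp, gamma=0.1, c=1.0):
    """x_inv | y is uniform on y*[gamma, c]; x_sp = y w.p. p_sp and -y otherwise."""
    y = rng.choice([-1, 1], size=n)
    x_inv = y * rng.uniform(gamma, c, size=n)
    x_sp = np.where(rng.random(n) < p_sp, y, -y)
    return np.stack([x_inv, x_sp], axis=1), y

def max_conf(X, w):
    p = 1.0 / (1.0 + np.exp(-X @ w))     # sigmoid probability of class +1
    return np.maximum(p, 1.0 - p)

def accuracy(X, y, w):
    return np.mean(np.sign(X @ w) == y)

w = np.array([1.0, 0.5])                  # positive reliance on the spurious feature
Xs, ys = sample(50_000, p_sp=0.9)         # source: strong spurious correlation
err_s = 1.0 - accuracy(Xs, ys, w)
t = np.quantile(max_conf(Xs, w), err_s)   # ATC threshold as in Eq. (1)

for p_sp_target in [0.9, 0.7, 0.5, 0.3, 0.1]:
    Xt, yt = sample(50_000, p_sp=p_sp_target)
    est = np.mean(max_conf(Xt, w) >= t)   # ATC estimate as in Eq. (2)
    print(p_sp_target, round(float(accuracy(Xt, yt, w)), 3), round(float(est), 3))
```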

Biased estimation with ATC. Now we discuss changes to the above setup under which ATC yields inconsistent estimates. We assumed that, in both the source and the target, $x_{\mathrm{inv}} \mid (y = 1)$ is uniform on $[\gamma, c]$ and $x_{\mathrm{inv}} \mid (y = -1)$ is uniform on $[-c, -\gamma]$. Shifting the support of the target class-conditional distribution may introduce a bias in ATC estimates; e.g., shrinking the support of $x_{\mathrm{inv}}$ in the target (while maintaining a uniform distribution) will lead to an over-estimation of the target performance with ATC. In app:general_result, we elaborate on this failure and present a general (but less interpretable) classifier-dependent distribution shift condition under which ATC is guaranteed to yield consistent estimates.

7 Conclusion and future work

In this work, we proposed ATC, a simple method for estimating target-domain accuracy based on unlabeled target data (and labeled source data). ATC achieves remarkably low estimation error on several synthetic and natural shift benchmarks in our experiments. Notably, our work draws inspiration from recent state-of-the-art methods that use softmax confidences below a certain threshold for OOD detection (hendrycks2016baseline; hendrycks2019scaling) and takes a step forward in answering questions raised in deng2021labels about the practicality of threshold-based methods.

Our distribution shift toy model justifies ATC on an easy-to-learn binary classification task. In our experiments, we also observe that calibration significantly improves estimation with ATC. Since in binary classification post-hoc calibration with TS does not change the effective threshold, in future work we hope to extend our theoretical model to multi-class classification to understand the efficacy of calibration. Our theory establishes that a classifier's accuracy is not, in general, identified from labeled source and unlabeled target data alone, absent considerable additional constraints on the target conditional $p_T(y \mid x)$. In light of this finding, we also hope to extend our understanding beyond the simple theoretical toy model to characterize broader sets of conditions under which ATC might be guaranteed to obtain consistent estimates. Finally, we should note that while ATC outperforms previous approaches, it still suffers from large estimation error on datasets with novel subpopulations, e.g., Breeds. We hope that our findings can lay the groundwork for future work on improving accuracy estimation on such datasets.

Reproducibility Statement

We have been careful to ensure that our results are reproducible. We have stored all models and logged all hyperparameters and seeds to facilitate reproducibility. Note that throughout our work, we do not perform any hyperparameter tuning; instead, we use benchmarked hyperparameters and training procedures to make our results easy to reproduce. While we have not released the code yet, the appendix provides all the necessary details to replicate our experiments and results. Moreover, we plan to release the code with a revised version of the manuscript.

References

Appendix

Appendix A Proofs from  sec:setup

Before proving results from sec:setup, we introduce some notation. We express the population error of a classifier $f$ on a distribution $\mathcal{D}$ as $\mathcal{E}_{\mathcal{D}}(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]\right]$.

Proof of prop:characterization.

Consider a binary classification problem. Let $\Pi$ be the set of possible target conditional distributions of labels given inputs, i.e., the set of conditionals $p_T(y \mid x)$ consistent with the given $p_S(x, y)$ and $p_T(x)$.

The forward direction is simple. If $\Pi$ is a singleton given $p_S(x, y)$ and $p_T(x)$, then the error of any classifier $f$ on the target domain is identified and is given by

$\mathcal{E}_{\mathcal{D}_T}(f) \;=\; \mathbb{E}_{x \sim p_T(x)}\left[\sum_{y \in \mathcal{Y}} p_T(y \mid x)\, \mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]\right]. \qquad (3)$

For the reverse direction, assume that given $p_S(x, y)$ and $p_T(x)$, we have two possible conditionals $p^{(1)}_T(y \mid x)$ and $p^{(2)}_T(y \mid x)$ with $p^{(1)}_T \ne p^{(2)}_T$, such that the two differ on some set $\mathcal{X}^\prime$ with $p_T(\mathcal{X}^\prime) > 0$. Consider $\mathcal{X}^\prime$ to be the set of all input covariates where the two conditionals differ. We will now choose a classifier $f$ such that the error under the two conditionals differs. On the subset $\mathcal{X}^\prime_1 = \{x \in \mathcal{X}^\prime : p^{(1)}_T(y = 1 \mid x) > p^{(2)}_T(y = 1 \mid x)\}$, let $f$ predict the label $-1$, and on the subset $\mathcal{X}^\prime_2 = \mathcal{X}^\prime \setminus \mathcal{X}^\prime_1$, let $f$ predict the label $1$. We will show that the error of $f$ under $p^{(1)}_T$ is strictly greater than the error of $f$ under $p^{(2)}_T$. Formally,

$\mathcal{E}_{\mathcal{D}^{(1)}_T}(f) - \mathcal{E}_{\mathcal{D}^{(2)}_T}(f) \;=\; \mathbb{E}_{x \sim p_T(x)}\Big[\big(p^{(1)}_T(y = 1 \mid x) - p^{(2)}_T(y = 1 \mid x)\big)\, \mathbb{1}\left[\hat y_f(x) = -1\right] + \big(p^{(1)}_T(y = -1 \mid x) - p^{(2)}_T(y = -1 \mid x)\big)\, \mathbb{1}\left[\hat y_f(x) = 1\right]\Big] \;>\; 0,$

where $\hat y_f(x)$ denotes the label predicted by $f$ at $x$, and the last step follows by construction of the sets $\mathcal{X}^\prime_1$ and $\mathcal{X}^\prime_2$. Since $\mathcal{E}_{\mathcal{D}^{(1)}_T}(f) \ne \mathcal{E}_{\mathcal{D}^{(2)}_T}(f)$, given only the information of $p_S(x, y)$ and $p_T(x)$ it is impossible to distinguish the two values of the error with the classifier $f$. Thus, we obtain a contradiction with the assumption that the accuracy is identifiable. Hence, we must pose restrictions on the nature of the shift such that $\Pi$ is a singleton in order to identify accuracy on the target. ∎

Proof of corollary:impossible.

The corollary follows directly from prop:characterization. Since two different target conditional distributions can lead to different error estimates absent assumptions on the classifier, no method can estimate two different quantities from the same given information. We illustrate this in Example 1 next. ∎

Appendix B Estimating accuracy in covariate shift or label shift

Accuracy estimation under the covariate shift assumption. Under the assumption that $p_T(y \mid x) = p_S(y \mid x)$, the error (and hence accuracy) on the target domain can be estimated as follows:

$\mathcal{E}_{\mathcal{D}_T}(f) \;=\; \mathbb{E}_{(x, y) \sim \mathcal{D}_T}\left[\mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]\right] \qquad (4)$

$\phantom{\mathcal{E}_{\mathcal{D}_T}(f)} \;=\; \mathbb{E}_{(x, y) \sim \mathcal{D}_S}\left[\frac{p_T(x)}{p_S(x)}\, \mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]\right]. \qquad (5)$

Given access to $p_S(x)$ and $p_T(x)$, one can directly estimate the expression in (5).

Accuracy estimation under the label shift assumption. Under the assumption that $p_T(x \mid y) = p_S(x \mid y)$, the error on the target domain can be estimated as follows:

$\mathcal{E}_{\mathcal{D}_T}(f) \;=\; \mathbb{E}_{(x, y) \sim \mathcal{D}_T}\left[\mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]\right] \qquad (6)$

$\phantom{\mathcal{E}_{\mathcal{D}_T}(f)} \;=\; \mathbb{E}_{(x, y) \sim \mathcal{D}_S}\left[\frac{p_T(y)}{p_S(y)}\, \mathbb{1}\left[\arg\max_j f_j(x) \ne y\right]\right]. \qquad (7)$
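
A minimal sketch of the two re-weighted estimators, assuming the relevant ratios are given (density ratios $p_T(x)/p_S(x)$ evaluated at the source points for (5), and label ratios $p_T(y)/p_S(y)$ for (7)); names are ours for illustration.

```python
import numpy as np

def error_covariate_shift(density_ratio, source_preds, source_labels):
    """Eq. (5): re-weight source 0-1 errors by p_T(x) / p_S(x) at each source point."""
    mistakes = (source_preds != source_labels).astype(float)
    return float(np.mean(density_ratio * mistakes))

def error_label_shift(label_ratio, source_preds, source_labels):
    """Eq. (7): re-weight source 0-1 errors by p_T(y) / p_S(y), indexed by the true label.
    Labels here are class indices 0, ..., k-1."""
    mistakes = (source_preds != source_labels).astype(float)
    return float(np.mean(label_ratio[source_labels] * mistakes))
```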

Estimating importance ratios is straightforward under the covariate shift assumption when the distributions $p_S(x)$ and $p_T(x)$ are known. For label shift, one can leverage the moment matching approach called BBSE (lipton2018detecting) or the likelihood minimization approach MLLS (garg2020unified). Below we discuss the objective of MLLS:

$w^{\ast} \;=\; \arg\max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathcal{D}_T}\left[\log\left(\sum_{y \in \mathcal{Y}} w_y\, p_S(y \mid x)\right)\right], \qquad (8)$

where $\mathcal{W} = \{w : w \ge 0 \text{ and } \sum_{y} w_y\, p_S(y) = 1\}$. The MLLS objective is guaranteed to obtain consistent estimates for the importance ratios under the following condition. [Theorem 1 (garg2020unified)] If the distributions $\{p_S(x \mid y)\}_{y \in \mathcal{Y}}$ are strictly linearly independent, then $w^{\ast}$ with $w^{\ast}_y = p_T(y)/p_S(y)$ is the unique maximizer of the MLLS objective (8). We refer the interested reader to garg2020unified for details.
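
For the moment-matching route, a simplified BBSE-style sketch estimates $w_y = p_T(y)/p_S(y)$ by inverting the source confusion matrix against the target predicted label distribution (lipton2018detecting); this omits the regularization and diagnostics of the full method, and the names are ours.

```python
import numpy as np

def bbse_weights(source_preds, source_labels, target_preds, num_classes):
    """Solve C w = mu_T, where C[i, j] = P_S(pred = i, label = j)
    and mu_T[i] = P_T(pred = i)."""
    C = np.zeros((num_classes, num_classes))
    np.add.at(C, (source_preds, source_labels), 1.0)
    C /= len(source_labels)
    mu_T = np.bincount(target_preds, minlength=num_classes) / len(target_preds)
    w, *_ = np.linalg.lstsq(C, mu_T, rcond=None)
    return np.clip(w, 0.0, None)   # importance ratios are non-negative
```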

The above results for accuracy estimation under label shift and covariate shift can be extended to generalized label shift and covariate shift settings. Assume a function $h: \mathcal{X} \to \mathcal{Z}$ such that $y$ is independent of $x$ given $h(x)$. In other words, $h(x)$ contains all the information needed to predict the label $y$. With the help of $h$, we can extend estimation to the following settings: (i) generalized covariate shift, i.e., $p_S(y \mid h(x)) = p_T(y \mid h(x))$ and the support of $p_T(h(x))$ is contained in the support of $p_S(h(x))$; (ii) generalized label shift, i.e., $p_S(h(x) \mid y) = p_T(h(x) \mid y)$ for all $y$. By simply replacing $x$ with $h(x)$ in (5) and (8), we obtain consistent error estimates under these generalized conditions.

Proof of Example 1.

Under covariate shift using (5), we get

Under label shift using (7), we get

Then is given by

If , then and if , then . Since for arbitrary , given access to $p_S(x, y)$, $p_T(x)$, and $f$, any method that consistently estimates the error under covariate shift will give an incorrect estimate of the error under label shift and vice-versa. The reason is that the same $p_S(x, y)$ and $p_T(x)$ can correspond to one value of the error (under covariate shift) or another (under label shift), and the two are not discernible absent further assumptions on the nature of the shift. ∎

Appendix C Alternate interpretation of ATC

Consider the following framework: Given a datum , define a binary classification problem of whether the model prediction was correct or incorrect. In particular, if the model prediction matches the true label, then we assign a label 1 (positive) and conversely, if the model prediction doesn’t match the true label then we assign a label 0 (negative).

Our method can be interpreted as identifying examples with correct and incorrect predictions based on the value of the score function $s$: if the score is greater than or equal to the threshold $t$, then our method predicts that the classifier correctly predicted the datum $x$, and vice-versa if the score is less than $t$. A method that could solve this task perfectly would perfectly estimate the target performance. However, such an expectation is unrealistic. Instead, ATC expects that most of the examples with scores above the threshold are predicted correctly and most of the examples below the threshold are predicted incorrectly. More importantly, ATC selects a threshold such that the number of falsely identified correct predictions matches the number of falsely identified incorrect predictions on the source distribution, thereby balancing the two kinds of mistakes. We expect useful estimates of accuracy with ATC if the threshold transfers to the target, i.e., if the number of falsely identified correct predictions matches the number of falsely identified incorrect predictions on the target. This interpretation relates our method to the OOD detection literature, where hendrycks2016baseline; hendrycks2019scaling highlight that classifiers tend to assign higher confidence to in-distribution examples and leverage maximum softmax confidence (or logit) to perform OOD detection.
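
Under this interpretation, a short sketch (names ours) makes the balance explicit: at the threshold chosen on source data, the number of predictions falsely flagged as correct roughly matches the number falsely flagged as incorrect, which is exactly why the fraction above the threshold tracks the accuracy.

```python
import numpy as np

def flag_balance(scores, is_correct, t):
    """Count predictions falsely flagged as correct (score >= t but wrong)
    and falsely flagged as incorrect (score < t but right)."""
    flagged = scores >= t
    false_correct = int(np.sum(flagged & ~is_correct))
    false_incorrect = int(np.sum(~flagged & is_correct))
    return false_correct, false_incorrect
```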

Appendix D Details on the Toy Model

Figure 4: Illustration of the toy model. (a) Source data. (b) Target data with one degree of spurious correlation. (c) Target data with a different degree of spurious correlation. (d) Margin of the classifier in the minority group in the source data: as the sample size increases, the margin saturates to the true margin $\gamma$.

Skews observed in this toy model. In fig:toy_model, we illustrate the toy model used in our empirical experiment. In the same setup, we empirically observe that the margin on the population with lower density is large, i.e., the margin is much greater than $\gamma$ when the number of observed samples is small (fig:toy_model (d)). Building on this observation, nagarajan2020understanding showed that in cases where the margin decreases with the number of samples, a max-margin classifier trained on finite samples is bound to depend on the spurious feature. They referred to this skew as a geometric skew.

Moreover, even when the number of samples is large, so that we do not observe geometric skews, nagarajan2020understanding showed that when training for a finite number of epochs, a linear classifier will have a non-zero dependency on the spurious feature. They referred to this skew as a statistical skew. Due to both of these skews, a linear classifier obtained by training for finitely many steps on finitely many training samples will have a non-zero dependency on the spurious feature. We refer the interested reader to nagarajan2020understanding for more details.

Proof of thm:toy_theory. Recall, we consider an easy-to-learn binary classification problem with two features $x = (x_{\mathrm{inv}}, x_{\mathrm{sp}})$, where $x_{\mathrm{inv}}$ is a fully predictive invariant feature with a margin $\gamma$ and $x_{\mathrm{sp}}$ is a spurious feature (i.e., a feature that is correlated but not predictive of the true label). Conditional on $y$, the distribution over $x_{\mathrm{inv}}$ is given as follows:

$x_{\mathrm{inv}} \mid (y = 1) \sim \mathrm{Unif}[\gamma, c] \quad \text{and} \quad x_{\mathrm{inv}} \mid (y = -1) \sim \mathrm{Unif}[-c, -\gamma], \qquad (9)$

where $c$ is a fixed constant greater than $\gamma$. For simplicity, we assume that the label distribution on the source is uniform on $\{-1, 1\}$. $x_{\mathrm{sp}}$ is distributed such that $P(x_{\mathrm{sp}} = y \mid y) = p_{\mathrm{sp}}$, where $p_{\mathrm{sp}}$ controls the degree of spurious correlation. To model distribution shift, we simulate target data with a different degree of spurious correlation, i.e., $p^\prime_{\mathrm{sp}} \ne p_{\mathrm{sp}}$ in the target distribution $\mathcal{D}_T$. Note that here we do not consider shifts in the label distribution, but our result extends to arbitrary shifts in the label distribution as well.

In this setup, we examine linear sigmoid classifiers of the form $f(x) = \left(\sigma(w^\top x),\, 1 - \sigma(w^\top x)\right)$, where $\sigma$ is the sigmoid function and $w = (w_{\mathrm{inv}}, w_{\mathrm{sp}})$. We show that given a linear classifier that relies on the spurious feature and achieves non-trivial performance on the source (i.e., $w_{\mathrm{inv}} > 0$), ATC with the maximum confidence score function consistently estimates the accuracy on the target distribution. Define $\mathcal{S}_{\mathrm{maj}} = \{(x, y) : x_{\mathrm{sp}} = y\}$ and $\mathcal{S}_{\mathrm{min}} = \{(x, y) : x_{\mathrm{sp}} \ne y\}$. Notice that across target distributions we change the fraction of examples in $\mathcal{S}_{\mathrm{maj}}$ and $\mathcal{S}_{\mathrm{min}}$, but we do not change the distribution of examples within each individual set. Given any classifier with $w_{\mathrm{inv}} > 0$ in the above setting, assume that the threshold $\hat t$ is obtained with a finite-sample approximation of (1), i.e., $\hat t$ is selected such that (this is possible because a linear classifier with sigmoid activation assigns a unique score to each point in the source distribution)

$\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[s(f(x_i)) < \hat t\right] \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[\arg\max_j f_j(x_i) \ne y_i\right], \qquad (10)$

where $\{(x_i, y_i)\}_{i=1}^{n}$ are samples from the source distribution. Fix $\delta > 0$. Then, for $n$ sufficiently large, the estimate of accuracy by ATC as in (2) satisfies the following with probability at least $1 - \delta$:

$\left|\,\widehat{\mathcal{E}}_{\mathcal{D}^\prime_T}(f) - \mathcal{E}_{\mathcal{D}^\prime_T}(f)\,\right| \;\le\; c_{\mathrm{shift}} \sqrt{\frac{\log(4/\delta)}{2n}}, \qquad (11)$

where $\mathcal{D}^\prime_T$ is any target distribution considered in our setting and $c_{\mathrm{shift}}$ is a constant determined by the change in the minority-group proportion, taking one value when $w_{\mathrm{sp}} > 0$ and another otherwise.

Proof.

First we consider the case of $w_{\mathrm{sp}} > 0$. The proof follows in two simple steps. First, we notice that the classifier will make an error only on some points in $\mathcal{S}_{\mathrm{min}}$, and the threshold will be selected such that the fraction of points in $\mathcal{S}_{\mathrm{min}}$ with maximum confidence less than the threshold matches the error of the classifier on the source distribution. A classifier with $w_{\mathrm{inv}} > 0$ and $w_{\mathrm{sp}} > 0$ classifies all the points in $\mathcal{S}_{\mathrm{maj}}$ correctly. Second, since the distribution of points is not changing within $\mathcal{S}_{\mathrm{maj}}$ and $\mathcal{S}_{\mathrm{min}}$, the same threshold continues to work for an arbitrary shift in the fraction of examples in $\mathcal{S}_{\mathrm{min}}$, i.e., for any $p^\prime_{\mathrm{sp}}$.

Note that when $w_{\mathrm{sp}} > 0$, the classifier makes no error on points in $\mathcal{S}_{\mathrm{maj}}$ and makes an error on a subset of $\mathcal{S}_{\mathrm{min}}$, namely the points of $\mathcal{S}_{\mathrm{min}}$ closest to the decision boundary. Let $\mathcal{S}_{\hat t}$ denote the set of points that obtain a score less than or equal to $\hat t$. Now we will show that ATC chooses a threshold such that all points in $\mathcal{S}_{\mathrm{maj}}$ get a score above $\hat t$, i.e., $\mathcal{S}_{\hat t} \subseteq \mathcal{S}_{\mathrm{min}}$. First note that the scores of points close to the true separator $x_{\mathrm{inv}} = 0$ in $\mathcal{S}_{\mathrm{maj}}$ and $\mathcal{S}_{\mathrm{min}}$ match: by the symmetry of the sigmoid, the maximum-confidence score at a point $x$ matches the score at its reflection $-x$, i.e.,

$\max\left\{\sigma(w^\top x),\, 1 - \sigma(w^\top x)\right\} \;=\; \max\left\{\sigma(-w^\top x),\, 1 - \sigma(-w^\top x)\right\}. \qquad (12)$

Hence, if some point of $\mathcal{S}_{\mathrm{maj}}$ obtained a score at or below $\hat t$, then the corresponding points of $\mathcal{S}_{\mathrm{min}}$ (which include all of the classifier's errors) would also fall below $\hat t$, making the fraction of source points below the threshold strictly larger than the source error of the classifier, which is a contradiction violating the definition of $\hat t$ as in (10). Thus $\mathcal{S}_{\hat t} \subseteq \mathcal{S}_{\mathrm{min}}$.
Now we will relate the LHS and RHS of (10) to their population counterparts using Hoeffding's inequality and the DKW inequality to conclude (11). Using Hoeffding's bound, we have with probability at least $1 - \delta/2$,

$\left|\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[\arg\max_j f_j(x_i) \ne y_i\right] - \mathcal{E}_{\mathcal{D}_S}(f)\right| \;\le\; \sqrt{\frac{\log(4/\delta)}{2n}}. \qquad (13)$

With the DKW inequality, we have with probability at least $1 - \delta/2$,

$\sup_{t}\;\left|\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[s(f(x_i)) < t\right] - \mathbb{E}_{x \sim \mathcal{D}_S}\left[\mathbb{1}\left[s(f(x)) < t\right]\right]\right| \;\le\; \sqrt{\frac{\log(4/\delta)}{2n}} \qquad (14)$

for all $t$. Combining (13) and (14) at $t = \hat t$ with the definition (10), we have with probability at least $1 - \delta$,

$\left|\,\mathbb{E}_{x \sim \mathcal{D}_S}\left[\mathbb{1}\left[s(f(x)) < \hat t\right]\right] - \mathcal{E}_{\mathcal{D}_S}(f)\,\right| \;\le\; 2\sqrt{\frac{\log(4/\delta)}{2n}}. \qquad (15)$

Now for the case of $w_{\mathrm{sp}} < 0$, we can use the same arguments with the roles of $\mathcal{S}_{\mathrm{maj}}$ and $\mathcal{S}_{\mathrm{min}}$ exchanged. That is, since now all of the error will be on points in $\mathcal{S}_{\mathrm{maj}}$ and the classifier makes no error on $\mathcal{S}_{\mathrm{min}}$, we can show that the threshold will be selected such that the fraction of points in $\mathcal{S}_{\mathrm{maj}}$ with maximum confidence less than the threshold matches the error of the classifier on the source distribution. Again, since the distribution of points is not changing within $\mathcal{S}_{\mathrm{maj}}$ and $\mathcal{S}_{\mathrm{min}}$, the same threshold continues to work for an arbitrary shift in the fraction of examples in $\mathcal{S}_{\mathrm{maj}}$, i.e., for any $p^\prime_{\mathrm{sp}}$. Thus, with similar arguments, we obtain the bound in (11). ∎