Unsupervised Calibration under Covariate Shift

06/29/2020 ∙ by Anusri Pampari, et al. ∙ 12

A probabilistic model is said to be calibrated if its predicted probabilities match the corresponding empirical frequencies. Calibration is important for uncertainty quantification and decision making in safety-critical applications. While calibration of classifiers has been widely studied, we find that calibration is brittle and can be easily lost under minimal covariate shifts. Existing techniques, including domain adaptation methods, primarily focus on prediction accuracy and guarantee calibration neither in theory nor in practice. In this work, we formally introduce the problem of calibration under domain shift, and propose an importance sampling based approach to address it. We evaluate and discuss the efficacy of our method on both real-world and synthetic datasets.




1 Introduction

Machine learning models are increasingly being entrusted with complex decisions in applications such as medical diagnosis (Triantafyllidis and Tsanas, 2019), the justice system (Berk and Hyatt, 2015), financial decisions (Heaton et al., 2017), and human-robot interaction (Modares et al., 2015). In all these applications, models must not only be accurate, but should also indicate confidence in their own predictions. Uncertainty quantification is important for safety-critical applications and for decision making. It indicates when the model’s predictions are likely to be incorrect, and helps in building trust with the user. For example, in medical diagnosis, if the model is not confident about its prediction, then the decision should be passed on to a doctor. Additionally, humans have a natural cognitive intuition for probabilities (Cosmides and Tooby, 1996). Calibrated probabilities provide an intuitive explanation of a model’s predictions, making them interpretable.

Ideally, the confidence or probability associated with the predicted class label should reflect its ground truth occurrence likelihood. For example, suppose a diabetes risk prediction model predicts a chance of 70% for a specific patient profile. Then, we expect that out of 100 similar patients, about 70 should have diabetes. Such a model is said to be calibrated. Many existing machine learning models, such as SVMs, Gaussian processes, and neural networks, are not naturally calibrated (Guo et al., 2017; Bella et al., 2010), thus producing unreliable confidence estimates. This can, in turn, lead to bad decision making and reduce the trust in using these models.

Figure 1: Reliability diagram for a LeNet-5 model trained using CDAN (SOTA domain adaptation technique) on MNIST and tested on USPS as target data.

Existing literature (Platt and others, 1999; Zadrozny and Elkan, 2001; 2002; Bella et al., 2010; Guo et al., 2017) introduces many post-processing techniques to correct these miscalibrated models. However, they assume the availability of labeled held-out validation data drawn from the same distribution as the test data to achieve calibration. This assumption is violated in many real-world scenarios in the following two ways. First, the test dataset can have a different distribution due to covariate shift. This can happen, e.g., when the operating conditions at test time are slightly different. Second, labeled test data is often unavailable if distribution shift occurs after training. While several unsupervised domain adaptation methods (Chu and Wang, 2018; Kouw and Loog, 2019) propose solutions for correcting the accuracy of the models, there is no existing work to correct the effects of these circumstances on the confidence. For example, in Figure 1, we show how an existing calibration method, temperature scaling (t-scaling) (Guo et al., 2017), can fail to calibrate a LeNet-5 model trained using CDAN (a SOTA domain adaptation model (Long et al., 2018)) under domain shift. Here the model is trained on MNIST and has a prediction accuracy of 70% on the USPS dataset.

In this work, we introduce and investigate the problem of miscalibration under covariate shift. We demonstrate that existing models learnt using domain adaptation are poorly calibrated, showing that while current domain adaptation techniques account for accuracy, they do not consider calibration of the models. We then propose a modification to the calibration optimization objective used by existing techniques. Our solution employs importance sampling to account for the difference between the training and testing distributions, thereby overcoming the inherent assumptions of the existing methods. Our proposed method can adapt any existing calibration method under the covariate shift assumption without requiring any labeled data from the test distribution. In Figure 1, we show how our method (weighted t-scaling) adapts the use of t-scaling on source validation data to work under domain shift. The resulting calibration is close to perfect calibration and to the calibration obtained using labeled target data.

To summarize our contributions,

  • We introduce the problem of miscalibration under covariate shift, and show how existing domain adapted models such as CDAN (Long et al., 2018) remain uncalibrated in the target domain even after applying existing calibration methods;

  • We propose an importance sampling based solution to address the problem of calibration under covariate shift. Our method requires no additional labels from the test distribution and can be used to adapt any calibration method;

  • We use a discriminator trained on a domain-invariant feature layer of source and target to get density ratios for importance sampling.

2 Related Work

Background and Notation: Calibration can be described mathematically as follows. Suppose that we have some data, comprising inputs $x \in \mathcal{X}$ and labels $y \in \mathcal{Y} = \{1, \dots, K\}$, which follows the ground truth joint distribution $p(x, y)$. Let $f$ be a classifier learned for this data, i.e. $f(x)$, for an input $x$, is the probability distribution over the $K$ classes in $\mathcal{Y}$. The class with the maximum probability of occurrence is the prediction $\hat{y} = \arg\max_k f(x)_k$, and its corresponding probability is the confidence prediction $\hat{p} = \max_k f(x)_k$. The classifier is said to be calibrated (Guo et al., 2017) when,

$$P(\hat{y} = y \mid \hat{p} = c) = c, \quad \forall c \in [0, 1].$$


Many existing classifiers do not naturally satisfy these requirements (Guo et al., 2017; Bella et al., 2010), and are therefore said to be miscalibrated. This error is inevitable for many reasons, such as using finitely many samples to learn the classifier $f$, model mismatch from the true distribution, optimization issues, etc. Existing calibration techniques (Platt and others, 1999; Zadrozny and Elkan, 2001; 2002; Bella et al., 2010; Guo et al., 2017) reduce this error in post-processing steps to produce calibrated probabilities. A calibration model $g_\theta$ (parametrized by $\theta$) is applied over the uncalibrated classifier. Each method defines an approximate variant of the calibration error using a loss function of the form

$$L_p(\theta) = \mathbb{E}_{(x, y) \sim p}\left[\ell\big(g_\theta(f(x)), y\big)\right]$$

and learns the parameters $\theta$ as the minimizer of this loss. We briefly discuss some of the popular calibration methods and their corresponding expected loss functions.

Platt Scaling: Platt scaling (Platt and others, 1999) is a parametric approach to calibration. The multi-class predictions $f(x)$ of a classifier are used as features for a multinomial logistic regression model $g_\theta$, which is trained on the validation set to return probabilities. The accuracy of the classifier can change when using the calibrated probabilities for prediction (Guo et al., 2017).

Temperature Scaling (t-scaling): This method, proposed by Guo et al. (2017), is popularly used for neural network calibration. It uses a single scalar parameter, called the temperature, for all classes. Here, the calibrated probabilities do not affect the accuracy of the classifier $f$.

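The parameters of the calibration model in both of the above methods are optimized using the NLL loss over the validation set. Hence,

$$L_p(\theta) \approx -\frac{1}{n} \sum_{i=1}^{n} y_i^{\top} \log g_\theta(f(x_i))$$

where $n$ is the number of samples drawn from the joint distribution $p(x, y)$ and $y_i$ is represented as a one-hot vector of size $K$, the number of classes.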
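As a rough illustration of how temperature scaling is fit by minimizing the NLL on validation data, the sketch below uses plain Python. The function names (`log_softmax`, `nll`, `fit_temperature`), the golden-section search over the log-temperature, the search bounds, and the toy logits are all assumptions made for this example; they are not taken from Guo et al. (2017) or from the implementation used in this paper.

```python
import math

def log_softmax(logits, temperature):
    """Log-probabilities of softmax(logits / temperature), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-log_softmax(z, temperature)[y]
              for z, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logits_batch, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search over log-temperature for the NLL minimizer.

    The NLL is convex in 1/T, hence unimodal in log T, so a bracketing
    search is sufficient for this one-parameter problem.
    """
    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logits_batch, labels, math.exp(c)) < nll(logits_batch, labels, math.exp(d)):
            b = d  # minimizer lies in [a, d]
        else:
            a = c  # minimizer lies in [c, b]
    return math.exp((a + b) / 2)

# Hypothetical validation set: an overconfident 3-class model that is
# right on 4 of 6 samples (labels are class indices).
val_logits = [[4, 0, 0], [4, 0, 0], [0, 4, 0], [0, 0, 4], [4, 0, 0], [0, 4, 0]]
val_labels = [0, 1, 1, 2, 0, 2]

temperature = fit_temperature(val_logits, val_labels)
print("fitted temperature:", temperature)
print("NLL before:", nll(val_logits, val_labels, 1.0))
print("NLL after: ", nll(val_logits, val_labels, temperature))
```

On this toy set the classifier assigns roughly 0.96 confidence at temperature 1 while being correct on only 4 of 6 samples, so the fitted temperature comes out above 1 and softens the probabilities. Because dividing all logits by the same positive scalar does not change the arg max, the predicted classes are unaffected.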


Quantifying miscalibration: The common metrics used to report calibration performance are the Expected Calibration Error or ECE (Guo et al., 2017) and reliability diagrams (DeGroot and Fienberg, 1983; Niculescu-Mizil and Caruana, 2005). We briefly describe both of these measures and use them in our work to report calibration performance.

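We start by grouping confidence predictions into $M$ interval bins (each of size $1/M$). Let $B_m$ be the set of indices of samples whose prediction confidence falls into the interval $I_m = \left(\frac{m-1}{M}, \frac{m}{M}\right]$. We define the accuracy of bin $B_m$ as,

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}(\hat{y}_i = y_i)$$

where $\hat{y}_i$ and $y_i$ are the predicted and true class labels for sample $i$. We also define the average confidence within bin $B_m$ as,

$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i$$

where $\hat{p}_i$ is the confidence for sample $i$.

Expected Calibration Error (ECE): Guo et al. (2017) define ECE as a weighted average of the bins’ accuracy/confidence difference,

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where $n$ is the number of samples. Lower ECE indicates better calibration.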
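The binning procedure described above can be sketched in a few lines of plain Python. The function name `expected_calibration_error`, its argument layout, and the choice of 10 equal-width bins as the default are assumptions for illustration only; the sample values are made up.

```python
import math

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE with equal-width confidence bins ((m-1)/M, m/M], m = 1..M."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        # ceil(conf * M) - 1 is the zero-based index of the half-open bin;
        # a confidence of exactly 0 is placed in the first bin.
        m = min(n_bins - 1, max(0, math.ceil(conf * n_bins) - 1))
        bins[m].append(i)

    ece = 0.0
    for idx in bins:
        if not idx:
            continue  # empty bins contribute nothing
        acc = sum(predictions[i] == labels[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - avg_conf)
    return ece

# Hypothetical predictions: four samples at 0.95 confidence (three correct)
# and two samples at 0.65 confidence (one correct).
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
predictions = [1, 1, 1, 1, 0, 0]
labels      = [1, 1, 1, 0, 0, 1]
ece = expected_calibration_error(confidences, predictions, labels)
print("ECE:", ece)
```

Here the first bin has accuracy 0.75 against confidence 0.95 and the second has accuracy 0.5 against confidence 0.65, so the ECE is (4/6)(0.20) + (2/6)(0.15). The per-bin accuracies and confidences computed inside the loop are the same quantities a reliability diagram plots.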

Reliability diagrams are a visual representation of model calibration (DeGroot and Fienberg, 1983; Niculescu-Mizil and Caruana, 2005), as shown in Figure 1. These diagrams plot accuracy, or the empirical frequency, as a function of confidence for each bin. The x-axis therefore ranges over $[0, 1]$ and is divided into $M$ intervals. If the model is perfectly calibrated, the diagram should plot the identity function. Any deviation from a perfect diagonal represents miscalibration.

Limitations of existing work: Existing calibration methods, as discussed above, rely on the evaluation of the expected loss $L_p(\theta)$, which requires labeled held-out validation data. Here the methods inherently assume that the train, validation and test data are drawn from the same distribution $p(x, y)$. This is violated in many real-world scenarios where covariate shift occurs (discussed in Section 3). Snoek et al. (2019) empirically show that deep neural networks are uncalibrated under domain shift.

The effect of covariate shift on classifiers’ predictive performance, and various solutions to address it, have been studied in the unsupervised domain adaptation literature (Chu and Wang, 2018; Kouw and Loog, 2019). These methods assume the availability of labeled train data and no labels on the test data. The performance of these models is measured using accuracy, without considering the calibration of the models. Our work introduces this issue for domain adaptation models. We show how existing methods fail to calibrate domain adaptation models and provide a simple modification to adapt any existing calibration technique to dataset shift.

3 Miscalibration Under Covariate Shift

Figure 2: Comparing true probability distribution with (output of classifier ) obtained after post-hoc calibration. (a) true and (b) of source and after calibration using source (c) of target and after calibration using source (d) of source and after calibration using target (e) of source and after calibration using our weighted method. (f) Reliability diagram on target

In this section, we discuss covariate shift and how it affects calibration using a synthetic example shown in Figure 2. We organize this section as follows: (1) we formalize the assumption of covariate shift; (2) we consider a case of a miscalibrated classifier; (3) we attempt to calibrate the classifier using existing techniques on source validation data; and (4) we assume the availability of target labels and show how the calibration performance can differ from calibration using source data as in (3).

Covariate shift assumption: Consider that $p(x, y)$ represents the joint distribution of inputs and outputs of the training data and $q(x, y)$ represents the same for the testing data. Under covariate shift, we assume that

$$p(x) \neq q(x), \qquad p(y \mid x) = q(y \mid x),$$

i.e. the input distribution changes between train and test data (covariate denotes input), while the conditional distribution of the outputs given the inputs remains unchanged. This is illustrated in Figure 2(a) using two multivariate Gaussian distributions with different covariance matrices as our input distributions, $p(x)$ for source and $q(x)$ for target. We consider a binary classification task where $p(y \mid x)$ is the same for both source and target and changes as a function of the x-coordinate. This results in a difference between the joint distributions of train and test, $p(x, y) \neq q(x, y)$. The resulting labeled source and target data are also highlighted as source 1, source 0 and target 1, target 0.

Classifier mis-calibration: To emulate a realistic setting, we mis-specify the classifier by training a non-linear MLP classifier $f$ on finite samples from the source data. This gives a situation in which the probability distribution learnt by $f$ deviates significantly from the true $p(y \mid x)$. If $f$ were capable of learning the true relationship, the model would be calibrated on both source and target.

Correcting the calibration with existing techniques: We then attempt to calibrate $f$ using an existing calibration technique called isotonic regression (Zadrozny and Elkan, 2002), resulting in the calibrated probability distribution shown in Figure 2(b) for source and in Figure 2(c) for target. Here we notice that $f$ is calibrated on source, but not on target. For example, consider the 0.7-0.8 probability band. Here, 70-80% of the points are positive in source, showing calibration, whereas nearly 100% of the points in the target are positive (red), showing miscalibration. The loss function used in the current calibration methods (as discussed in Section 2) is defined on held-out source validation data assumed to have the same distribution as the train data, i.e. $L_p(\theta)$, the expected loss under the training distribution $p(x, y)$, is the evaluated loss function. These methods are ideal when the train and test distributions are identical. However, if $p(x, y) \neq q(x, y)$, where $q(x, y)$ is the test distribution, it follows that the expected loss function to be minimized is different for the train and test distributions, i.e. $L_p(\theta) \neq L_q(\theta)$. When calibration methods derived using the train data are applied to the shifted test data, the confidence estimates given by the model are no longer reliable.

Calibration using target data: One can solve this by obtaining labeled data under the test distribution and directly computing the expected loss function over the test data, $L_q(\theta)$. However, we often do not have access to the label information on the test data. Here, for illustration, we assume the availability of labeled test data for calibration and show the resulting probability distribution in Figure 2(d) for the target data. The difference between the resulting probability distributions in Figure 2(c) and Figure 2(d) clearly shows how the calibrated probability distributions differ when using source or target data for calibration. Further, the reliability diagram in Figure 2(f) shows quantitatively that using source data for calibration on target can perform worse than an uncalibrated model, moving it further away from perfect calibration.

Mis-calibration in domain adapted classifier: The above discussion also extends to the case when the classifier $f$ is learnt using existing domain adaptation techniques. In addition to labeled source data, these techniques also use the unlabeled target data to learn the classifier. This reduces overfitting of the learnt classifier on the labeled source data, and hence improves the generalization (or predictive accuracy) on the unseen target data. However, these models can still remain uncalibrated in the target domain. This is inevitable because we learn from finite source data, or simply due to optimization issues. In Figure 1 we show a reliability diagram of a domain adapted classifier (CDAN on LeNet-5) on the USPS dataset. The classifier is trained on labeled MNIST and achieves an accuracy of 70% on USPS. We notice that the uncalibrated classifier is far from perfectly calibrated. Even after using existing calibration methods like t-scaling on source validation data, the model still remains uncalibrated.

Both the synthetic and domain adapted examples discussed above show the performance gap between perfect calibration (or reference calibration obtained using labeled target data) and calibration obtained by using source data. We seek to close this performance gap without requesting new labeled data from the target.

4 Importance Sampling for Calibration Under Covariate Shift

To address the problem of miscalibration under covariate shift discussed in Section 3, we introduce an importance sampling approach for estimating the calibration loss. For this, we assume access to labeled training data $\{(x_i, y_i)\}_{i=1}^{n} \sim p(x, y)$, and unlabeled test data $\{x_j\}_{j=1}^{m} \sim q(x)$. A classifier $f$ is assumed to be trained either using only the labeled train data, or by using existing unsupervised domain adaptation techniques. Our objective is to ensure that the classifier is calibrated on the test distribution. We describe our approach and the intuition behind it.

Consider the calibration loss defined over the test distribution, $L_q(\theta) = \mathbb{E}_{(x, y) \sim q}\left[\ell\big(g_\theta(f(x)), y\big)\right]$. This cannot be computed using samples drawn from $q(x, y)$ since we do not have access to the labels from the test distribution. However, note that we have access to labeled samples drawn from $p(x, y)$, and hence the calibration loss over the training distribution can be computed as $L_p(\theta) = \mathbb{E}_{(x, y) \sim p}\left[\ell\big(g_\theta(f(x)), y\big)\right]$. Hence, we seek to adapt the calibration loss defined on the training distribution to formulate the calibration error on the test distribution. This can be done using importance sampling in the following way.

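Using the covariate shift assumption, we have $p(x) \neq q(x)$ and $p(y \mid x) = q(y \mid x)$. From these assumptions, it follows that:

$$
\begin{aligned}
\mathbb{E}_{(x, y) \sim q}\left[\ell\big(g_\theta(f(x)), y\big)\right]
&= \int \sum_{y} q(x)\, q(y \mid x)\, \ell\big(g_\theta(f(x)), y\big)\, dx \\
&= \int \sum_{y} \frac{q(x)}{p(x)}\, p(x)\, p(y \mid x)\, \ell\big(g_\theta(f(x)), y\big)\, dx \\
&= \mathbb{E}_{(x, y) \sim p}\left[\frac{q(x)}{p(x)}\, \ell\big(g_\theta(f(x)), y\big)\right].
\end{aligned}
$$

The above result is summarized in Theorem 4.1.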

Theorem 4.1

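The calibration loss with covariate shift correction on the test data is equivalent to the density ratio weighted calibration loss on the training data, i.e.

$$\mathbb{E}_{(x, y) \sim q}\left[\ell\big(g_\theta(f(x)), y\big)\right] = \mathbb{E}_{(x, y) \sim p}\left[w(x)\, \ell\big(g_\theta(f(x)), y\big)\right]$$

where $w(x) = q(x) / p(x)$ is the density ratio between the test input distribution $q(x)$ and the training input distribution $p(x)$, and where the support of $q(x)$ is contained in the support of $p(x)$.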

Weighting the train data with density ratios given by $w(x) = q(x)/p(x)$ is an importance sampling approach. By increasing the relative weight of those regions of the training distribution which also have a high density under the test distribution, we adapt the training loss $L_p$ to represent the test loss $L_q$. We can observe the qualitative behaviour of calibration when using our method in the following way. Consider the synthetic data example and the isotonic regression calibrator trained on the source data in Section 3. We incorporate the weighted calibration loss for isotonic regression to optimize the calibrator on the source data. Here, we use the ground truth density ratios computed from the known distributions. The resulting probability distribution obtained in Figure 2(e) using our method is similar to the probability distribution obtained by using target labels in Figure 2(d). We also note from the reliability diagram in Figure 2(f) that the performance after using the weighted calibration loss is closer to perfect calibration.
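To make the identity behind the density ratio weighted loss concrete, the following sketch checks it exactly on a small discrete problem in plain Python. Everything in it is hypothetical: the three-valued input, the source and target input distributions `p_x` and `q_x`, the shared conditional `p_y1_given_x`, and the fixed model confidences `conf_y1` are invented for illustration and do not correspond to the synthetic experiment in Figure 2.

```python
import math

# Toy covariate shift: x takes values in {0, 1, 2}, y in {0, 1}.
p_x = [0.6, 0.3, 0.1]            # training (source) input distribution
q_x = [0.1, 0.3, 0.6]            # test (target) input distribution
p_y1_given_x = [0.2, 0.5, 0.9]   # P(y = 1 | x), shared by source and target
conf_y1 = [0.4, 0.5, 0.6]        # model's predicted P(y = 1 | x), miscalibrated

def expected_nll(input_dist, weights=None):
    """Exact expected NLL of the model under input_dist, optionally weighted per x."""
    total = 0.0
    for x, px in enumerate(input_dist):
        w = 1.0 if weights is None else weights[x]
        loss_if_y1 = -math.log(conf_y1[x])
        loss_if_y0 = -math.log(1.0 - conf_y1[x])
        py1 = p_y1_given_x[x]
        total += px * w * (py1 * loss_if_y1 + (1.0 - py1) * loss_if_y0)
    return total

# Density ratio w(x) = q(x) / p(x); well defined because p(x) > 0 wherever q(x) > 0.
density_ratio = [q / p for q, p in zip(q_x, p_x)]

target_loss = expected_nll(q_x)                            # needs target labels in practice
weighted_source_loss = expected_nll(p_x, density_ratio)    # needs only source labels
unweighted_source_loss = expected_nll(p_x)                 # what standard calibrators minimize

print("target loss:            ", target_loss)
print("weighted source loss:   ", weighted_source_loss)
print("unweighted source loss: ", unweighted_source_loss)
```

The weighted source loss coincides with the target loss up to floating-point error, while the unweighted source loss differs. With finite samples the expectation is replaced by a weighted average over labeled source validation points, which is where estimation variance enters.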

In order to estimate the density ratios $w(x) = q(x)/p(x)$, we require knowledge of the true data distributions for the train and test data, which are unknown. However, we typically have sampling access to $p(x)$ and $q(x)$ via finite datasets, which we use to estimate the density ratios in a likelihood-free (LF) way. Examples of LF estimators include nearest neighbour (Kremer et al., 2015) and discriminative estimation (Bickel et al., 2007). We can, in principle, estimate this ratio directly in the original input space. However, when the inputs are high dimensional, the estimated loss may suffer from large estimation variance because of greater divergence between the distributions $p(x)$ and $q(x)$ (Snoek et al., 2019). We elaborate on this observation and discuss solutions to address it in the subsequent subsections.

4.1 Feature Representation for Importance Sampling

In this section, we discuss practical difficulties in applying importance weighted calibration and introduce a method to address some of these difficulties by using a suitable feature representation. There are two primary concerns in using importance weighted calibration on real train and test distributions.

  • Accuracy of estimation: The variance of the calibration loss estimate in Theorem 4.1, and hence the accuracy of the calibration, is affected by the divergence between the train and test input distributions $p(x)$ and $q(x)$. This relation is discussed in (Cortes et al., 2010) and summarized in Lemma 4.2 as follows:

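    Lemma 4.2

    The variance of the importance weighted calibration loss is bounded by the Renyi divergence $d_{\alpha+1}(q \,\|\, p)$ between the test distribution $q$ and the train distribution $p$. Following Cortes et al. (2010), for a loss $\ell$ bounded in $[0, 1]$ with expected test loss $L_q(\theta)$,

    $$\mathrm{Var}_{(x, y) \sim p}\left[w(x)\, \ell\big(g_\theta(f(x)), y\big)\right] \le d_{\alpha+1}(q \,\|\, p)\, L_q(\theta)^{1 - \frac{1}{\alpha}} - L_q(\theta)^2$$

    where Renyi divergence $d_{\alpha+1}(q \,\|\, p) = \left[\sum_{x} \frac{q(x)^{\alpha+1}}{p(x)^{\alpha}}\right]^{\frac{1}{\alpha}}$

    where hyperparameter $\alpha > 0$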


    From Lemma 4.2 it is clear that the smaller the divergence, the better the chance of getting an accurate estimate of the calibration loss, which in turn affects the accuracy of the final calibrator.

  • Unbounded/undefined density ratios: The support of the train distribution might not contain the support of the test distribution as required by Theorem 4.1. When this is violated, the density ratios can grow to infinity, resulting in an undefined or incorrect estimate.

We address both these concerns using a method similar to (You et al., 2019), by estimating importance weights using domain-invariant features instead of the original covariates. Let $p(z)$ and $q(z)$ be the domain-invariant feature distributions of the train and the test data respectively. We step from the input space to the feature space and estimate $q(z)/p(z)$ instead of $q(x)/p(x)$. This ensures that the variance of the calibration loss estimate is bounded, because with domain-invariant features the divergence between $q(z)$ and $p(z)$ is smaller than the divergence between $q(x)$ and $p(x)$. Furthermore, the assumption that the support of $q$ is contained in the support of $p$ can hold well in the learned feature space because of increased overlap in the distributions compared to the covariate space.

Note that by estimating the importance weights in the domain-invariant feature space we can only reduce the distribution divergence, and never reduce it completely to zero. Hence we can only expect to reduce the bias created by the original unweighted calibrator. Calibration as good as that obtained using the target labels is not guaranteed.

4.2 Density Ratio Estimation

To compute density ratios we adopt an approach similar to (Bickel et al., 2007; You et al., 2019), where a discriminator is used to distinguish source samples (with label d=1) from target samples (with label d=0). Under this model, with $z$ denoting the domain-invariant features, Bayes' rule gives

$$w(z) = \frac{q(z)}{p(z)} = \frac{P(z \mid d=0)}{P(z \mid d=1)} = \frac{P(d=1)}{P(d=0)} \cdot \frac{P(d=0 \mid z)}{P(d=1 \mid z)},$$

where density ratio estimation is decomposed into two parts: (1) $P(d=0 \mid z) / P(d=1 \mid z)$, which can be estimated by a discriminative model trained to distinguish source and target samples on the domain-invariant feature representation of source and target; and (2) $P(d=1) / P(d=0)$, a constant value that can be estimated from the sample sizes of both domains.
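As a minimal sketch of this decomposition, the function below converts discriminator outputs into importance weights. It assumes the discriminator returns the probability that a feature vector comes from the source domain (d=1), so the weight is the prior ratio (estimated from sample sizes) times the odds of target versus source. The function name, the `eps` guard against division by zero, and the example probabilities are assumptions for illustration; training the discriminator itself is not shown.

```python
def density_ratio_from_discriminator(p_source_given_z, n_source, n_target, eps=1e-12):
    """Estimate w(z) = q(z) / p(z) from discriminator outputs P(d=1 | z).

    p_source_given_z: discriminator probabilities that each sample is from source.
    n_source, n_target: sample counts used to estimate P(d=1) / P(d=0).
    """
    prior_ratio = n_source / n_target  # estimate of P(d=1) / P(d=0)
    weights = []
    for d in p_source_given_z:
        d = min(max(d, eps), 1.0 - eps)  # keep the odds finite
        weights.append(prior_ratio * (1.0 - d) / d)
    return weights

# Hypothetical discriminator outputs on three source validation samples,
# with equally sized source and target sets.
weights = density_ratio_from_discriminator([0.5, 0.2, 0.8], n_source=100, n_target=100)
print(weights)
```

A sample the discriminator cannot tell apart (probability 0.5) gets weight 1, a sample that looks like target data gets a weight above 1, and a sample that looks distinctly like source data gets a weight below 1. When the two sets have different sizes, the prior ratio removes the imbalance that the discriminator would otherwise absorb into its outputs.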

Practical considerations: The importance weights learnt by the discriminator may differ from the true density ratios. This can happen because (1) the divergence between train and test is not completely zero, leading to high variance (Lemma 4.2), and (2) training on finite samples from the source and target data leads to over-fitting of the discriminator on some features. This results in highly confident predictions and hence small importance weights. We follow (Grover et al., 2019) to offset these challenges using the following techniques: self-normalization, flattening, and clipping.
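The three adjustments named above can be sketched as simple transformations of a list of weights. The exact definitions here are assumptions: self-normalization is taken to rescale the weights to sum to one, flattening to raise each weight to a power `alpha` between 0 and 1, and clipping to bound each weight from below and/or above. The hyperparameter values and the order in which the steps are applied are likewise illustrative and are not taken from Grover et al. (2019) or from the experiments in this paper.

```python
def flatten(weights, alpha):
    """Raise weights to the power alpha in [0, 1]; alpha=0 gives uniform weights,
    alpha=1 leaves them unchanged."""
    return [w ** alpha for w in weights]

def clip(weights, lower=None, upper=None):
    """Bound each weight from below and/or above."""
    clipped = []
    for w in weights:
        if lower is not None:
            w = max(w, lower)
        if upper is not None:
            w = min(w, upper)
        clipped.append(w)
    return clipped

def self_normalize(weights):
    """Rescale weights so that they sum to one over the batch."""
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical raw weights with one very small and one very large value.
raw = [0.01, 0.25, 1.0, 4.0, 50.0]
adjusted = self_normalize(clip(flatten(raw, alpha=0.5), lower=0.2, upper=5.0))
print(adjusted)
```

Flattening and clipping trade a little bias for lower variance by pulling extreme weights towards the rest, and self-normalization keeps the scale of the weighted loss comparable across batches.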

5 Experimental Setup

Figure 3: Parameters that affect calibration. (a) Increasing divergence between source and target (b) Increasing number of samples used for calibration (c) Increasing noise in the ground truth importance weights

In this section, we evaluate the efficacy of our proposed importance weighting technique in adapting two post-hoc calibration methods, Platt scaling and temperature scaling (t-scaling), to handle domain shifts. We use the Expected Calibration Error (ECE) discussed in Sec. 2 to measure the calibration performance on the target data.

We compare the performance of our calibration method (Weighted) to three baselines:
(1) Uncalibrated, i.e. the source classifier as is without any post-hoc calibration;
(2) Unweighted, i.e. the post-hoc calibrator is trained on the source domain;
(3) Using target or target-calibrated i.e. the post-hoc calibrator is trained on the labeled target domain. This can be considered as a gold standard (requiring labels from target domain), i.e. a lower-bound on the calibration error for the target data.

Dataset   CIFAR-10 classes   Source ratio   Target ratio
S1→T1     2 & 7              1:4            4:1
S2→T2     1 & 8              2:5            3:4
S3→T3     3 & 4              5:1            1:3
S4→T4     6 & 9              2:3            5:1

Method            S1→T1   S2→T2   S3→T3   S4→T4
Uncalibrated      0.134   0.019   0.142   0.020
Platt scaling
  Unweighted      0.163   0.018   0.204   0.026
  Weighted        0.040   0.023   0.042   0.017
  Using target    0.037   0.024   0.041   0.016
t-scaling
  Unweighted      0.124   0.007   0.144   0.013
  Weighted        0.030   0.005   0.030   0.007
  Using target    0.027   0.005   0.029   0.007

Table 1: ECE scores of Platt scaling (top) and t-scaling (bottom), comparing baselines and the proposed method on pseudo-synthetic datasets. The weighted method uses ground truth density ratios for calibration.

The rest of the section is organized as follows. First, we study the behaviour of the calibration on pseudo-real datasets. This paradigm allows us to control the density ratios and analyze how the performance is affected by them. Then, we apply this technique on real-world datasets. Here, we derive the importance weights using the discriminative density ratio estimation method (Sec. 4.2) and use them for weighted calibration. The classifiers used here include both ImageNet pre-trained ResNet50 (He et al., 2016) models trained only on the labeled source data and popular domain adapted models such as CDAN (Long et al., 2018), which use both labeled source data and unlabeled target data.

Method            A→W     A→D     D→A     D→W     W→A     W→D
Uncalibrated      0.038   0.036   0.147   0.158   0.045   0.096
Platt scaling
  Unweighted      0.082   0.06    0.278   0.093   0.199   0.085
  Weighted        0.125   0.051   0.06    0.041   0.052   0.083
  Using target    0.105   0.062   0.057   0.06    0.026   0.013
t-scaling
  Unweighted      0.040   0.045   0.145   0.028   0.136   0.023
  Weighted        0.047   0.081   0.134   0.031   0.039   0.02
  Using target    0.047   0.030   0.031   0.029   0.028   0.013

(a) Office-31

Method            U→M     M→U
Uncalibrated      0.309   0.206
Platt scaling
  Unweighted      0.134   0.258
  Weighted        0.132   0.077
  Using target    0.035   0.023
t-scaling
  Unweighted      0.154   0.196
  Weighted        0.108   0.13
  Using target    0.140   0.042

(b) Digits (MNIST and USPS)

Table 2: ECE scores of Platt scaling (top) and t-scaling (bottom), comparing the baselines and the proposed method. Importance weights are estimated using the discriminator.
Method            Ar→Cl   Ar→Pr   Ar→Rw   Cl→Ar   Cl→Pr   Cl→Rw   Pr→Ar   Pr→Cl   Pr→Rw   Rw→Ar   Rw→Cl   Rw→Pr
Uncalibrated      0.114   0.091   0.119   0.135   0.133   0.113   0.098   0.158   0.021   0.124   0.102   0.027
Platt scaling
  Unweighted      0.109   0.098   0.093   0.104   0.077   0.099   0.306   0.197   0.133   0.083   0.159   0.148
  Weighted        0.093   0.081   0.093   0.071   0.051   0.068   0.085   0.173   0.081   0.067   0.132   0.059
  Using target    0.092   0.060   0.029   0.052   0.076   0.053   0.028   0.134   0.069   0.035   0.074   0.05
t-scaling
  Unweighted      0.145   0.033   0.038   0.041   0.079   0.078   0.038   0.287   0.104   0.117   0.167   0.027
  Weighted        0.128   0.015   0.064   0.069   0.064   0.057   0.085   0.268   0.108   0.104   0.173   0.038
  Using target    0.046   0.034   0.043   0.048   0.026   0.040   0.049   0.038   0.029   0.019   0.036   0.026

Table 3: Office-Home, domain adapted using CDAN. ECE scores of Platt scaling (top) and t-scaling (bottom), comparing baselines and the proposed method. Importance weights are estimated using the discriminator.

5.1 Pseudo-Real World Experiments

We construct synthetic datasets using the CIFAR-10 dataset, which consists of 60K color images distributed equally across ten object classes (Krizhevsky et al., 2009). We randomly pick two classes and collect samples from these two to define a binary classification task. We vary the mixing ratio of the two classes, thereby creating datasets with covariate shift. The ratios used in the experiments for the source and target domains are documented in Table 1 (source ratio and target ratio). For example, if the source consists of class 1 and class 2 in the ratio 1:4, then a target with the ratio 4:1 for the same classes can be seen to have a covariate shift. To create ratio values greater than one, we duplicate the data points of the class. Our method of construction automatically gives us the ground truth importance weights from the mixing ratios of the source and target data. In the previous example, the source points from class 1 have an importance weight of 4 and the source points in class 2 have an importance weight of 0.25 for the given target.
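The ground truth weights in this construction follow directly from the class proportions: the weight for a class is its proportion in the target divided by its proportion in the source. The short sketch below reproduces the worked example; the function name `class_importance_weights` and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction

def class_importance_weights(source_ratio, target_ratio):
    """Per-class weight = (class share in target) / (class share in source).

    Ratios are given as integer tuples, e.g. (1, 4) for a 1:4 mix of two classes.
    """
    s_total = sum(source_ratio)
    t_total = sum(target_ratio)
    return [Fraction(t, t_total) / Fraction(s, s_total)
            for s, t in zip(source_ratio, target_ratio)]

# S1 -> T1 from Table 1: source ratio 1:4, target ratio 4:1.
weights = class_importance_weights((1, 4), (4, 1))
print([float(w) for w in weights])  # [4.0, 0.25], matching the example in the text
```

Every source sample of a given class receives the same weight, so the weighted calibration loss effectively re-balances the source validation set to the target class mix.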

Using these ground truth importance weights we perform a weighted calibration of a LeNet-5 classifier trained on the labeled source data. We use 70% of the source data for training the classifier and 30% as validation data for calibration. For testing, we use 70% of the target data to compute the ECE and 30% as validation data for the target-calibrated model. We consider 10 different train, validation, and test splits and report the mean in Table 1 (standard deviation in supplementary). We note that the weighted calibration significantly outperforms the unweighted and uncalibrated models except in S2→T2, where the source ratio is close to the target ratio. We further make use of this setting to empirically study the effect of the following parameters on the calibration performance (ECE) in the target domain.

Domain shift between source and target: We consider two classes of CIFAR-10, fix the source class ratio, and fix the calibration method to t-scaling. We then vary the target class ratio so that the domain shift of the target relative to the source increases. In Figure 3(a), we see that the performance gap in ECE between unweighted source calibration and calibration using target labels increases as the data shift increases. The weighted calibration performs as well as using target labels for calibration (lower bound), even though our method does not have access to any target labels.

Number of validation samples used for training the calibrator: We vary the number of validation points used for t-scaling in Figure 3(b) and keep the remaining parameters fixed. We note that the weighted ECE performance may worsen compared to unweighted calibration at significantly smaller validation sample sizes. Also, the performance of target calibration and unweighted source calibration may itself degrade with decreasing validation samples. The sample size threshold for this degradation may differ based on the complexity of the dataset, e.g., the number of classes.

Quality of importance weights: In reality, empirically estimated density ratios are noisy and may deviate from the ground truth ratios. In Figure 3(c) we simulate this setting by increasing the average amount of normal noise added to the importance weights (the rest of the parameters are held constant) and observing its effect on the calibration performance. We notice that with increasing noise, the performance of weighted calibration can degrade to below that of the uncalibrated model. The sensitivity of the calibrator to noise can change with the extent of domain shift.

5.2 Experiments on Real World Data

In this section, we evaluate calibration performance on real-world datasets using classifiers trained only on source (using pre-trained ResNet50) and a range of domain adapted classifiers (LeNet-5 and ResNet50 trained using CDAN; we train using the publicly available code provided by the authors and a pre-trained ResNet-50). The accuracy of all the models is reported in the supplementary. We divide both source and target data into 70/30 splits (we directly use the standard train/test split when available). For the source, the larger split is used for training and the smaller split is used as validation for the post-hoc calibration method. For the target, the larger split is used for testing and the smaller split is used to train the target-calibrated model. To obtain importance weights for our method we train a discriminator (2-hidden-layer MLP) on domain-invariant features as discussed in Section 4.2. We perform normalization on the obtained weights, and leave experimentation with flattening and clipping for future work. We experiment with different discriminator and calibrator initializations, keeping the classifier and dataset fixed, and report the mean performance over 5 runs (standard deviation in supplementary).

Classifiers trained only on source: We use the Office-31 dataset (Saenko et al., 2010), which is concerned with the task of object recognition. This dataset has images from three domains: Amazon images (A), Webcam (W) (low-resolution) and DSLR (D) (high-resolution), with 4,652 images and 31 categories. We evaluate on six source to target transfer tasks: A→W, A→D, D→A, D→W, W→A and W→D. We use an ImageNet pre-trained ResNet-50 as our initial classifier, with the final layer replaced to output 31 classes. We re-train it on the labeled source data and test it on the target. The domain-invariant feature representation for the discriminator is obtained from the final layer of the pre-trained ResNet-50 model.

Domain adapted classifiers: We use the Conditional Domain Adversarial Network (CDAN) (Long et al., 2018), a recent domain adaptation technique, to train two different classifiers on the following datasets. (1) The Digits dataset consists of images from MNIST (M) and USPS (U) (Ganin et al., 2016), comprising 10 classes; here we apply CDAN to a LeNet-5 classifier. We evaluate two source-to-target transfer tasks, M → U and U → M. (2) The Office-Home dataset (Venkateswara et al., 2017) consists of images from Art (Ar, 2427), Clipart (Cl, 4365), Product (Pr, 4439) and Realworld (Rw, 4357) (sizes in parentheses), comprising 65 classes. Here we apply CDAN to a ResNet-50 classifier. We evaluate the 12 source-to-target transfer tasks shown in Table 3, covering all ordered pairs of the four domains. For both datasets, we use the features obtained from the domain-adapted layer of CDAN to train our discriminator.
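
The count of 12 Office-Home tasks follows from taking every ordered (source, target) pair of the four domains. A small sketch, where the dictionary name `office_home` is just an illustrative container for the sizes quoted above:

```python
from itertools import permutations

# Domain abbreviations and sizes as quoted in the text.
office_home = {"Ar": 2427, "Cl": 4365, "Pr": 4439, "Rw": 4357}

# Every ordered pair of distinct domains is one source -> target task.
tasks = [f"{src} -> {tgt}" for src, tgt in permutations(office_home, 2)]
print(len(tasks))  # 12
print(tasks)
```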

Discussion: In Tables 2(a), 2(b) and 3, we compare the ECE scores of our weighted calibration methods with the baselines. The models here span a range of accuracies from 30% to 97% on the target data and still remain uncalibrated. This shows that accounting for accuracy alone does not guarantee calibration. We use bold font to highlight results where weighted calibration outperforms the uncalibrated ECE. In italics we highlight results where weighted calibration reduces the bias of unweighted calibration but still performs worse than the uncalibrated ECE. This is in agreement with our discussion in Section 4.2, where we note that we can only reduce the bias incurred by using the source data for calibration, not eliminate it completely. From these experiments, we observe that our proposed method improves calibration performance considerably in a number of cases, such as D → A in the Office-31 dataset, where the ECE improves from 14.7% to 6%; M → U in the MNIST-USPS dataset, where the ECE improves from 20.6% to 7.7%; and Cl → Pr in the Office-Home dataset, where the ECE improves from 13.3% to 6.4%.
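
For readers who want to reproduce this kind of comparison, the following is a minimal sketch of a binned expected calibration error (ECE). The equal-width binning and the default of 15 bins are assumptions, since the binning scheme is not restated in this section. The optional `weights` argument shows how per-sample importance weights could enter the estimate; it is an illustration, not a claim about the evaluation protocol behind the tables.

```python
import numpy as np

def expected_calibration_error(confidences, correct, weights=None, n_bins=15):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the predicted class, in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    weights: optional non-negative per-sample weights (uniform if None).
    """
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    w = np.ones_like(conf) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()                                    # weights act as probability mass

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed equal-width bins; a confidence of exactly 0 lands in the first bin.
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        mass = w[mask].sum()
        if mass > 0:
            acc = np.sum(w[mask] * corr[mask]) / mass      # weighted accuracy in the bin
            avg_conf = np.sum(w[mask] * conf[mask]) / mass  # weighted mean confidence
            ece += mass * abs(acc - avg_conf)
    return float(ece)

# Two predictions at 80% confidence, one right and one wrong.
print(expected_calibration_error([0.8, 0.8], [1, 0]))
# The same predictions, with the correct one weighted three times as heavily.
print(expected_calibration_error([0.8, 0.8], [1, 0], weights=[3.0, 1.0]))
```

The function returns a fraction, so a value of 0.147 corresponds to an ECE of 14.7% in the notation used above.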

To explain the poor performance of weighted calibration on the remaining datasets, we refer to the analysis in Figure 3. For example, in the Office-31 experiments, consider the transfer tasks in which one domain has considerably more data than the other. This imbalance could have affected the importance weight estimation (leading to overfitting of the discriminator and hence to poor importance weights), or the small number of validation samples could itself have affected the performance of both weighted calibration and calibration using target labels. In general, calibration performance can be affected by multiple factors.

6 Conclusion

In this work, we identified that neural models, including domain-adapted models, are miscalibrated under covariate shift. This indicates that existing domain adaptation techniques focus on accuracy and not calibration. Existing calibration techniques fail to calibrate these models and sometimes even worsen calibration performance. This is a result of the inherent assumptions made by these techniques, which fail to hold when domain shift occurs. We propose a new method that overcomes the limitations of the existing techniques and adapts any calibration technique to work under domain shift using importance sampling. We show that with ground-truth density ratios our method significantly improves the calibrator. We further implement the proposed method on real-world datasets by employing a binary classifier to estimate the density ratios. We demonstrate performance improvements on different datasets and analyze the effects of the different parameters involved. We also note that the efficacy of our method on real-world datasets is limited by the accuracy of the density ratio estimation process. Improving density ratio estimation is therefore a crucial direction for future research, which will help in improving calibration performance.

References


  • A. Bella, C. Ferri, J. Hernández-Orallo, and M. J. Ramírez-Quintana (2010) Calibration of machine learning models. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pp. 128–146. Cited by: §1, §1, §2.
  • R. Berk and J. Hyatt (2015) Machine learning forecasts of risk to inform sentencing decisions. Federal Sentencing Reporter 27 (4), pp. 222–228. Cited by: §1.
  • S. Bickel, M. Brückner, and T. Scheffer (2007) Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pp. 81–88. Cited by: §4.2, §4.
  • C. Chu and R. Wang (2018) A survey of domain adaptation for neural machine translation. arXiv preprint arXiv:1806.00258. Cited by: §1, §2.
  • C. Cortes, Y. Mansour, and M. Mohri (2010) Learning bounds for importance weighting. In Advances in neural information processing systems, pp. 442–450. Cited by: item 1.
  • L. Cosmides and J. Tooby (1996) Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty. Cognition 58 (1), pp. 1–73. Cited by: §1.
  • M. H. DeGroot and S. E. Fienberg (1983) The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician) 32 (1-2), pp. 12–22. Cited by: §2, §2.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §5.2.
  • A. Grover, J. Song, A. Kapoor, K. Tran, A. Agarwal, E. J. Horvitz, and S. Ermon (2019) Bias correction of learned generative models using likelihood-free importance weighting. In Advances in Neural Information Processing Systems, pp. 11056–11068. Cited by: §4.2.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. Cited by: §1, §1, §2, §2, §2, §2, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.
  • J. Heaton, N. Polson, and J. H. Witte (2017) Deep learning for finance: deep portfolios. Applied Stochastic Models in Business and Industry 33 (1), pp. 3–12. Cited by: §1.
  • W. M. Kouw and M. Loog (2019) A review of domain adaptation without target labels. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2.
  • J. Kremer, F. Gieseke, K. S. Pedersen, and C. Igel (2015) Nearest neighbor density ratio estimation for large-scale applications in astronomy. Astronomy and Computing 12, pp. 67–72. Cited by: §4.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.1.
  • M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640–1650. Cited by: 1st item, §1, §5.2, §5.
  • H. Modares, I. Ranatunga, F. L. Lewis, and D. O. Popa (2015) Optimized assistive human–robot interaction using reinforcement learning. IEEE transactions on cybernetics 46 (3), pp. 655–667. Cited by: §1.
  • A. Niculescu-Mizil and R. Caruana (2005) Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pp. 625–632. Cited by: §2, §2.
  • J. Platt et al. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10 (3), pp. 61–74. Cited by: §1, §2, §2.
  • K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In European conference on computer vision, pp. 213–226. Cited by: §5.2.
  • J. Snoek, Y. Ovadia, E. Fertig, B. Lakshminarayanan, S. Nowozin, D. Sculley, J. Dillon, J. Ren, and Z. Nado (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13969–13980. Cited by: §2, §4.
  • A. K. Triantafyllidis and A. Tsanas (2019) Applications of machine learning in real-life digital health interventions: review of the literature. Journal of medical Internet research 21 (4), pp. e12286. Cited by: §1.
  • H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. In (IEEE) Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §5.2.
  • K. You, X. Wang, M. Long, and M. Jordan (2019) Towards accurate model selection in deep unsupervised domain adaptation. In International Conference on Machine Learning, pp. 7124–7133. Cited by: §4.1, §4.2.
  • B. Zadrozny and C. Elkan (2001) Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In ICML, Vol. 1, pp. 609–616. Cited by: §1, §2.
  • B. Zadrozny and C. Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699. Cited by: §1, §2, §3.