Adversarial Examples for Cost-Sensitive Classifiers

10/04/2019 ∙ by Gavin S. Hartnett, et al. ∙ RAND Corporation

Motivated by safety-critical classification problems, we investigate adversarial attacks against cost-sensitive classifiers. We use current state-of-the-art adversarially-resistant neural network classifiers [1] as the underlying models. Cost-sensitive predictions are then achieved via a final processing step in the feed-forward evaluation of the network. We evaluate the effectiveness of cost-sensitive classifiers against a variety of attacks, and we introduce a new cost-sensitive attack which performs better than targeted attacks in some cases. We also explore the measures a defender can take in order to limit their vulnerability to these attacks. This attacker/defender scenario is naturally framed as a two-player zero-sum finite game, which we analyze using game theory.




1 Introduction

Many safety-critical classification problems are not indifferent to the different types of possible errors. For example, when classifying tumors in medical images one may be relatively indifferent between misclassifications within the super-categories of malignant or benign tumors, but may be particularly interested in avoiding misclassifications across those categories, for example misidentifying a malignant Lobular Carcinoma tumor as being instead a benign Fibroadenoma tumor cancerClassifier .

It is a relatively simple matter to adjust the predictions of a trained classifier to reflect the different costs associated with the various types of classification errors using a formalism known as cost-sensitivity elkan2001foundations ; domingos1999metacost . Without cost-sensitivity, the most likely class is taken to be the prediction made by a classification model. In contrast, cost-sensitive classifiers make predictions by first computing the expected cost associated with each prediction, and then taking the class with the smallest expected cost to be the model prediction.

Motivated by safety-critical classification problems, we investigate adversarial attacks on cost-sensitive classifiers. We use current state-of-the-art adversarially-resistant neural network classifiers xie2018feature as the underlying models, and we considered multiple types of attacks, as well as various defensive actions that may be taken to mitigate the effect of the attacks. Our key findings are:

  • Classifiers face a trade-off between maximizing accuracy and minimizing cost:
    Predictions can be made with the goal of either maximizing the accuracy or minimizing the expected cost. While these are not diametrically opposed goals (for example a perfect classifier will incur zero cost), in practice there will be a trade-off where the classifier can make conservative predictions which lower both the cost and the overall accuracy.

  • The attacker faces a trade-off between minimizing accuracy and maximizing cost:
    Similarly, the attacker can craft adversarial examples designed to either minimize the defender’s accuracy or increase their average cost. As before, these goals aren’t necessarily in conflict with one another - for example if the attacks succeed 100% of the time, then both goals may be simultaneously accomplished, but in practice attacks will only succeed some fraction of the time and the attacker will be faced with a trade-off.

  • Calibration leads to both better defenses and more effective attacks:

    The expected cost depends on the predicted probabilities of the neural network and not just on the overall class prediction. Therefore, it is important that the classifier produce accurate probability estimates. We find that both cost-sensitive defenses and attacks may be improved by calibrating these estimates.

  • The attacker/defender scenario is naturally analyzed in terms of game theory:
    We explored many different pairings of attacks and defensive measures. The identification of good strategies becomes more difficult as the number of possible scenarios increases. We observe that this problem is naturally framed as a two-player zero-sum finite game, and therefore the game theoretic concepts of Nash equilibria and dominant strategies may be used to analyze the attacker/defender competition.

2 Cost-Sensitivity for classification problems

In this section we provide a brief review of cost-sensitivity elkan2001foundations ; domingos1999metacost. Throughout this work we shall consider classification problems and denote the inputs as $x$, and we will use the indices $i, j$ to run over all possible classes, i.e. $i, j \in \{1, \dots, N\}$.

Of central importance in cost-sensitive classification problems is the cost-matrix $C$, whose entry $C_{ij}$ is defined to be the cost of predicting class $i$ when the correct class is $j$. The cost may be measured in any units since the cost-sensitive predictions are unaffected by scaling the cost matrix by an overall constant. We shall require that the costs are non-negative, $C_{ij} \ge 0$, with equality if and only if $i = j$, which reflects the fact that a correct classification should incur no cost. Given the cost matrix and $p_j(x)$, the model estimate for the probability that an input $x$ belongs to class $j$, the expected cost of predicting class $i$ will be denoted as $L_i(x)$, and is simply elkan2001foundations

$$ L_i(x) = \sum_j C_{ij} \, p_j(x). \tag{1} $$

The cost-sensitive (CS) prediction is then the class for which the expected cost is smallest, i.e.

$$ \hat{y}_{\mathrm{CS}}(x) = \arg\min_i L_i(x). \tag{2} $$

In contrast, in most classification settings the prediction is taken to be the most likely class:

$$ \hat{y}_{\mathrm{MP}}(x) = \arg\max_i p_i(x), \tag{3} $$

which we shall refer to as the maximum probability (MP) prediction.
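The two prediction rules can be compared on a small example. The following is a minimal sketch in plain Python, not the authors' implementation: the 3-class cost matrix and the probability vector are hypothetical values chosen for illustration, and classes are indexed from 0 rather than 1.

```python
def expected_costs(cost, probs):
    """Expected cost of predicting each class i: sum_j cost[i][j] * probs[j]."""
    return [sum(c_ij * p_j for c_ij, p_j in zip(row, probs)) for row in cost]


def predict_max_prob(probs):
    """Maximum probability (MP) prediction: the most likely class."""
    return max(range(len(probs)), key=lambda i: probs[i])


def predict_min_cost(cost, probs):
    """Cost-sensitive (CS) prediction: the class with the smallest expected cost."""
    costs = expected_costs(cost, probs)
    return min(range(len(costs)), key=lambda i: costs[i])


# Hypothetical cost matrix: COST[i][j] is the cost of predicting class i
# when the correct class is j. Missing class 2 is made expensive.
COST = [
    [0.0, 1.0, 10.0],
    [1.0, 0.0, 10.0],
    [1.0, 1.0, 0.0],
]
probs = [0.5, 0.3, 0.2]  # hypothetical model output for one input

mp_prediction = predict_max_prob(probs)
cs_prediction = predict_min_cost(COST, probs)
```

For these values the MP rule picks class 0, the most likely class, while the CS rule picks class 2: even at probability 0.2, the cost of missing class 2 dominates the expected cost of the other two predictions.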

2.1 Geometry of cost-sensitive predictions

To gain an intuition for how cost-sensitive predictions compare to the more standard maximum probability predictions, it is useful to consider the problem from a geometrical perspective. Binary classification is especially simple: if the probability of class 1 is denoted $p$, then the probability of class 2 is $1 - p$. The maximum probability prediction is determined by whether $p < 1/2$ (prediction is class 2) or $p > 1/2$ (prediction is class 1). The effect of cost-sensitivity then is to shift the decision threshold from $1/2$ to a new value $p^*$ determined by the relative cost of the two types of errors. That is, class 1 is predicted if $p > p^*$, where now

$$ p^* = \frac{C_{12}}{C_{12} + C_{21}}, \tag{4} $$

with $C_{12}$ the cost of predicting class 1 when the correct class is 2, and $C_{21}$ the cost of predicting class 2 when the correct class is 1.
Figure 1: The unit 2-simplex as a surface embedded in the 3-dimensional space spanned by the probability coordinates $(p_1, p_2, p_3)$. The simplex has been divided into cells corresponding to the maximum-probability prediction. The figure has been oriented with the origin behind the simplex.

Figure 2: The same simplex plotted in Fig. 1, now with the cells determined according to the minimum cost prediction. The cost matrix used here is given by , , and (and zero diagonal entries). Misidentifying class 3 incurs a large cost, and hence the class 3 cell has expanded accordingly.

The higher dimensional case of $N > 2$ is more interesting. In this case it is more useful to work in terms of the probability simplex, which is an $(N-1)$-dimensional hyper-surface embedded in $N$ dimensions. The embedding coordinates are the class probabilities, i.e. $p_i(x)$ for an input $x$, and the simplex is the surface satisfying the constraints $\sum_i p_i = 1$ and $p_i \ge 0$. The vertices of the simplex are the points for which all the probability mass is placed on a single class. The simplex may be divided into cells such that all points within a single cell will lead to the same maximum probability prediction. This is depicted for $N = 3$ in Fig. 1. The effect of cost-sensitivity is to shift the cell boundaries, for example as in Fig. 2. In general, cells representing classes which are costly to misidentify, for example the malignant Lobular Carcinoma tumor discussed above, will expand, corresponding to an increased risk aversion.
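The geometric picture can be checked numerically. The sketch below is illustrative only: the costs are hypothetical (they are not the values used in Fig. 2), classes are indexed from 0, and the cell sizes are estimated by sampling points uniformly on the simplex rather than computed exactly. It first verifies the shifted threshold in the binary case, then estimates how much of the 3-class simplex each rule assigns to each class.

```python
import random


def min_cost_class(cost, probs):
    """Class with the smallest expected cost sum_j cost[i][j] * probs[j]."""
    expected = [sum(c * p for c, p in zip(row, probs)) for row in cost]
    return min(range(len(expected)), key=lambda i: expected[i])


def max_prob_class(probs):
    """Most likely class."""
    return max(range(len(probs)), key=lambda i: probs[i])


def binary_threshold(c12, c21):
    """Threshold p* above which the first class is the minimum cost prediction.

    c12: cost of predicting the first class when the second is correct.
    c21: cost of predicting the second class when the first is correct.
    """
    return c12 / (c12 + c21)


def cell_shares(predict, n_classes, n_samples, rng):
    """Fraction of uniformly sampled simplex points assigned to each class."""
    counts = [0] * n_classes
    for _ in range(n_samples):
        draws = [rng.expovariate(1.0) for _ in range(n_classes)]
        total = sum(draws)
        probs = [d / total for d in draws]  # normalized exponentials are uniform on the simplex
        counts[predict(probs)] += 1
    return [c / n_samples for c in counts]


# Binary case with hypothetical costs: missing the first class is 9x worse.
p_star = binary_threshold(c12=1.0, c21=9.0)
BINARY_COST = [[0.0, 1.0], [9.0, 0.0]]

# Three-class case with a hypothetical cost matrix that penalizes missing class 2.
COST = [[0.0, 1.0, 10.0], [1.0, 0.0, 10.0], [1.0, 1.0, 0.0]]
rng = random.Random(0)
mp_shares = cell_shares(max_prob_class, 3, 20000, rng)
cs_shares = cell_shares(lambda p: min_cost_class(COST, p), 3, 20000, rng)
```

With these costs the binary threshold moves from 1/2 down to 0.1, and in the 3-class case the cell of the costly class grows from roughly one third of the simplex to well over three quarters of it.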

2.2 Multi-class classification problems with two super-categories

A general cost-matrix for $N$-class classification is determined by $N(N-1)$ parameters (assuming that the diagonals are zero, representing zero cost for correct predictions). Both for simplicity and because we are motivated by scenarios such as the benign/malignant tumor classification discussed above, we consider a much smaller family of cost-matrices. We split the classes into 2 super-categories, which we call the "sensitive" and "insensitive" categories. Let there be $N_I$ members of the insensitive group and $N_S$ members of the sensitive group, with $N = N_I + N_S$, and order the labels so that $i \in \{1, \dots, N_I\}$ for the insensitive group members and $i \in \{N_I + 1, \dots, N\}$ for the sensitive group. We shall consider scenarios where the main concern is inter-category misclassifications, especially misclassifying a sensitive class as an insensitive class (i.e. misidentifying a malignant tumor as a benign tumor). Intra-category misclassifications will also have associated costs, although they will be less significant than inter-category costs.

In this scenario, we can break the cost-matrix into 4 blocks,

$$ C = \begin{pmatrix} C^{II} & C^{IS} \\ C^{SI} & C^{SS} \end{pmatrix}, \tag{5} $$

and take each constituent block matrix to be

$$ C^{II}_{ij} = c_{II} \, (1 - \delta_{ij}), \qquad C^{IS}_{ij} = c_{IS}, \qquad C^{SI}_{ij} = c_{SI}, \qquad C^{SS}_{ij} = c_{SS} \, (1 - \delta_{ij}). \tag{6} $$

Here the lower-case $c$'s are constants, and the $\delta_{ij}$ in the two diagonal blocks are Kronecker deltas. The constant $c_{II}$ is the cost of misclassifications within the insensitive super-category, and $c_{SS}$ is similarly the cost of misclassifications within the sensitive super-category. The off-diagonal term $c_{IS}$ represents the cost of mis-labeling a sensitive class as insensitive, and vice versa for $c_{SI}$. Motivated by safety-critical scenarios where the most costly type of mistake is mis-identifying a sensitive class as insensitive, we will assume that the different costs obey the following inequalities:


so that the cost-matrix is determined by just 4 independent numbers.
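A block-structured cost matrix of this kind is straightforward to build in code. The sketch below is an illustration under stated assumptions: the function name, the convention that insensitive classes come first, and the numerical costs are all hypothetical and are not taken from the paper.

```python
def block_cost_matrix(n_insensitive, n_sensitive, c_ii, c_ss, c_is, c_si):
    """Build cost[i][j], the cost of predicting class i when class j is correct.

    Classes 0 .. n_insensitive-1 are insensitive, the rest are sensitive.
    c_ii, c_ss: intra-category costs (insensitive, sensitive).
    c_is: cost of labeling a sensitive class as insensitive.
    c_si: cost of labeling an insensitive class as sensitive.
    """
    n = n_insensitive + n_sensitive
    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):          # predicted class
        for j in range(n):      # correct class
            if i == j:
                continue        # correct predictions incur no cost
            predicted_insensitive = i < n_insensitive
            correct_insensitive = j < n_insensitive
            if predicted_insensitive and correct_insensitive:
                cost[i][j] = c_ii
            elif not predicted_insensitive and not correct_insensitive:
                cost[i][j] = c_ss
            elif predicted_insensitive:
                cost[i][j] = c_is   # sensitive class mislabeled as insensitive
            else:
                cost[i][j] = c_si   # insensitive class mislabeled as sensitive
    return cost


# Hypothetical values for a 4-class problem with 2 classes per super-category.
cost = block_cost_matrix(2, 2, c_ii=1.0, c_ss=1.0, c_is=10.0, c_si=5.0)
```

Only four numbers parametrize the full matrix, which keeps the cost model easy to specify even when the number of classes is large.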

3 Adversarial examples for cost-sensitive classifiers

Cost-sensitivity is particularly relevant for safety-critical scenarios because it enables classifiers to take into account the fact that some mistakes are more deleterious than others. This general framework naturally complements the context of adversarial examples, which are artificially generated inputs of a classifier designed to cause mistakes szegedy2013intriguing, and which are an important threat for safety-critical applications of classifiers. (Footnote 1: See gilmer2018motivating for an analysis of the concrete ways in which adversarial examples are relevant for AI Safety.) The general idea that different misclassifications are associated with different costs should be reflected both in how the classifier makes predictions and in the types of adversarial attacks a malicious actor would choose to employ in order to cause maximum damage.

Concretely, we consider an attacker/defender scenario in which the defender is a neural network classifier, and the attacker is an agent attempting to fool the defender by presenting it adversarial examples. We will investigate multiple types of attacks against classifiers making predictions according to both the maximum probability criterion and the minimum cost criterion. We also consider both white-box and black-box scenarios, in which the attacker has or does not have access to the defender network, respectively.

As usual, we take the attack to be defined by a constrained optimization problem. To set notation, let $J(x)$ be the objective function ($J$ may also depend on other quantities such as the target label), and the optimization problem is then

$$ \max_{x' \in \mathcal{A}(x)} J(x'). \tag{8} $$

Here $\mathcal{A}(x)$ is the attack set, the set of allowable perturbations around a given clean input $x$. Throughout this work, we will take $\mathcal{A}(x)$ to be an $\epsilon$-ball in the $\ell_\infty$ norm, i.e. $\mathcal{A}(x) = \{ x' : \| x' - x \|_\infty \le \epsilon \}$. (Footnote 2: Other authors, most notably gilmer2018motivating, have noted a number of shortcomings in using this attack set for research into safety-critical implications of adversarial examples. We do not disagree with these observations, but will work with the $\ell_\infty$ ball nonetheless both for mathematical convenience and because we regard this issue as orthogonal to the main idea of the current work, which is the relevance of cost-sensitivity to adversarial example research.) Independent of the objective function, in all cases we shall use the same projected gradient descent (PGD) method of madry2017towards to solve the optimization problem and to generate examples. The attack PGD update rule is

$$ x^{(t+1)} = \Pi_{\mathcal{A}(x)} \left( x^{(t)} + \alpha \, \mathrm{sign}\left( \nabla_x J(x^{(t)}) \right) \right), \tag{9} $$

where $x^{(t)}$ represents a sequence of perturbed inputs, $\alpha$ is the step-size parameter, and $\Pi_{\mathcal{A}(x)}$ is a projection operator that projects the perturbation down to the attack set $\mathcal{A}(x)$. The initial perturbation, $x^{(0)}$, will be randomly initialized within the attack set $\mathcal{A}(x)$.

3.1 Targeted attacks

We will consider two types of adversarial attacks. The first is a targeted attack, where the objective function is given by the negative cross-entropy of the target label. That is, if $t$ is the target label, then

$$ J(x) = \log p_t(x). \tag{10} $$

In terms of the probability simplex coordinates, the optimal solution is when all the probability mass has been placed on the target class, i.e. $p_i(x) = \delta_{it}$. The target class could be chosen randomly, or it could be chosen to induce a particularly costly error. As an example in the cost-sensitive setting, an effective attack would be one which tricked the classifier into thinking that $x$ belonged to the insensitive class when in fact it belonged to a sensitive one.
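To make the attack loop concrete, here is a small self-contained sketch of a targeted PGD attack. It is not the authors' code: the network is replaced by a toy linear-softmax model so that the gradient can be written by hand, the attack set is assumed to be an l-infinity ball with a sign-gradient step in the style of the PGD method cited above, and all weights, the clean input, and the values of `eps`, `alpha`, and `steps` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(W, b, x):
    """Class probabilities of a toy linear-softmax model (stand-in for the network)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def targeted_gradient(W, b, x, target):
    """Gradient with respect to x of J(x) = log p_target(x) for the linear model."""
    p = class_probs(W, b, x)
    coeff = [(1.0 if k == target else 0.0) - p[k] for k in range(len(W))]
    return [sum(coeff[k] * W[k][d] for k in range(len(W))) for d in range(len(x))]


def sign(v):
    """Sign of v as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def pgd_targeted(W, b, x_clean, target, eps, alpha, steps, rng):
    """Maximize log p_target over the l-infinity ball of radius eps around x_clean."""
    x = [c + rng.uniform(-eps, eps) for c in x_clean]  # random start inside the ball
    for _ in range(steps):
        grad = targeted_gradient(W, b, x, target)
        x = [xi + alpha * sign(g) for xi, g in zip(x, grad)]  # sign-gradient ascent step
        x = [min(max(xi, c - eps), c + eps) for xi, c in zip(x, x_clean)]  # project back
    return x


# Toy 3-class, 2-feature model and a clean input classified as class 0.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x_clean = [1.0, 0.0]
x_adv = pgd_targeted(W, b, x_clean, target=1, eps=1.0, alpha=0.1, steps=50,
                     rng=random.Random(0))
p_clean = class_probs(W, b, x_clean)
p_adv = class_probs(W, b, x_adv)
```

In this toy problem the perturbed input stays inside the ball while the model's most likely class switches from 0 to the target class 1. For a real network the hand-written gradient would be replaced by automatic differentiation.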

3.2 Maximum minimum expected cost attacks

If the goal of the attacker is to increase the costs of the defender’s mistakes, it is natural to consider an attack which is designed to explicitly increase the expected cost. Therefore, we introduce the Maximum Minimum Expected Cost Attack (or maxi-min attack for short):

$$ J(x) = \min_i L_i(x), \tag{11} $$

where $L_i(x)$ is the expected cost of predicting class $i$ (Eq. 1). Unlike the targeted attack, the maxi-min attack does not depend on the true class. Thus, the maxi-min attack always aims to modify the input so that the point $p(x)$ in the probability simplex moves to the point of maximal $\min_i L_i$, which by symmetry can be seen to be the intersection point where the costs are identical for all class predictions $i$. In particular, for the example of Fig. 2, this is the point where all 3 cell boundaries intersect. Because this attack aims to bring $p(x)$ to an interior point in the simplex, as opposed to a vertex, it will not be as effective as a targeted attack with cost-sensitive targets, assuming that the optimization problem associated with both attacks can be fully solved. For example, for the cost matrix considered in Fig. 2, the expected cost at the intersection of all three cell boundaries is , whereas a cost of could be achieved if the prediction was 1 and the true class was 3. However, the optimization problem defining adversarial attacks is rarely able to be solved exactly, and thus there could well be instances where the maxi-min attack is more effective. Indeed, we shall find this to be the case in what follows.
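The location of the maxi-min target point can be found by brute force for a small problem. The sketch below is illustrative only: the 3-class cost matrix is hypothetical (it is not the one used in Fig. 2), classes are indexed from 0, and the search runs over a regular grid on the simplex rather than over network inputs.

```python
def min_expected_cost(cost, probs):
    """Maxi-min objective: the smallest expected cost over all possible predictions."""
    return min(sum(c * p for c, p in zip(row, probs)) for row in cost)


def maximin_point(cost, grid):
    """Grid search over the 3-class simplex for the point maximizing the objective."""
    best_value, best_point = -1.0, None
    for a in range(grid + 1):
        for b in range(grid + 1 - a):
            point = (a / grid, b / grid, (grid - a - b) / grid)
            value = min_expected_cost(cost, point)
            if value > best_value:
                best_value, best_point = value, point
    return best_point, best_value


# Hypothetical cost matrix: COST[i][j] is the cost of predicting i when j is correct.
COST = [[0.0, 1.0, 10.0], [1.0, 0.0, 10.0], [1.0, 1.0, 0.0]]
point, value = maximin_point(COST, grid=210)
largest_single_cost = max(max(row) for row in COST)
```

For this matrix the search returns the interior point (10/21, 10/21, 1/21), where all three expected costs equal 20/21. That is far below the largest single entry of 10, which a perfectly successful targeted attack could in principle induce, consistent with the comparison made above.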

As far as we are aware, we are the first to consider adversarial attacks designed to directly maximize the cost. Recently, Zhang and Evans zhang2018cost considered a cost-sensitive extension of Wong and Kolter’s approach towards developing provably robust classifiers wong2017provable . In the Zhang and Evans extension, robustness is defined with respect to cost, as opposed to the overall misclassification error. Our work is complementary to theirs as we consider attacks designed to explicitly increase the cost.

4 Attack comparison

In this section we detail the numerical experiments used to compare the efficacy of the 3 different types of attacks considered here.

4.1 Experimental set-up

We considered the task of image classification on the ImageNet dataset ILSVRC15. Our motivating interest is near-term scenarios in which an imperfect but high-performance image classification system is employed in a safety-critical application. Given the amount of attention adversarial examples have received, it seems plausible that many organizations will be cognizant of the threat posed by adversarial examples, and will therefore choose to employ models with some level of resistance. For simple enough problems one can obtain provable guarantees regarding robustness (see for example wong2017provable ; raghunathan2018certified and references therein), but these methods do not currently scale for modern image classifiers trained on high-resolution images. (Footnote 3: As this work was nearing completion, progress on this problem was made in cohen2019certified.) Thus, we shall focus on problems for which the vulnerability to adversarial examples can only be mitigated, not fully eliminated or bounded.

We consider attacking networks which have been adversarially trained goodfellow2014explaining ; kurakin2016adversarial ; kannan2018adversarial ; madry2017towards , so that they are somewhat resistant to adversarial attacks. In particular, we used pre-trained models released as part of the recent work xie2018feature . Three such pre-trained models were released: ResNeXt-101, ResNet-152 Denoise, and ResNet-152 Baseline. These models obtain between 62-68% top-1 accuracy on clean images, and 52-57% accuracy on adversarially perturbed images with random targets (we specify the attack details below). All three models were trained on adversarial examples, and the first two also incorporate a novel form of feature de-noising to enhance their resistance to adversarial examples.

A simple but crucial point is that a cost matrix is required in order to implement cost-sensitive predictions. The cost matrix encapsulates the costs associated with different types of mistakes, but these may be hard to quantify in certain applications. To return to the example of identifying malignant tumors, clearly false positives are less costly mistakes than false negatives, but are false negatives 10x worse, 100x worse, or 1000x worse? These valuations must be made for each application, and could involve a rich set of considerations which we shall not get into here. Instead, we simply consider an arbitrary cost matrix with values chosen to be plausible. In particular, we let there be insensitive classes and sensitive classes, with the costs taken to be (Footnote 4: We note that we randomly permuted the ImageNet labels in order to avoid grouping together similar classes in the insensitive/sensitive super-categories.)


Although these values were mostly chosen arbitrarily, they were picked so that the effect of being cost-sensitive would be non-trivial. For example, as the cost $c_{IS}$ of mis-labeling a sensitive class as insensitive grows without bound, with the other values held constant, a cost-sensitive classifier will always err on the side of caution and predict the sensitive class. Similarly, if the differences in cost are very slight, then a cost-sensitive classifier will mostly make predictions according to the most likely class. These values were chosen to avoid either extreme. An additional complication is that an adversary may not know (or may only partially know) the cost matrix used by the defender network. Thus, in cost-sensitive adversarial examples the cost-matrix becomes part of the white-box/black-box characterization of the problem. In this work, we assume that the cost matrix is known to the attacker.

4.2 Experimental results

We generated adversarial attacks using the ResNeXt-101 pretrained model of Ref. xie2018feature , and evaluated the attacks against each of the 3 pretrained models. The attack is a white-box attack when the defending network is the same ResNeXt-101 model used to generate the attacks, and it is a black-box attack when the defending network is either of the ResNet-152 models. We considered 3 types of attacks: targeted with random targets, targeted with cost-sensitive targets, and the maxi-min attack introduced in Sec. 3. We used the same attack parameters as in xie2018feature , and used PGD to generate attacks for numbers of steps. The attacks are constrained to lie in an ball with , and the step-size was taken to be (except for the case , in which case we set ). Furthermore, each attack was randomly initialized in the ball.

In Table 1 we present the results for white-box attacks generated using the ResNeXt-101 model. The attack details are as follows. The number of PGD iterations was taken to be , and the results in this table were computed by averaging over 50,000 distinct attacks, one for each of the images in the ImageNet validation set. Both the accuracy and average cost are evaluated for the two prediction methods discussed above, maximum probability and minimum cost. The column abbreviations are: MP Acc, maximum probability prediction accuracy; MP Cost, maximum probability prediction average cost; MC Acc, minimum cost prediction accuracy; MC Cost, minimum cost prediction average cost. The ± values indicate the 95% confidence intervals, which were computed by assuming that the means are normally distributed.
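A confidence interval of this kind can be computed as follows. This is a minimal sketch under the usual normal approximation for a sample mean; the use of the sample standard deviation and the factor 1.96 are standard choices assumed here, not details confirmed by the text, and the input values are made up.

```python
import math


def mean_with_ci95(values):
    """Return the sample mean and the half-width of a normal-approximation 95% CI."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # unbiased sample variance
    half_width = 1.96 * math.sqrt(variance / n)                # 1.96 standard errors
    return mean, half_width


# Hypothetical per-image outcomes (1 = correct, 0 = incorrect).
outcomes = [0, 1, 0, 1]
accuracy, half_width = mean_with_ci95(outcomes)
```

The same function applies to per-image costs: the reported value would then be the average cost plus or minus the returned half-width.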

Attack Type MP Acc. (%) MP Cost MC Acc (%) MC Cost
clean images
random targets
max cost targets
maxi-min cost
Table 1: White-box attacks

There are a number of interesting observations to make. First, it is unsurprising that the accuracy is similar for both types of targeted attacks when the defending network makes maximum probability predictions, since in this case the cost-sensitive targeted attacks represent a fairly large subset of random targeted attacks. However, it is surprising that the cost-sensitive targeted attacks do such a poor job of increasing the cost for both types of predictions. This illustrates that for adversarially-resistant networks such as those of xie2018feature , targeted attacks are a poor way to increase the cost. The maxi-min cost attack outperforms all others when it comes to increasing the cost, although it unsurprisingly leads to fewer overall errors. The increase in cost is quite dramatic for a defending network making maximum probability predictions, and although the effect is less significant for minimum cost predictions, it still far outperforms either targeted attack.

We present additional results for black-box attacks and variable attack strength in Appendix A. The black-box attacks performed similarly to the white-box attacks, although they were (predictably) slightly less effective overall. Increasing the number of PGD steps significantly improved the performance of the attacks.

5 Calibration

In many machine learning applications, the only output of a classifier that is used is the class prediction. However, there are many scenarios in which the probability estimates $p_i(x)$ are also used. Cost-sensitive learning is one such example, as the minimum cost prediction, Eq. 2, depends upon $p_i(x)$. A perfect classifier would place all the probability mass on the correct label, i.e. $p_i(x) = 1$ for the correct class and $0$ otherwise, and the minimum cost prediction would be the correct class. (Footnote 5: Recall that we are assuming that the cost matrix satisfies $C_{ij} \ge 0$, with equality if and only if $i = j$.) For imperfect classifiers, a desirable property of the probability estimates is that they be calibrated niculescu2005predicting. A classifier is said to be calibrated if the prediction accuracy agrees with the probability estimates. For example, whenever a calibrated classifier makes a prediction of class $i$ for an input $x$ with $p_i(x) = 0.9$, it will be correct on average 90% of the time. As a result, the probability estimates of calibrated classifiers may be interpreted as confidences.

Both the minimum cost prediction, Eq. 2, and the maximum minimum expected cost attack, Eq. 11, depend directly on the probability estimates $p_i(x)$, and so it is natural to wonder if calibration might significantly affect the results, for example by making the minimum cost predictions more robust, or the maximum minimum expected cost attack more effective. Both the attacker and the defender may separately elect to calibrate, leading to a total of four possible scenarios. The scenario where neither party calibrates was treated in the previous section, and in Appendix C we present results for the remaining scenarios (defender calibrates, attacker calibrates, and both calibrate). We also provide details on the temperature-scaling calibration method used in Appendix B.

6 Game theoretic analysis

In the above sections and in the appendices we have considered a total of 6 different attacks (targeted with random targets, targeted with cost-sensitive targets, and the maxi-min attack, each of which can be either generated using a calibrated or an uncalibrated network), as well as 4 types of predictions (maximum probability or minimum cost, each of which may be made using a calibrated or an uncalibrated network). A convenient framework for analyzing the resulting 24 possible scenarios is game theory.

The attacker/defender set-up considered here may be formulated as a finite zero-sum two-player game. The pay-off of the attacker is the average cost, and the defender’s pay-off is the negative average cost. The pay-off matrix for this game, shown in Table 2, may be obtained using the uncalibrated results of Table 1, together with the calibrated results presented in Tables 4, 5, and 6 in Appendix C. Here, MP stands for "maximum probability", MC for "minimum cost", TR for "targeted with random targets", TC for "targeted with cost-sensitive targets", and MM for "maxi-min", while "uncal." and "cal." indicate whether the corresponding network is uncalibrated or calibrated. Notice that the first two rows are identical: the temperature scaling calibration method used does not affect the maximum probability prediction, and therefore it also does not affect the average misclassification costs.

                                     Attacker
Defender      TR uncal.   TR cal.   TC uncal.   TC cal.   MM uncal.   MM cal.
MP uncal.       8.44       8.43       8.45       8.45      13.94      13.97
MP cal.         8.44       8.43       8.45       8.45      13.94      13.97
MC uncal.       2.94       2.95       2.98       3.00       3.50       4.16
MC cal.         3.21       3.22       3.25       3.25       3.38       3.39
Table 2: Attacker’s pay-off matrix

For this simple game, there is a single pure strategy Nash equilibrium (the bottom-right entry, 3.39, of Table 2), which is that the defender makes calibrated minimum cost predictions, and the attacker makes calibrated maxi-min attacks. Note that the calibrated maxi-min attack is a dominant strategy for the attacker, but the calibrated minimum cost prediction is not dominant for the defender.
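The equilibrium and dominance statements can be checked mechanically from the pay-off matrix. In the sketch below the numbers are copied from Table 2; the ordering of the labels (uncalibrated before calibrated, with MP, MC, TR, TC, and MM as shorthand for the predictions and attacks) is an assumed reading of the table that is consistent with the equilibrium described above, and the function names are illustrative.

```python
# Attacker's pay-off (average cost). Rows are defender strategies, columns are
# attacker strategies. The label ordering is an assumed reading of Table 2.
DEFENDER = ["MP uncal.", "MP cal.", "MC uncal.", "MC cal."]
ATTACKER = ["TR uncal.", "TR cal.", "TC uncal.", "TC cal.", "MM uncal.", "MM cal."]
PAYOFF = [
    [8.44, 8.43, 8.45, 8.45, 13.94, 13.97],
    [8.44, 8.43, 8.45, 8.45, 13.94, 13.97],
    [2.94, 2.95, 2.98, 3.00, 3.50, 4.16],
    [3.21, 3.22, 3.25, 3.25, 3.38, 3.39],
]


def pure_nash_equilibria(payoff):
    """Saddle points: the entry is the row maximum (attacker) and column minimum (defender)."""
    equilibria = []
    for r, row in enumerate(payoff):
        for c, value in enumerate(row):
            column = [other[c] for other in payoff]
            if value == max(row) and value == min(column):
                equilibria.append((r, c))
    return equilibria


def dominant_columns(payoff):
    """Attacker strategies that are a best response to every defender strategy."""
    return [c for c in range(len(payoff[0])) if all(row[c] == max(row) for row in payoff)]


def dominant_rows(payoff):
    """Defender strategies that are a best response to every attacker strategy."""
    n_cols = len(payoff[0])
    return [r for r, row in enumerate(payoff)
            if all(row[c] == min(other[c] for other in payoff) for c in range(n_cols))]


equilibria = [(DEFENDER[r], ATTACKER[c]) for r, c in pure_nash_equilibria(PAYOFF)]
attacker_dominant = [ATTACKER[c] for c in dominant_columns(PAYOFF)]
defender_dominant = [DEFENDER[r] for r in dominant_rows(PAYOFF)]
```

On these numbers the only saddle point pairs the calibrated minimum cost prediction with the calibrated maxi-min attack, the calibrated maxi-min attack is the only dominant attacker strategy, and the defender has no dominant strategy. A mixed-strategy equilibrium, if one were needed, would require solving a small linear program instead.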

The result of this simple game theory analysis is that, in terms of the average cost, both parties should calibrate, minimum cost predictions are better than maximum probability ones, and the best attack is the maxi-min attack. These conclusions may well change with the many factors that went into this analysis: the cost matrix, the underlying classification problem, the strength of the attacks (measured in terms of the number of PGD steps and the size of the attack set), etc. However, this overall framework for comparing strategies should be generally applicable. It is possible that in more complicated scenarios the Nash equilibrium will be a mixed strategy, as opposed to the pure strategy found here.

7 Conclusions and future directions

Safety-critical systems are not likely to operate by simply selecting the most likely outcomes; they will need to consider the cost of those outcomes and determine the probability thresholds for their predictions accordingly. At the same time, attacks on these cost-sensitive models are particularly important to study because of the critical nature of these systems. We demonstrated several white-box and black-box attacks on cost-sensitive classifiers built from state-of-the-art adversarially-resistant ResNet image classifiers. These classifiers were made resistant by training them on targeted adversarial examples, and we find that they are still vulnerable to attacks designed to increase the expected cost.

While our experimental results were generated for image classification systems, our general framework should apply more broadly to any classification problem. Cost-sensitive classifiers and attacks thereon can easily be envisioned for text analysis (e.g. be sure not to miss terrorist sentiments) or industrial plant operation (e.g. be sure not to miss irregular signals and alerts that lead to accidents). In fact, most applications are not indifferent between different types of misclassifications, making cost-sensitivity broadly applicable. When those applications are safety-critical, an analysis of the efficacy of attacks and defenses should be carried out.

Lastly, we conclude with some directions for future work. Much of this work implicitly assumes that both parties (the defender and the attacker) know the cost matrix. In practice, it may be hard to convert an implicit value system based on possibly vague and loosely-shared principles into an explicit numerical matrix. Even when such a task is achievable, there are many scenarios where the attacker would not be expected to have access to this information. Thus, one area of future work involves studying the effect of imperfect knowledge of the cost-matrix for the attacker, and whether the attacker can learn to infer the cost-matrix by observing the classifier predictions (and in turn using this information to construct better attacks). It would also be interesting to study the effect of a noisy cost-matrix, perhaps reflecting the challenges faced by the defender in encoding a value system into a cost matrix.

A second line of work would be to go beyond the pre-trained models of xie2018feature , and to consider other forms of adversarially-resistant models, especially ones for which analytic bounds could be obtained. In particular, it would be very interesting to apply cost-sensitivity to certifiable adversarial robustness cohen2019certified , for which rigorous analytic results are possible. Lastly, it would also be interesting to extend beyond norm-based attacks, and consider more comprehensive attack sets gilmer2018motivating .


We would like to thank our colleagues at RAND with whom we had many fruitful discussions: Jair Aguirre, Caolionn O’Connell, Edward Geist, Justin Grana, Christian Johnson, Osonde Osoba, Éder Sousa, Brian Vegetabile and Li Ang Zhang. This work was funded by RAND Project Air Force, contract number FA7014-16-D-1000.


Appendix A Additional results for uncalibrated attacks and predictions

In Sec. 4.2 we presented results for white-box attacks, with neither party calibrating. These attacks were both generated by and submitted to the ResNeXt-101 model of xie2018feature. Results for black-box attacks may be obtained by submitting these same attacks to the other two adversarially-robust models released by xie2018feature, the ResNet-152 Denoise and the ResNet-152 Baseline models. These are shown in Table 3, again for the same number of PGD iterations as in Table 1. These results are qualitatively similar to the white-box results, which demonstrates the transferability of adversarial attacks aimed at increasing the cost as well as the overall classification error.

Attack Type MP Acc. (%) MP Cost MC Acc (%) MC Cost
ResNet-152 Denoise
clean images
random targets
max cost targets
maxi-min cost
ResNet-152 Baseline
clean images
random targets
max cost targets
maxi-min cost
Table 3: Black-box adversarial attacks

In addition to studying the transferability of attacks, we also investigated the dependence of the white-box attack efficacy on the number of PGD steps. To this end, we generated 10,000 attacks with the number of steps ranging from 10 to 1000. The results are plotted below in Fig. 3, which shows the cost and accuracy for both types of predictions (maximum probability (MP) and minimum cost (MC)). The plots indicate that in many cases increasing the number of steps to about 100 or 200 significantly improves the efficacy of the attack. In particular, larger numbers of steps allow the targeted attacks with cost-sensitive targets to outperform the attacks with random targets in all cases. Additionally, with more steps the efficacy of the maxi-min attack decreases relative to the other attacks against a minimum cost classifier, as shown in the bottom-right figure. Against a maximum probability classifier, the maxi-min attack is far more effective at increasing the cost, as shown in the bottom-left figure.

Figure 3: The accuracy and cost as a function of the number of PGD steps for the 3 different attacks considered here, and for both types of predictions: maximum probability (MP) and minimum cost (MC). Adversarial examples were generated with up to 1000 PGD steps, with the output saved at intermediate values. Each curve represents an average over 10,000 adversarial examples, each for a different unperturbed “clean” image, and 95% confidence intervals have been added around the mean. As is especially evident in the bottom-right plot, even with 10,000 images the confidence intervals are still quite large. Our analysis would benefit from larger sample sizes, which are unfortunately not practical given our computational resources and the time required to generate attacks with large numbers of steps.

Appendix B Temperature scaling calibration

The calibration of neural networks was originally studied in niculescu2005predicting . The issue was recently revisited for more modern architectures in guo2017calibration , and we shall adopt their methodology.

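The extent to which a classifier is well-calibrated may be measured by the Expected Calibration Error (ECE) naeini2015obtaining , which is defined as

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right| . \tag{13}$$

Here, $B_m$ with $m = 1, \dots, M$ represents a binning of predictions and $n$ is the total number of samples. Predictions are grouped into bin $B_m$ if their confidence (i.e. probability estimate $\hat{p}$) lies within the interval $\left( \frac{m-1}{M}, \frac{m}{M} \right]$. Within each bin, the overall accuracy $\mathrm{acc}(B_m)$ and average confidence $\mathrm{conf}(B_m)$ are computed. An ECE of 0 indicates that the classifier is perfectly calibrated.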
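The binned computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the code used for the experiments: the function name `expected_calibration_error`, the default of 15 equal-width bins, and the toy inputs are all assumptions made for the example.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Bin-weighted average of |accuracy - confidence| over equal-width bins.

    confidences: top-class probability of each prediction, in (0, 1].
    correct:     1 where the prediction was right, 0 otherwise.
    n_bins:      number of bins (an assumed default, not from the paper).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = confidences.size
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi], so each confidence falls in exactly one bin.
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue  # empty bins contribute nothing
        accuracy = correct[in_bin].mean()
        avg_confidence = confidences[in_bin].mean()
        ece += (in_bin.sum() / n) * abs(accuracy - avg_confidence)
    return ece

# Toy data (hypothetical): four predictions at confidence 0.9, two at 0.6.
toy_conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
toy_correct = [1, 1, 1, 0, 1, 0]
toy_ece = expected_calibration_error(toy_conf, toy_correct, n_bins=10)
```

In the toy data the 0.9 bin has accuracy 0.75 and the 0.6 bin has accuracy 0.5, so the result is the weighted sum of the two gaps, (4/6)(0.15) + (2/6)(0.10).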

There are many techniques for calibrating a classifier. Perhaps the simplest is temperature scaling, in which the softmax operation relating the logits $z_i$ to probabilities $p_i$ is modified via a temperature term $T$ as follows:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} .$$

For $T = 1$, this reduces to the usual softmax operation. For $T > 1$, the probabilities are squeezed to become closer to one another, and for $T < 1$ the probabilities are pushed apart so that there is a wider disparity between them. The extreme limit of $T \to \infty$ corresponds to a uniform distribution, and the limit $T \to 0$ places all probability mass on the most probable label. An important property of temperature scaling is that it preserves the ordering of the probabilities. For example, temperature scaling cannot change the sign of the relative log probabilities, since $\log(p_i / p_j) = (z_i - z_j)/T$ and $T > 0$. Temperature scaling may be used to calibrate a classifier by using a separate validation set to find the optimal temperature which minimizes the ECE, and then using this temperature to calibrate the probability estimates on the test set data.
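The qualitative behaviour of temperature scaling is easy to check numerically. The sketch below assumes a helper named `softmax_with_temperature` and an arbitrary three-class logit vector; both are illustrative choices, not taken from the experiments.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; T = 1 recovers the standard softmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, -1.0])  # hypothetical logits

p_standard = softmax_with_temperature(logits, T=1.0)
p_squeezed = softmax_with_temperature(logits, T=5.0)   # T > 1: probabilities move together
p_spread = softmax_with_temperature(logits, T=0.2)     # T < 1: probabilities move apart
p_hot = softmax_with_temperature(logits, T=1e6)        # approaches the uniform distribution
p_cold = softmax_with_temperature(logits, T=1e-3)      # approaches a one-hot distribution
```

In every case the ranking of the classes is unchanged, which is why temperature scaling alone cannot alter a maximum probability prediction.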

Both the minimum cost prediction, Eq. 2, and the maximum minimum expected cost attack, Eq. 11, depend directly on the probability estimates, and so it is natural to wonder if calibration might significantly affect the results, for example by making the minimum cost predictions more robust, or the maximum minimum expected cost attack more effective. We investigated this issue for the white-box attacks in which both the attacking and defending networks were the pre-trained ResNeXt-101 model of xie2018feature . First, we evaluated the calibration of the ResNeXt-101 model, using 5000 images, representing 10% of the full validation set. The ECE was found to be 0.055, representing a fairly well-calibrated classifier. To gain a better sense for the calibration, in Fig. 4 below we plot the so-called reliability diagram guo2017calibration showing accuracy vs. confidence.

Figure 4: Reliability diagram depicting the calibration of the ResNeXt-101 network when evaluated on the first 5000 images of the ImageNet validation set (representing 10% of the full validation set). The gap represents the quantity within the absolute value sign in Eq. 13. The Expected Calibration Error (ECE) is 0.055, corresponding to a reasonably well-calibrated classifier.

The above reliability diagram and ECE value of 0.055 used the standard softmax operation, i.e. $T = 1$. Allowing $T$ to vary, an optimal value of ECE was found at the calibration temperature .
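One simple way to carry out such a search is a grid search over candidate temperatures, keeping the one with the lowest ECE on held-out data. The sketch below is only an assumption about how this could be done: the helper names, the candidate grid, the number of bins, and the synthetic logits are all hypothetical, and the temperature it returns has no connection to the value found for ResNeXt-101.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Row-wise softmax of logits / T."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE of the top-class predictions, using equal-width confidence bins."""
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's share of samples
    return ece

def fit_temperature(logits, labels, grid):
    """Return the temperature in `grid` with the smallest ECE (grid search)."""
    eces = [expected_calibration_error(softmax_with_temperature(logits, T), labels)
            for T in grid]
    return float(grid[int(np.argmin(eces))])

# Synthetic stand-in for a held-out calibration set (hypothetical data):
# random logits with a boost on the true class, then scaled up to mimic
# an over-confident network.
rng = np.random.default_rng(0)
n, k = 2000, 5
labels = rng.integers(0, k, size=n)
logits = rng.normal(size=(n, k))
logits[np.arange(n), labels] += 1.0
logits *= 3.0

grid = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0])  # assumed candidate temperatures
T_cal = fit_temperature(logits, labels, grid)
ece_before = expected_calibration_error(softmax_with_temperature(logits, 1.0), labels)
ece_after = expected_calibration_error(softmax_with_temperature(logits, T_cal), labels)
```

Because the grid contains $T = 1$, the selected temperature can never give a larger ECE on the calibration data than the uncalibrated softmax. A finer grid or a one-dimensional optimizer could replace the coarse grid used here.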

Appendix C Calibration scenarios

The calibration temperature found above could be used by the defender, the attacker, or both. The defender would be motivated to use calibrated probabilities so that their minimum cost predictions would be (hopefully) more accurate, and similarly the attacker would be motivated to use calibrated probabilities to generate more effective attacks. Thus, in the tables below we show results for the case where the defender calibrates but the attacker does not (Table 4), the case where the defender does not calibrate but the attacker does (Table 5), and the case in which both defender and attacker calibrate (Table 6). The case in which neither party calibrates is covered above in Table 1. In all cases, the same calibration temperature was used, and the results in the tables correspond to an average over the 45,000 validation images not used in the calibration step.

Attack Type MP Acc. (%) MP Cost MC Acc. (%) MC Cost
clean images
random targets
max cost targets
maxi-min cost
Table 4: Defender calibrates (white-box attack)

Attack Type MP Acc. (%) MP Cost MC Acc. (%) MC Cost
clean images
random targets
max cost targets
maxi-min cost
Table 5: Attacker calibrates (white-box attack)

Attack Type MP Acc. (%) MP Cost MC Acc. (%) MC Cost
clean images
random targets
max cost targets
maxi-min cost
Table 6: Both attacker and defender calibrate (white-box attack)

In discussing the results, let us first draw attention to the impact of calibration on the clean images. The maximum probability statistics are unaffected, which is to be expected since the temperature scaling method of calibration used here cannot change the maximum probability prediction. (The astute reader will have noticed that there are in fact slight differences between the MP results for clean uncalibrated and calibrated images. These are due to the fact that the averages computed in this section are over 45,000 images, as opposed to the 50,000 used in the previous section.) For the minimum cost predictions, the accuracy drops a non-trivial amount (from 61.2% to 57.5%) and the cost decreases slightly.
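A small numerical example shows how temperature scaling can leave the maximum probability prediction untouched while changing the minimum cost prediction. Everything in this sketch is hypothetical: the three-class logits, the cost matrix, and the temperature $T = 2$ are chosen for illustration, and the convention that `cost[i, j]` is the cost of predicting class `j` when the true class is `i` is an assumption about the form of Eq. 2.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def max_probability_prediction(probs):
    """MP prediction: the most likely class."""
    return int(np.argmax(probs))

def min_cost_prediction(probs, cost):
    """MC prediction: the class with the smallest expected cost.

    Assumes cost[i, j] is the cost of predicting j when the truth is i,
    so the expected cost of predicting j is sum_i probs[i] * cost[i, j].
    """
    expected_cost = probs @ cost
    return int(np.argmin(expected_cost))

# Hypothetical cost matrix: wrongly predicting class 0 is expensive.
cost = np.array([[0.0, 1.0, 1.0],
                 [5.0, 0.0, 1.0],
                 [5.0, 1.0, 0.0]])
logits = np.array([3.0, 0.5, 0.0])  # hypothetical logits

p_raw = softmax_with_temperature(logits, T=1.0)   # uncalibrated probabilities
p_soft = softmax_with_temperature(logits, T=2.0)  # softened (illustrative T, not the fitted one)

mp_raw, mp_soft = max_probability_prediction(p_raw), max_probability_prediction(p_soft)
mc_raw, mc_soft = min_cost_prediction(p_raw, cost), min_cost_prediction(p_soft, cost)
```

Here both MP predictions are class 0, but softening the probabilities moves enough mass onto classes 1 and 2 that predicting class 0 is no longer the cheapest option, so the MC prediction switches from class 0 to class 1.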

Moving next to consider the effect of calibration on the efficacy of the attacks, the results show that calibration (of either party) has a significant impact on the minimum cost predictions, but not on the maximum probability ones. In discussing the results, we will take the perspective of the defender, and assume that the attacker is held fixed. Consider first the case of an uncalibrated attacker. The results show that the two types of targeted attacks are much more effective against a calibrated minimum cost defender than an uncalibrated one. The accuracy decreases (from about 41% to about 34%) and the cost increases (from about 3 to about 3.2). Interestingly, the trend is reversed for the maxi-min attack. This attack is more effective against an uncalibrated minimum cost classifier (3.50 compared to 3.38 for a calibrated one). Thus, whether the defender should calibrate or not depends on the attack type.

Consider next the case in which the attacker calibrates. Once again, the maximum probability statistics are only very weakly affected by the defender’s decision to calibrate. For the minimum cost predictions, it is again the case that the targeted attacks are more effective against a calibrated defender, whereas the maxi-min attack is rendered less effective by calibration. Here the distinction is even more pronounced than before. The cost for an uncalibrated minimum cost classifier is 4.16, and drops to 3.39 after calibration.

To summarize, calibration is important for minimum cost classifiers. A defender can reduce their vulnerability to a maxi-min attack designed to increase the expected cost by calibrating, and similarly an attacker can increase the effectiveness of the maxi-min attack against a minimum cost defender by calibrating. Against targeted attacks, however, calibration can decrease the defender’s performance. In Sec. 6 we use game theory to conduct a more systematic analysis of the various strategies available to both the attacker and defender.