The effectiveness of feature attribution methods and its correlation with automatic evaluation scores

Explaining the decisions of an Artificial Intelligence (AI) model is increasingly critical in many real-world, high-stakes applications. Hundreds of papers have either proposed new feature attribution methods or discussed and harnessed these tools in their work. However, despite humans being the target end-users, most attribution methods were only evaluated on proxy automatic-evaluation metrics. In this paper, we conduct the first large-scale user study on 320 lay and 11 expert users to shed light on the effectiveness of state-of-the-art attribution methods in assisting humans in ImageNet classification and Stanford Dogs fine-grained classification, and on these two tasks when the input image contains adversarial perturbations. Overall, we found that feature attribution is surprisingly not more effective than showing humans nearest training-set examples. On the hard task of fine-grained dog categorization, presenting attribution maps to humans does not help, but instead hurts, the performance of human-AI teams compared to AI alone. Importantly, we found automatic attribution-map evaluation measures to correlate poorly with the actual human-AI team performance. Our findings encourage the community to rigorously test their methods on downstream human-in-the-loop applications and to rethink the existing evaluation metrics.


1 Introduction

Why did a computer vision system suspect that a person had breast cancer Wu et al. (2019), or that someone was a US Capitol rioter fbi or a shoplifter law ? The explanations for such high-stakes predictions made by existing Artificial Intelligence (AI) agents can impact human lives in various ways, from social Doshi-Velez & Kim (2017), to scientific Nguyen et al. (2015), and legal law ; Goodman & Flaxman (2017); Doshi-Velez et al. (2017). A common medium for explaining an image classifier's decisions is an attribution map (AM) Bansal et al. (2020b), i.e., a heatmap that highlights the input pixels that are important for or against a predicted label. Attribution maps, a.k.a. "saliency maps", can be useful in localizing malignant tumors in x-ray images Rajpurkar et al. (2017) or detecting biases in image classifiers Lapuschkin et al. (2016).

Since 2013 Simonyan et al. (2013), hundreds of research papers have either used attribution methods or proposed new ones Covert et al. (2020); Das & Rad (2020). Yet, it remains largely unknown how effective state-of-the-art AMs are in improving the performance of human-AI teams on computer vision tasks. Given that humans are the target users of explanations, answering this question is critical for the community to produce useful methods. However, most attribution methods were only evaluated on proxy automatic-evaluation metrics such as the pointing game Zhang et al. (2018), weakly-supervised localization Zhou et al. (2016), or deletion Petsiuk et al. (2018), which may not necessarily correlate with human-AI team performance on a downstream task.

In this paper, we conducted the first large-scale user study to shed light on the effectiveness of AMs in assisting humans in single-label image classification, the task that most attribution methods were designed for. We asked 320 lay and 11 expert users to decide whether machine decisions are correct after observing an input image, the top-1 classification output, and explanations (Fig. 1). We ran experiments on both real and adversarial images Szegedy et al. (2013) and on both a coarse 1000-class and a fine-grained 120-class image classification task, i.e., ImageNet Russakovsky et al. (2015) and Stanford Dogs Khosla et al. (2011), respectively. Our main findings include:

  1. AMs are, surprisingly, not more effective than nearest-neighbors (here, 3-NN) in improving human-AI team performance on both ImageNet and fine-grained dog classification (Sec. 3.1).

  2. On fine-grained dog classification, a harder task for a human-AI team than 1000-class ImageNet, presenting AMs to humans interestingly does not help but instead hurts the performance of human-AI teams, compared to AI alone without humans (Fig. 2b).

  3. On adversarial ImageNet images, 3-NN is more effective than all tested attribution methods in helping humans to reject these incorrect AI predictions (Sec. 3.1).

  4. On adversarial Stanford Dogs images, which are very hard for humans to label, presenting only confidence scores to humans is more effective than all other explanations tested, including AMs and 3-NN examples (Sec. 3.1).

  5. Despite being popularly used in the literature, common automatic evaluation metrics—here, Pointing Game Zhang et al. (2018), localization errors Zhou et al. (2016), and IoU Zhou et al. (2016)—correlate poorly with the actual human-AI team performance (Sec. 3.4).

  6. According to a study on 11 AI-expert users, 3-NN is significantly more useful than GradCAM Selvaraju et al. (2017), a state-of-the-art attribution method for ImageNet classification (Sec. 3.5).

Figure 1: Given an input image, its top-1 predicted label (here, “lorikeet”) and confidence score (A), we asked the user to decide Yes or No whether the predicted label is accurate (here, the correct answer is No). The accuracy of users in this case is the performance of the human-AI team without visual explanations. We also compared this baseline with the treatments where one attribution map (B) or a set of three nearest neighbors (C) is also provided to the user (in addition to the confidence score).

2 Methods

2.1 Image classification tasks

To evaluate human-AI team performance on image classification, for each image, we also presented to a human the following items: (1) the AI's top-1 predicted label, (2) its confidence score, and, optionally, (3) a visual explanation generated in an attempt to explain the predicted label. The user was then asked to decide whether the AI's predicted label is correct or not (see Fig. 1). We performed our experiments on two image datasets, ImageNet and Stanford Dogs, which vary in difficulty.

ImageNet

We tested human-AI teams on ImageNet Russakovsky et al. (2015), the image classification task that most attribution methods were tested on or designed for Zhou et al. (2016); Selvaraju et al. (2017); Agarwal & Nguyen (2020); Petsiuk et al. (2018). ImageNet is a 1.3M-image dataset spanning 1000 diverse categories from natural to man-made entities Russakovsky et al. (2015).

Stanford Dogs

To test whether our findings generalize to a fine-grained classification task that is harder for human-AI teams than ImageNet, we repeated the experiments on the 120-class Stanford Dogs Khosla et al. (2011) dataset, a subset of ImageNet. This dataset is challenging due to the similarity between dog species and the large intra-class variation Khosla et al. (2011). Compared to ImageNet, Stanford Dogs is expected to be harder for human-AI teams because lay users often have significantly less prior knowledge about fine-grained dog breeds than about the wide variety of everyday objects in ImageNet.

2.2 Image classifiers

We took ResNet-34 He et al. (2016) pretrained on ImageNet (73.31% top-1 accuracy) from torchvision Marcel & Rodriguez (2010) as the target classifier (i.e., the AI) for both ImageNet and Stanford Dogs classification tasks because the 1000-class ImageNet that the model was pretrained on includes 120 Stanford Dogs classes. The visual explanations in this paper were generated to explain this model’s predicted labels. We chose ResNet-34 because ResNets were widely used in feature attribution research Selvaraju et al. (2017); Fong et al. (2019); Petsiuk et al. (2018); Agarwal & Nguyen (2020); Lu et al. (2020).
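For reference, below is a minimal sketch of how this classifier and its confidence scores could be obtained with torchvision (the preprocessing is the standard ImageNet recipe assumed here; this is not the authors' exact code):

```python
# A minimal sketch: load the ImageNet-pretrained ResNet-34 used as the target classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T

model = models.resnet34(pretrained=True).eval()  # newer torchvision versions use weights=...

# Standard ImageNet preprocessing (assumed): 224x224 center crop, ImageNet mean/std.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def predict(x):
    """Return the top-1 label and confidence for a preprocessed batch x of shape (N, 3, 224, 224)."""
    probs = torch.softmax(model(x), dim=1)  # confidence scores over the 1000 ImageNet classes
    conf, label = probs.max(dim=1)
    return label, conf
```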

2.3 Visual explanation methods

To understand the causal effect of adding humans in the loop, we compared the performance of a standard AI-only system (i.e., no humans involved) and human-AI teams. In each team, besides an input image and the corresponding top-1 label, humans are also presented with the corresponding confidence score from the classifier and a visual explanation (i.e., one heatmap or three nearest-neighbor images; see examples from our user study in Sec. A12). In total, we evaluated the following six methods.

AI-only

A common way to automate the process of accepting or rejecting a predicted label is via confidence-score thresholding Bendale & Boult (2016). That is, a top-1 predicted label is accepted if its associated confidence score is greater than or equal to a threshold. We found the optimal confidence threshold that produces the highest acceptance accuracy on the entire validation set by sweeping across threshold values at a 0.05 increment. The optimal values are 0.55 for ImageNet and 0.50 for Stanford Dogs (details in Sec. A6).
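A hedged sketch of this threshold sweep (variable and function names are illustrative, not taken from the paper's code):

```python
# AI-only baseline: accept a top-1 prediction iff its confidence >= t, and pick the t
# that maximizes accept/reject accuracy on the validation set.
import numpy as np

def best_threshold(confidences, is_correct, step=0.05):
    """confidences: top-1 confidence per image; is_correct: whether the top-1 label is right."""
    confidences = np.asarray(confidences)
    is_correct = np.asarray(is_correct, dtype=bool)
    thresholds = np.arange(step, 1.0, step)
    # The AI-only decision is counted correct when it accepts a correct prediction
    # or rejects an incorrect one.
    accs = [np.mean((confidences >= t) == is_correct) for t in thresholds]
    best = int(np.argmax(accs))
    return thresholds[best], accs[best]
```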

Confidence scores only

To understand the impact of visual explanations on human decisions, we also used a baseline where users are asked to make decisions given no explanations (i.e., given only the input image, its top-1 predicted label, and its confidence score). To the best of our knowledge, this baseline has not been studied in computer vision, but it has been shown useful in other domains for improving human-AI team accuracy or user trust Zhang et al. (2020); Bansal et al. (2020a).

GradCAM and Extremal Perturbation (EP)

We chose GradCAM Selvaraju et al. (2017) and Extremal Perturbation (EP) Fong et al. (2019) as two representatives of state-of-the-art attribution methods (Fig. 1). Representing the class of white-box, gradient-based methods Zhou et al. (2016); Chattopadhay et al. (2018); Rebuffi et al. (2020), GradCAM relies on the gradients and the activation map at the last conv layer of ResNet-34 to compute a heatmap. GradCAM passed a weight-randomization sanity check Adebayo et al. (2018) and often obtains competitive scores on proxy evaluation metrics (see Table 1 in Fong et al. (2019) and Tables 1 & 2 in Elliott et al. (2019)).

In contrast, EP is a representative of perturbation-based methods Fong & Vedaldi (2017); Agarwal & Nguyen (2020); Covert et al. (2020). EP searches for a set of input pixels (a fixed fraction of the image) that maximizes the target-class confidence score. We followed the authors' best hyperparameters, i.e., summing over four binary masks generated at different area sizes and using a Gaussian smoothing kernel with a standard deviation equal to 9% of the shorter side of the image (see the code EP2 ). We used the TorchRay package tor to generate GradCAM and EP attribution maps.
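The sketch below illustrates how these two attribution maps could be generated with TorchRay; the exact function signatures and the EP area fractions are assumptions based on TorchRay's documented API and may differ from the authors' setup:

```python
# Hedged sketch: GradCAM and Extremal Perturbation heatmaps via TorchRay.
from torchray.attribution.grad_cam import grad_cam
from torchray.attribution.extremal_perturbation import extremal_perturbation

# x: preprocessed input of shape (1, 3, 224, 224); label: the top-1 class index (int).
# GradCAM at the last conv block of ResNet-34 ("layer4" in torchvision naming).
gradcam_map = grad_cam(model, x, label, saliency_layer='layer4')

# Extremal Perturbation: masks optimized to preserve the target-class score.
# The area fractions below are illustrative placeholders, not necessarily the paper's values.
ep_masks, _ = extremal_perturbation(model, x, label, areas=[0.05, 0.1, 0.2, 0.4])
ep_map = ep_masks.sum(dim=0)  # the paper sums over the four binary masks
```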

We initially also included the vanilla Gradient Simonyan et al. (2013) and SHAP Lundberg & Lee (2017) methods but discarded them from our study since their AMs are too noisy to be useful to users in a pilot study.

Salient object detection (SOD)

To assess whether a useful heatmap needs to explain a specific classifier's decisions, we also considered a classifier-agnostic heatmap baseline. That is, we used a pre-trained, state-of-the-art salient-object detection (SOD) method called PoolNet Liu et al. (2019a), which uses a ResNet-50 backbone pre-trained on ImageNet. PoolNet was trained to output a heatmap that highlights the salient object in an input image (Fig. 3), a process that does not take our image classifier into account. Thus, SOD serves as a strong saliency baseline for GradCAM and EP.

3-NN

To further understand the pros and cons of AMs, we compared them to a representative, prototype-based explanation method (e.g., Nguyen et al. (2019); Chen et al. (2019); Nauta et al. (2020)). That is, for a given input image and a predicted label (e.g., "lorikeet"), we show the top-3 nearest images retrieved from the training-set images of the predicted class in ImageNet (Fig. 1). To compute the distance between two images, we used distances in the feature space of the last conv block of the ResNet-34 classifier (i.e., layer4 per the PyTorch definition res, or conv5_x in He et al. He et al. (2016)). We chose this layer because our pilot study found it to be the most effective among the four main conv blocks (i.e., layer1 to layer4) of ResNet-34.

While k-NN has a wide spectrum of applications in machine learning and computer vision Shakhnarovich et al. (2005), the effectiveness of human-AI collaboration using prototype-based explanations has rarely been studied Rudin et al. (2021). To the best of our knowledge, we provide the first user study that evaluates the effectiveness of prototype-based explanations (here, 3-NN) on human-AI team performance.
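A minimal sketch of how such a 3-NN explanation could be retrieved; the helper names and the Euclidean distance are illustrative assumptions, since the paper only specifies the layer4 feature space:

```python
# Hedged sketch: retrieve the 3 nearest training images of the predicted class in
# the feature space of ResNet-34's last conv block (layer4).
import torch

# Everything up to and including layer4 (drop the average pool and fc head).
feature_extractor = torch.nn.Sequential(*list(model.children())[:-2]).eval()

@torch.no_grad()
def embed(x):
    feats = feature_extractor(x)            # (N, 512, 7, 7) for 224x224 inputs
    return torch.flatten(feats, start_dim=1)

@torch.no_grad()
def three_nn(query_img, class_images):
    """class_images: preprocessed training images of the predicted class, shape (M, 3, 224, 224)."""
    q = embed(query_img)                    # (1, D)
    db = embed(class_images)                # (M, D)
    dists = torch.cdist(q, db).squeeze(0)   # Euclidean distances (metric assumed)
    return torch.topk(dists, k=3, largest=False).indices  # indices of the 3 nearest images
```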

2.4 User-study design

2.4.1 Participants

Our user-study experiments were designed and hosted on Gorilla Anwyl-Irvine et al. (2020). We recruited lay participants via Prolific Palan & Schitter (2018) (at $10.2/hr), which is known for a high-quality user base Peer et al. (2017). Each Prolific participant self-identified that English is their first language, which is the only demographic filter we used to select users on Prolific. Over the course of two small pilot studies and one main, large-scale study, we collected over 466 complete submissions in total (one per user) after discarding incomplete ones.

In the main study, after filtering out submissions by validation scores (described in Sec. 2.4.2), we had 161 and 159 qualified submissions for our ImageNet and Dogs experiments, respectively. Each of the 5 methods (i.e., all except AI-only) was tested by at least 30 users, and each (image, explanation) pair was seen by at least two users (statistics of users and trials in Sec. A8).

2.4.2 Tasks

Our Gorilla study contains three main sets of screens: (1) Introduction—where each user is introduced to the study and relevant rules (details in Sec. A10); (2) Training; and (3) Test. Each user is randomly assigned a set of test images and only one explanation method (e.g. GradCAM heatmaps) to work with during the entire study.

Training

To familiarize themselves with the image classification task (either ImageNet or Stanford Dogs), users were given five practice questions. After answering each question, they were shown feedback with groundtruth answers. In each training screen, we also described each component of the screen via annotations (see example screens in Sec. A10.2).

Validation and Test

After training, each user was asked to answer 40 Yes/No questions in total. Before each question, a user was provided with a short WordNet Miller (1995) definition of the predicted label and three random training-set images correctly classified into the predicted class with a confidence score (see an example in Fig. A5). To control the quality of user responses, out of the 40 trials, we used 10 trials as validation cases where we carefully chose the input images such that we expected participants who followed our instructions to answer correctly (details in Sec. A10.3). We excluded submissions that scored below 10/10 and 8/10 on the 10 validation trials from the ImageNet and Dogs experiments, respectively. For the remaining (i.e., qualified) submissions, we used the results of their 30 non-validation trials in our study.

2.4.3 Images

We wish to understand the effectiveness of visual explanations when AIs are correct vs. wrong, and when AIs face real vs. adversarial examples Szegedy et al. (2013). For both the ImageNet and Dogs experiments, we used the ResNet-34 classifier to sample three types of images: (1) correctly-classified, real images; (2) misclassified, real images; and (3) adversarial images (i.e., also misclassified). In total, we used 3 types × 150 images = 450 images per dataset. Each image was then used to generate model predictions and explanations for comparing the 6 methods described in Sec. 2.3.

Filtering   From the 50K-image ImageNet validation set, we sampled images for both the ImageNet and Dogs experiments. To minimize the impact of low-quality images on users' performance, we removed all 900 grayscale images and 897 images whose width or height is smaller than 224 px, leaving 48,203 and 5,881 images available for use in our ImageNet and Dogs experiments, respectively. For Dogs, we further excluded all 71 dog images mislabeled by ResNet-34 into non-dog categories (examples in Sec. A11) because they can trivialize explanations, yielding 5,810 Dogs images available for sampling.

Sampling natural images   To understand human-AI team performance at varying levels of difficulty for humans, we randomly selected images from the pool of filtered, real images into three sets: Easy (E), Medium (M), and Hard (H).

Hard images are those correctly labeled by the classifier with a low confidence score and those mislabeled with high confidence. Vice versa, the Easy set contains those correctly labeled with high confidence and those mislabeled with low confidence. The Medium set contains both correctly and incorrectly labeled images with a confidence score around 0.5, i.e., when the AI is unsure (see confidence-score distributions in Sec. A7). In each set (E/M/H), we sampled 50 images correctly labeled and 50 mislabeled by the model. In sum, per dataset, there are 300 natural images divided evenly into 6 controlled bins (see Fig. A1 for the ratios of these bins in the original datasets).

Generating adversarial images   After the filtering above, we took the remaining real images and generated adversarial examples for the ResNet-34 classifier via Foolbox Rauber et al. (2017), using Projected Gradient Descent (PGD) Madry et al. (2017) with a norm-bounded perturbation for 40 steps of a fixed step size. We chose this setup because, at weaker attack settings, most adversarial images became correctly classified after being saved as a JPEG file Liu et al. (2019b), defeating the purpose of causing AIs to misbehave. Here, for each dataset, we randomly sampled 150 adversarial examples (e.g., the input image in Fig. 1) that are misclassified at the time they are presented to users in the JPEG format and whose artifacts are so small that we assume they do not bias human decisions. Following the natural-image sampling, we also divided the 150 adversarial images into three sets (E/M/H), each containing 50 images.
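A hedged sketch of the attack generation with Foolbox; the 40-step count is from the paper, while the norm and perturbation budget below are illustrative placeholders for the elided settings:

```python
# Hedged sketch: PGD adversarial examples for the ResNet-34 classifier via Foolbox.
import foolbox as fb

preprocessing = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], axis=-3)
fmodel = fb.PyTorchModel(model.eval(), bounds=(0, 1), preprocessing=preprocessing)

attack = fb.attacks.LinfPGD(steps=40)       # 40 iterations, as in the paper; L-inf norm assumed
# images: a batch in [0, 1]; labels: groundtruth class indices.
_, adv_images, success = attack(fmodel, images, labels, epsilons=8 / 255)  # epsilon assumed
```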

2.5 Automatic evaluation metrics for attribution maps

Many attribution methods have been published; however, most AMs were not tested on end-users but instead only assessed via proxy evaluation metrics. We aim to measure the correlation between three common metrics—Pointing Game Zhang et al. (2018), Intersection over Union (IoU) Zhou et al. (2016), and weakly-supervised localization (WSL) Zhou et al. (2016)—and the actual human-AI team performance in our user study.

All three metrics are based on the assumption that an AM explaining a predicted label should highlight an image region that overlaps with a human-annotated bounding box (BB) for that category. For each of 300 real, correctly-classified images from each dataset (described in Sec. 2.4.3), we obtained its human-annotated BB from ILSVRC 2012 Russakovsky et al. (2015) for use in the three metrics.

Pointing game   Zhang et al. (2018) is a common metric often reported in the literature (e.g., Selvaraju et al. (2017); Fong et al. (2019); Rebuffi et al. (2020); Du et al. (2018); Petsiuk et al. (2018); Wang et al. (2020)). For an input image, a generated attribution heatmap is considered a correct explanation (i.e., a hit) if its highest-intensity point lies inside the human-annotated BB. Otherwise, it is a miss. The accuracy of an explanation method is computed by averaging over all images. We used the TorchRay implementation of Pointing Game tor and its default hyperparameters (tolerance = 15).
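A simplified sketch of the hit criterion (TorchRay's implementation additionally allows a tolerance margin around the annotation, which is omitted here):

```python
# Pointing Game: a heatmap is a "hit" if its maximum point falls inside the groundtruth box.
import numpy as np

def pointing_game_hit(heatmap, bbox):
    """heatmap: (H, W) array; bbox: (x_min, y_min, x_max, y_max) in pixel coordinates."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x_min, y_min, x_max, y_max = bbox
    return x_min <= x <= x_max and y_min <= y <= y_max
```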

Intersection over Union   The idea is to compute the agreement, in Intersection over Union (IoU) Zhou et al. (2016), between a human-annotated BB and a binarized attribution heatmap. A heatmap was binarized at the method's optimal threshold, which was found by sweeping across candidate threshold values.

Weakly-supervised localization (WSL)   is based on IoU scores Zhou et al. (2016). WSL counts a heatmap as correct if its binarized version yields a BB that overlaps with the human-labeled BB at a sufficiently high IoU. WSL is also commonly used, e.g., in Agarwal & Nguyen (2020); Selvaraju et al. (2017); Du et al. (2018). For both GradCAM and EP, we found the binarization threshold that corresponds to their best WSL scores on 150 correctly-classified images (i.e., excluding mislabeled images because human-annotated BBs are for the groundtruth labels).
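The sketch below shows the region-vs-box IoU computation described above; WSL additionally fits a bounding box to the binarized map and thresholds the resulting IoU (the helper below is illustrative, not the authors' code):

```python
# Hedged sketch: IoU between a binarized heatmap and a groundtruth bounding box.
import numpy as np

def iou_heatmap_vs_box(heatmap, bbox, threshold):
    """heatmap: (H, W) array; bbox: (x_min, y_min, x_max, y_max); threshold: binarization value."""
    heat_mask = heatmap >= threshold
    box_mask = np.zeros_like(heat_mask, dtype=bool)
    x_min, y_min, x_max, y_max = bbox
    box_mask[y_min:y_max + 1, x_min:x_max + 1] = True
    intersection = np.logical_and(heat_mask, box_mask).sum()
    union = np.logical_or(heat_mask, box_mask).sum()
    return intersection / union if union > 0 else 0.0
```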

3 Results

3.1 On natural images, how effective are attribution maps in human-AI team image classification?

We wish to understand the effectiveness of attribution maps (GradCAM and EP) compared to four baselines (AI-only, Confidence, SOD, and 3-NN) in image classification by human-AI teams. We compared 6 methods on 2 natural-image sets: ImageNet and Stanford Dogs.

Experiment

Because a given image can be mapped to one of the 6 controlled bins described in Sec. 2.4.3, for each of the 6 methods, we computed its human-AI team accuracy for each of the 6 bins, where each bin has exactly 50 images (Sec. A1 reports per-bin accuracy scores). To estimate the overall accuracy of a method on the original ImageNet and Dogs datasets, we computed the weighted sum of its per-bin accuracy scores, where the weights are the frequencies of images appearing in each bin in practice. For example, 70.93% of the Dogs images are correctly-classified with a high confidence score (Fig. A1; Easy Correct).


Figure 2: 3-NN is consistently among the most effective in improving human-AI team accuracy (%) on natural ImageNet images (a), natural Dog images (b), and adversarial ImageNet images (c). On the challenging adversarial Dogs images (d), using confidence scores only helps humans the most in detecting AI’s mistakes compared to using confidence scores and one visual explanation. Below each sample dataset image is the top-1 predicted label (which was correct or wrong) from the classifier.

ImageNet results   On ImageNet, human-AI teams where humans use heatmaps (GradCAM, EP, or SOD) and confidence scores outperformed AI-only by an absolute gain of 6–8% in accuracy (Fig. 2a; 80.79% vs. 88.77%). That is, when teaming up, humans and AI together achieve better performance than AI alone. However, only half of this improvement can be attributed to the heatmap explanations (Fig. 2a; 84.79% vs. 88.77%). That is, users presented with only the input image and the classifier's top-1 label and confidence score already obtained a 4% boost over AI alone (84.79% vs. 80.79%).

Dogs results   Interestingly, the trend did not carry over to fine-grained dog classification. On average, humans presented with (1) confidence scores only or (2) confidence scores with heatmaps all underperformed AI-only (Fig. 2b; 81.14% vs. 76.45%). One explanation is that ImageNet contains about 50% man-made entities, including many everyday objects that users are familiar with; therefore, human-AI teaming led to a substantial boost on ImageNet. In contrast, most lay users are not dog experts and therefore lack the prior knowledge necessary to help them in dog identification, resulting in even worse performance than a trained AI (Fig. 2b; 81.14% vs. 76.45%). Interestingly, when users were provided with nearest neighbors, human-AI teams with 3-NN outperformed all other methods (Fig. 2b; 82.88%), including AI-only.

Both datasets   On both ImageNet and Dog distributions, 3-NN is among the most effective. On average over 6 controlled bins, 3-NN also outperformed all other methods by a clear margin of 2.71% (Sec. A2 & A4). Interestingly, SOD users tend to reject more often than other users (Sec. A9), inadvertently causing a high human-AI team accuracy on AI-misclassified images (Sec. A13.5).

3.2 On adversarial images, how effective are attribution maps in human-AI team image classification?

As adversarial examples pose a serious threat to AI-only systems Papernot et al. (2016), here we are interested in testing how effective explanations are in improving human-AI team performance over AI-only, which is assumed to be 0% accurate under adversarial attacks. That is, on ImageNet and Dogs, we compare the accuracy of human-AI teams on 150 adversarial examples, all of which the classifier misclassified. Note that a user was given a random mix of natural and adversarial test images and was not informed whether an input image is adversarial or not.

Adversarial ImageNet   On both natural and adversarial ImageNet, AMs are on par with or more effective than showing confidence scores alone (Fig. 2c). Furthermore, the effect of 3-NN is a consistent +4% gain compared to Confidence (Fig. 2a & c; Confidence vs. 3-NN). Aligned with the results on natural ImageNet and Dogs, 3-NN remains the best method on adversarial ImageNet (Fig. 2c; 75.07%). See Sec. A13.4 for adversarial images that were correctly rejected only by 3-NN users and not by others.

Adversarial Dogs   Interestingly, on adversarial Dogs, adding an explanation (either 3-NN or a heatmap) tends to cause users to agree with the model's incorrect decisions (Fig. 2d). Heatmaps often only highlighted a coarse body region of a dog without pinpointing an exact feature that might be explanatory. Similarly, 3-NN often shows examples of a breed almost identical to the groundtruth (e.g., mountain dog vs. Gordon setter), which is hard for lay users to tell apart (see qualitative examples in Sec. A13.6).

3.3 On ImageNet, why is 3-NN more effective than attribution maps?

Analyzing the breakdown accuracy of 3-NN in each controlled set of Easy, Medium, and Hard, we found 3-NN to be the most effective method in the Easy and Hard categories of ImageNet (Table A3).

Easy   When the AI mislabels with low confidence, 3-NN often presents contrastive evidence that the predicted label is incorrect, i.e., the nearest examples from the predicted class are distinct from the input image (Fig. 1; lorikeets have a distinct blue-orange pattern not found in bee eaters). The same explanation is our leading hypothesis for 3-NN's effectiveness on adversarial Easy ImageNet images. More examples are in Sec. A13.3.

Hard   Hard images often contain occlusions (Figs. A9 & A12), unusual instances of common objects (Fig. A13), or could reasonably belong to multiple categories (Figs. A11 & A10). When the classifier is correct but has low confidence, we found 3-NN to be helpful in providing extra evidence for users to confirm the AI's decisions (Fig. 3). In contrast, heatmaps often only highlight the main object regardless of whether the AI is correct or not (Fig. 1; GradCAM). More examples are in Sec. A13.1.

Medium   Interestingly, 3-NN was the best method on the Easy and Hard sets, but not on the Medium set (Table A3). Upon a closer look, we surprisingly found that 63% of the images misclassified by 3-NN human-AI teams have debatable groundtruth ImageNet labels (see Sec. A13.2 for failure cases of 3-NN).

(a) “african hunting dog”: dog-like mammals of Africa
(b) “caldron”: a very large pot used for boiling
Figure 3: Hard images that were correctly accepted only by 3-NN users and not by users of GradCAM, EP, or SOD. Although the animals in the input image are partially occluded, 3-NN provided close-up examples of african hunting dogs, enabling users to correctly accept the label (a). Choosing a single label for a scene of multiple objects is challenging; however, 3-NN was able to retrieve a nearest example showing a very similar scene, enabling users to accept the AI's correct decision (b).
See full-res images in Figs. A9 & A10.

3.4 How do automatic evaluation scores correlate with human-AI team performance?

IoU Zhou et al. (2016), WSL Zhou et al. (2016), and Pointing Game Zhang et al. (2018) are three attribution-map evaluation metrics commonly used in the literature, e.g., in Petsiuk et al. (2018); Fong et al. (2019); Fong & Vedaldi (2017); Agarwal & Nguyen (2020); Selvaraju et al. (2017). Here, we measure the correlation between these scores and the actual human-AI team accuracy to assess how well high performance on such benchmarks translates into real performance on downstream tasks.

Experiment   We took the EP and GradCAM heatmaps of the 150 real images that were correctly classified (see Sec. 2.4.3) from each dataset and computed their IoU, WSL, and Pointing Game scores (note that WSL and Pointing Game scores are binary while IoU scores are real-valued; see Sec. 2.5). We computed the Pearson correlation between these scores and the human-AI team accuracy obtained in the previous human study (Sec. 2.4).
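A minimal sketch of this correlation computation (how the per-image human accuracy is aggregated is an assumption about how scores were paired):

```python
# Hedged sketch: Pearson correlation between automatic metric scores and human accuracy.
from scipy.stats import pearsonr

# metric_scores: one automatic score per image (binary for WSL / Pointing Game, real-valued for IoU)
# human_acc:     per-image fraction of users who made the correct accept/reject decision
r, p_value = pearsonr(metric_scores, human_acc)
```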

Results   While EP and GradCAM are state-of-the-art methods under Pointing Game Fong et al. (2019) or WSL Fong & Vedaldi (2017), we surprisingly found that the accuracy of users working with these AMs correlates poorly with the IoU, WSL, and Pointing Game scores of the heatmaps. Only in the case of GradCAM heatmaps for ImageNet (Fig. 4a) did the evaluation metrics show a small positive correlation with human-AI team accuracy (for IoU, WSL, and Pointing Game alike). In all other cases, i.e., GradCAM on Dogs and EP on ImageNet and Dogs (Fig. 4b–c), the correlation is negligible. That is, the pointing or localization accuracy of a feature attribution method does not necessarily reflect its effectiveness in helping users make correct decisions in downstream tasks.

(a) GradCAM’s IoU
(b) EP’s IoU
(c) EP’s Pointing Game
Figure 4: The ImageNet localization performance (here, under IoU vs. human-annotated bounding boxes) of GradCAM Selvaraju et al. (2017) and EP Fong et al. (2019) attribution maps correlates poorly with the human-AI team accuracy (y-axis) when users use these heatmaps in image classification. Humans can often make correct decisions even when heatmaps poorly localize the object (see the 0.0–0.2 range on the x-axis).

3.5 Machine learning experts found Nearest-Neighbors more effective than GradCAM

We found in our previous lay-user studies that AMs can be useful to human-AI teams, but not more so than a simple 3-NN in most cases (Sec. 3.1–Sec. 3.3). Here, we aim to evaluate whether such conclusions carry over to machine learning (ML) expert users who are familiar with feature attribution methods, nearest neighbors, and ImageNet. Because ML experts have a deeper understanding of these algorithms than lay users, our expert study aims to measure the utility of these two visual explanation techniques to ML researchers and practitioners.

As GradCAM was the best attribution method in the lay-user ImageNet study (Fig. 2), we chose GradCAM as the representative attribution map and compared it with 3-NN on human-AI team image classification on ImageNet.

Experiment   We repeated the same lay-user study but on a small set of 11 expert users, who are Ph.D. students and postdoctoral researchers in ML and computer vision from a range of academic institutions. We recruited five GradCAM users and six 3-NN users who are very familiar with feature attribution and nearest-neighbor algorithms, respectively. As in the lay-user study, each expert was presented with a set of 30 randomly chosen images, including both natural and adversarial images.

Results   The experts working with 3-NN performed substantially better than the GradCAM experts (Table 1; mean accuracy of 76.67% vs. 68.00%). 3-NN is consistently more effective on both the natural and adversarial image subsets. Interestingly, the performance of GradCAM users also varies about 3× more (standard deviation of 8.69% vs. 2.98% in accuracy). Aligned with the lay-user study, we here found 3-NN to be more effective than AMs in improving human-AI team performance even when the users are domain experts familiar with the mechanics of how the explanations were generated.

          Users   Avg. validation   Natural                 Adversarial             Overall
                  accuracy          Accuracy   Trials       Accuracy   Trials       mean ± std
GradCAM   5       9.80/10           67.31      70/104       69.57      32/46        68.00 ± 8.69
3-NN      6       9.83/10           78.45      91/116       73.44      47/64        76.67 ± 2.98
Table 1: 3-NN is far more effective than GradCAM in human-AI team image classification of both natural and adversarial images. The last column reports the mean and standard deviation of per-user accuracy over all images.

4 Related Work

Evaluating confidence scores on humans

AI confidence scores have been found to improve users' trust in AI decisions Zhang et al. (2020) and to be effective for human-AI team prediction accuracy on several NLP tasks Bansal et al. (2020a). In this work, we do not measure user trust but only the complementary human-AI team performance. To the best of our knowledge, our work is the first to perform such a human evaluation of AI confidence scores for image classification.

Evaluating explanations on humans   In NLP, using attribution methods as word highlighters has been found to improve user performance in question answering Feng & Boyd-Graber (2019). Bansal et al. (2020a) found that such human-AI team performance improvements arise because explanations tend to make users more likely to accept AI decisions regardless of whether the AI is correct. We found consistent results: users with explanations tend to accept AI's predicted labels more often (see Sec. A9), with the exception of SOD users, who reject more.

In image classification, Chu et al. (2020) found that presenting Integrated Gradients heatmaps to users did not significantly improve human accuracy in predicting age from facial photos. Different from Chu et al. (2020), our study tested multiple attribution methods (GradCAM, EP, SOD) and used the ImageNet classification task, on which most attribution methods have been evaluated under proxy metrics.

Shen and Huang Shen & Huang (2020) showed users GradCAM Selvaraju et al. (2017), EP Fong et al. (2019), and SmoothGrad Smilkov et al. (2017) heatmaps and measured user performance in harnessing the heatmaps to identify a label that an AI incorrectly predicts. While they measured the effect of showing all three heatmaps to users at once, we compared each method separately (GradCAM vs. EP vs. SOD) in a larger study. Similar to Shen & Huang (2020), Alqaraawi et al. (2020) tested the capability of LRP Binder et al. (2016) attribution maps in helping users understand the decision-making process of an AI. Our work differs from these two papers Shen & Huang (2020); Alqaraawi et al. (2020) in that we did not design our own task to measure human understanding of AIs, but instead measured human-AI team performance on the standard downstream task of ImageNet classification.

5 Discussion and Conclusion

Limitations   An inherent limitation of our study is that it is not possible to control the amount of prior knowledge that a participant has before entering the study. For example, a human with strong dog expertise may perform better at fine-grained dog classification; in that case, the utility of explanations is unknown in our study. We attempted to estimate the effect of prior knowledge on human-AI team accuracy by asking each user whether they knew a class before each trial. We found prior knowledge to account for 1–6% in accuracy (Sec. A5). Due to COVID and the large scale, our study was conducted online, which made it infeasible for us to control various physical factors (e.g., users performing other activities during the experiment) compared to an in-lab study.

To our knowledge, our work is the first to (1) evaluate human-AI team performance on the common ImageNet classification task; (2) assess explanations on adversarial examples; and (3) reveal the weak correlation between automatic evaluation metrics (Pointing Game, IoU Zhou et al. (2016), and WSL Zhou et al. (2016)) and the actual team performance. Such poor correlation encourages future interpretability research to include humans in their evaluations and to rethink the current automatic metrics. We also show the first evidence in the literature that a simple 3-NN can outperform existing attribution maps, suggesting that a combination of the two explanation types might be useful for future work. The superiority of 3-NN also suggests that prototypical Chen et al. (2018); Goyal et al. (2019) and visually-grounded explanations Hendricks et al. (2016, 2018) may be more effective than heatmaps.

References

Appendix A1 Break-down human-AI team accuracy on image subsets

In Table A1, for each subset mentioned in Sec. 2.4.3, we let acc_{b,p} denote the accuracy on that subset, where b is the bin (E, M, or H) and p is the prediction result (Correct or Wrong) produced by ResNet-34.

             ImageNet                                              Stanford Dogs
             E               M               H                     E               M               H
             Correct  Wrong  Correct  Wrong  Correct  Wrong        Correct  Wrong  Correct  Wrong  Correct  Wrong
Confidence   92.23    78.64  88.78    59.22  75.51    43.16        86.67    66.96  74.07    32.77  64.41    25.86
GradCAM      98.02    77.57  90.38    60.38  78.79    40.19        90.29    63.11  84.04    34.62  59.41    19.59
EP           97.14    86.41  84.11    58.18  78.64    37.38        87.63    58.88  79.81    22.22  70.64    15.24
SOD          95.69    82.57  88.60    55.65  66.96    43.75        83.96    70.19  70.18    40.78  59.43    28.04
3-NN         96.64    88.60  86.84    54.95  83.93    44.76        97.46    56.00  76.92    24.55  60.42    18.75
Table A1: Human-AI team accuracy acc_{b,p} (%) on natural images in the 6 controlled bins.
Controlled-to-original accuracy conversion

While our experiments were not conducted on the real distributions of ImageNet and Dogs, we can still estimate the human accuracy of the explanation methods on the original datasets using the real ratios of the subsets (r_{b,p}) presented in Fig. A1.

From Table A1 and Fig. A1, we estimated the human accuracy of each explanation method over the original distribution by the following formula:

Accuracy = Σ_{b ∈ {E, M, H}} Σ_{p ∈ {Correct, Wrong}} r_{b,p} × acc_{b,p}

The estimated accuracy was reported in Fig. 2. It should be noted that the above conversion was applied to natural images only because adversarial images were synthetically generated, so no original distribution exists.
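A small sketch of this conversion (the dictionary keys are illustrative):

```python
# Hedged sketch: estimate accuracy on the original distribution as a weighted sum of
# per-bin accuracies, weighted by each bin's ratio in the original dataset (Fig. A1).
def estimate_original_accuracy(per_bin_acc, bin_ratios):
    """Both dicts are keyed by (bin, prediction), e.g. ('E', 'Correct'); ratios sum to 1."""
    return sum(bin_ratios[key] * per_bin_acc[key] for key in per_bin_acc)
```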

Figure A1: The ratios r_{b,p} of image bins in the controlled and real distributions.

Appendix A2 How effective are attribution maps in improving human-AI team accuracy on the controlled distribution?

On average, over the controlled distribution (i.e. 6 bins, each of 50 images), we found that 3-NN is the best method to help end-users categorize ImageNet images. 3-NN outperforms other explanation methods by at least 2.71% (Table A2; 76.59%). However, in fine-grained dog classification, explanations seem to have low utility. All methods scored nearly the same and close to the random baseline (i.e. 50% accuracy).

ImageNet Dogs
Natural Natural
Confidence 73.17 58.33
GradCAM 73.88 58.47
EP 73.39 55.72
SOD 72.25 58.91
3-NN 76.59 56.73
Table A2: Human-AI team accuracy with explanation methods on random images from the controlled distribution. 3-NN is the most effective in the ImageNet experiment, while showing explanations could decrease human accuracy in fine-grained dog classification (Dogs).

Appendix A3 Are attribution maps more effective than 3-NN in improving human accuracy on Easy, Medium, and Hard images under the controlled distribution?

In Table A3, we found 3-NN beating the attribution methods and the other baselines by a wide margin on Easy and Hard images of the ImageNet classification task. It surpassed the second-best method in each of these image sets (EP on E and Confidence on H) by 3.13% on average.

Yet, 3-NN provided very little information to users in the Dogs experiment. No method performed well on Easy and Hard dog images. Because the differences among dog classes are minimal, only explanations that correctly point out discriminative features could help users.

When the AI classifies an image with a confidence score around 0.5 (Medium images), GradCAM was the best explanation method on both ImageNet and Dogs, achieving 75.24% and 58.08%, respectively. However, the usefulness of explanation methods on Medium dog images was insignificant since the net improvements over the random-choice baseline were minimal.

             ImageNet                  Dogs
             E       M       H         E       M       H
Confidence   85.44   73.63   59.59     77.02   52.42   45.30
GradCAM      87.50   75.24   58.74     76.70   58.08   39.90
EP           91.83   70.97   57.62     72.55   51.72   43.46
SOD          89.33   72.05   55.51     77.14   56.22   43.66
3-NN         92.70   71.11   64.98     78.44   50.00   39.58
Mean         89.36   72.60   59.29     76.37   53.69   42.38
Table A3: Human accuracy with explanation methods on easy (E), medium (M), and hard (H) images of the controlled distributions.

Appendix A4 Are attribution maps more effective than 3-NN in improving human accuracy on correct/wrong images of the controlled distribution?

Explanations have been shown to increase the chance that humans agree with AI's predictions in sentiment classification and question answering Bansal et al. (2020a). While here we examine human accuracy separately on images the AI classified correctly vs. wrongly, our results also show that explanations encouraged humans to accept AI's predictions rather than reject them in image classification.

In Table A4, for ImageNet images, 3-NN was the most useful method for identifying AI errors, at 63.33%. Besides, the explanations from 3-NN and GradCAM seemed to benefit humans equally when the AI classified correctly (89.28% and 89.14%, respectively).

EP and 3-NN did the best at explaining the network's correct predictions on dog images (79.03% and 79.56%). We were surprised that no method outperformed the random-choice approach (50%) at identifying misclassifications of ResNet-34 on Dogs images.

ImageNet Dogs
Correct Wrong Correct Wrong
Confidence 85.62 60.80 75.14 41.71
GradCAM 89.14 59.38 77.85 39.47
EP 86.67 60.31 79.03 32.48
SOD 83.77 60.42 71.17 46.18
3-NN 89.28 63.33 79.56 33.01
Table A4: Human accuracy with explanation methods on correct and wrong images of the controlled distributions.

Appendix A5 How does prior knowledge affect human accuracy?

Although we tried to mitigate the effect of prior knowledge by giving users the definition and sample images of categories, we still would like to examine how differently users score when never-seen-before vs. already-known objects are presented.

We let Δ denote the gap between the accuracy on known images and the accuracy on unknown images within a method. Looking at Table A5, SOD was affected the most by prior knowledge compared to the other explanation methods, with a Δ of 6.38% in ImageNet and 2.21% in Dogs. An explanation for this observation is that, because SOD is model-agnostic, it provides the least classifier-relevant information, and therefore humans would have to rely on their prior knowledge the most to make decisions.

             ImageNet                                          Dogs
             Known              Unknown             Δ          Known              Unknown             Δ
             Accuracy  Trials   Accuracy  Trials                Accuracy  Trials   Accuracy  Trials
Confidence   73.07     724      69.89     176       3.18       61.88     446      61.59     604       0.29
GradCAM      72.69     714      72.22     216       0.47       60.78     459      60.32     441       0.46
EP           74.11     734      73.01     226       1.10       55.39     334      57.38     596      -1.84
SOD          73.37     811      66.99     209       6.38       60.52     468      62.73     492      -2.21
3-NN         76.46     838      74.32     182       2.14       56.35     367      57.75     563      -1.40
Table A5: Human accuracy on known and unknown images (both natural and adversarial). Δ is the accuracy gap between known and unknown images.

Appendix A6 AI-only thresholds

Using the confidence score to automate the task, we found that the two optimal thresholds are 0.55 and 0.50 for ImageNet and Dogs, respectively (see Table A6). These threshold values were tuned on around 50K ImageNet and 6K Stanford Dogs images.

Threshold  0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.45  0.50  0.55  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95
ImageNet   73.40 73.74 74.47 75.47 76.66 77.75 78.66 79.44 80.11 80.56 80.79 80.59 79.93 79.01 77.85 76.31 74.12 70.65 64.83
Dogs       77.37 77.44 77.81 78.23 78.81 79.31 79.80 80.65 80.65 81.14 81.07 80.34 79.34 77.67 75.82 73.63 70.02 65.21 55.93
Table A6: Accuracy of AI-only on ImageNet and Dogs at different threshold values.

Appendix A7 Original distributions of ImageNet and Stanford Dogs

Fig. A2 shows the original distributions of ImageNet and Dogs by confidence interval. Although we skipped images with confidence in [0.3, 0.4) and [0.6, 0.8), the human accuracy on the remaining intervals can still represent that of the original distributions because the skipped intervals contain relatively few images.

(a)
(b)
Figure A2: Image distribution of (a) ImageNet and (b) Dogs by confidence score intervals.

Table A7 shows the distributions of Easy, Medium, and Hard images in the ImageNet and Dogs datasets. Note that we ignored a few confidence intervals, so the total numbers of images are not 50K and 6K.

Table A7: The real distributions of easy, medium, and hard images in the ImageNet and Dogs datasets.
         ImageNet                       Dogs
         Images    Percentage (%)       Images    Percentage (%)
E        28967     75.19                3364      74.13
M        6648      17.26                916       20.19
H        2908      7.55                 258       5.68
Total    38523     100                  4538      100

Appendix A8 Participants’ statistics

Table A8 shows the numbers of users, trials, and users per image in the ImageNet and Dogs experiments for each method.

ImageNet Dogs
Users Trials Users per image Users Trials Users per image
Confidence 30 900 2.00 35 1050 2.33
GradCAM 31 930 2.07 30 900 2.00
EP 32 960 2.13 31 930 2.07
SOD 34 1020 2.27 32 960 2.13
3-NN 34 1020 2.27 31 930 2.07
Total 161 4830 159 4770
Table A8: The number of users, trials, and users per image in the ImageNet and Dogs experiments. A trial is one instance of showing an image to a user and asking for their decision.

Appendix A9 Participants’ acceptance-rejection rate

Table A9 shows the ratios of acceptance and rejection across methods in ImageNet and Dogs (natural examples only). Consistent with Bansal et al. (2020a), we found that explanations increase the chance that humans accept AI's predictions, with the exception of SOD. As mentioned in Sec. A13.5, SOD users were the most likely to reject AI's labels because this baseline gave users low-quality heatmaps.

ImageNet Dogs
Accept Reject Accept Reject
Confidence 62.33 37.67 66.67 33.33
GradCAM 64.26 35.74 69.10 30.90
EP 62.99 37.01 73.27 26.73
SOD 61.97 38.03 62.66 37.34
3-NN 63.56 36.44 73.40 26.60
Table A9: The percentages of acceptance and rejection in the ImageNet and Dogs experiments.

Appendix A10 Sample experiment screens

Here we show the screens of our experiment's user interface.

A10.1 Instructions

We introduced Sam, the AI, and explained the tasks to participants. We also explicitly restricted the devices allowed for the experiments to ensure that display resolution would not affect human decisions.

Figure A3: Instruction screen of experiments.

A10.2 Training

We gave users 5 training trials, each followed by a feedback screen on the user's decision. For GradCAM, EP, and SOD, whose heatmaps have a similar format, the same interpretation guide was provided, as shown in Fig. A4a. 3-NN displays to users three correct images of the predicted class (Fig. A4b), while Confidence only shows the input image, the predicted label, and the confidence score.

(a)
(b)
Figure A4: The training screens for (a) heatmaps (GradCAM, EP, and SOD) and (b) nearest neighbors with annotated components.

A10.3 Validation and Test

While we had 10 validation trials and 30 test trials, we did not tell participants about the validation phase to avoid overfitting. These 40 trials were shown continuously, and the definition and sample images of the predicted class were given beforehand, as in Fig. A5. No feedback was given during validation and test.

(a)
(b)
Figure A5: Users are given the definition and three sample images of the predicted class (a). Users are asked to agree or disagree with the AI's prediction using an explanation (b).

As the purpose of validation was to filter out users not paying enough attention to the experiments (e.g., random clicking), we carefully chose images that were clearly correctly or clearly wrongly classified by ResNet-34 (Fig. A6) to check users' attention. Note that the definition and sample images of the predicted category were given in advance.

(a) ImageNet
(b) Stanford Dogs
Figure A6: Validation images for (a) ImageNet and (b) Dogs. There are 5 clearly correct and 5 clearly wrong predictions by the AI in each experiment. Above each image are the groundtruth label and the misclassified label (if the image was misclassified by the AI). In Stanford Dogs, the misclassified images were synthetically generated using PGD adversarial attacks.

Appendix A11 Images mislabeled into non-dog categories (MIND samples)

In the Dogs experiment, images misclassified into non-dog categories (MIND) were discarded because users can often instantly reject them without any explanation. As shown in Fig. A7, almost all MIND samples contain more than one object.

(a)
(b)
(c)
(d)
Figure A7: Mislabeled Into Non-Dog samples (MINDs). ResNet-34 produced 71 MINDs on Dogs. Above each image are the groundtruth label and the misclassified label.

Appendix A12 Example explanations displayed to users

Below are examples of explanations taken from the screens displayed to users during our user study. Here, we show the differences between GradCAM, EP, SOD, and 3-NN explanations for the same input image predicted as "american coot" (Fig. A8).

(a) GradCAM explanation
(b) EP explanation
(c) SOD explanation
(d) 3-NN explanation
Figure A8: The explanations from GradCAM, EP, SOD, and 3-NN for the input image labeled "american coot" by the classifier. While the GradCAM highlight tends to be expansive, the focus of EP is narrow, and SOD attends to the entire body of the bird. 3-NN presents similar scenes of a coot around a pond.

Appendix A13 Qualitative examples supporting our findings

A13.1 Hard, real ImageNet images that were correctly labeled by 3-NN users but not GradCAM, EP, or SOD users

Regarding hard images, we observed that images answered correctly by 3-NN users but not by attribution-map or SOD users often contain multiple concepts (Fig. A11), are of low quality (Fig. A9), contain look-alike objects (Figs. A10 & A14), show only a part of the main object (Fig. A12), or show objects with unusual appearances (Fig. A13). On these images, when the heatmaps did not highlight discriminative features and the confidence score was low, users tended to reject AI's labels. In contrast, 3-NN helped users gain confidence that the AI is correct when there are multiple plausible labels.

Figure A9: A hard, low-quality ImageNet image. 3-NN might help users recognize the shape and color of the "african hunting dog", while attribution methods gave users little information because users could not clearly see what the AI was looking at.
Figure A10: A hard ImageNet image with probable look-alike objects. 3-NN even found the same man, which strongly supports the AI's prediction, while EP and SOD highlighted incorrect regions.
Figure A11: A hard ImageNet image with multiple plausible concepts present. The last (rightmost) image of 3-NN showed an image of a car labeled as "car wheel", which strongly helps users confirm the AI's prediction. However, users with other explanation methods rejected the prediction because the heatmaps did not highlight properly or the AI showed low confidence.
Figure A12: A hard ImageNet image showing only a part of the main object. 3-NN helped users recognize a "jean" leg, while the heatmaps could not find the determinative features in the input image.
Figure A13: A hard ImageNet image with a strange-looking object. Although the "ladle" in the input image looks strange, 3-NN showed that other ladles also have unusual appearances (the last neighbor image). Users might otherwise have instantly rejected the prediction because the three sample images are contrastive.
Figure A14: A hard ImageNet image with probable look-alike objects. 3-NN helped users gain confidence even though "overskirt", "hoopskirt", and "gown" look very similar.

A13.2 Medium, AI-misclassified, real ImageNet images that were incorrectly accepted by 3-NN users

These images can be divided into two main categories: debatable ground-truth (Figs. A15 & A16) and look-alike objects (Figs. A17 & A18).

Figure A15: A Medium ImageNet image with a wrong label. The input image and the first NN are clearly the same, yet the annotated label is "missile".
Figure A16: A Medium ImageNet image with multiple objects present. The spider is salient and 3-NN retrieved very similar images, but the ground truth is "spider web".
Figure A17: A Medium ImageNet image with fine-grained classes. 3-NN failed to show users the difference between "cardigan" and "pembroke".
Figure A18: A Medium ImageNet image with confusing objects. 3-NN failed to show users the difference between "palace" and "monastery".

A13.3 Easy, AI-misclassified ImageNet images that were correctly rejected by 3-NN users but not GradCAM and SOD users

3-NN helped users distinguish the two classes by showing contrastive examples (e.g. walking stick vs. african chameleon in Fig. A19 or horse cart vs. grocery store in Fig. A20).

Figure A19: Easy ImageNet image which was clearly misclassified by the AI. 3-NN easily pointed out the difference between “african chameleon” and “walking stick” to users.
Figure A20: Easy ImageNet image which was clearly misclassified by the AI. 3-NN easily pointed out the difference between “grocery store” and “horse cart” to users.

A13.4 Adversarial ImageNet images that were correctly rejected by 3-NN users but not GradCAM, EP, or SOD users

In Fig. 2, what made 3-NN more effective than attribution maps on Adversarial ImageNet?

As the adversarial attacks fooled the AI with small perturbations, the misclassified labels are not far from the ground truth (e.g., bee eater to lorikeet in Fig. A22, or collie to shetland sheepdog in Fig. A26). The heatmap highlights focused on parts of the main object, making the explanations compelling to users. 3-NN helped users differentiate the two categories by looking at the contrastive images of the predicted label and the ground truth (Fig. A21 and Fig. A22).

Figure A21: An adversarial ImageNet image of "barracouta" that was misclassified as "tench". Users may use the difference in skin patterns between "tench" and "barracouta" to make a decision.
Figure A22: An adversarial ImageNet image of "bee eater" that was misclassified as "lorikeet". 3-NN contrasted these two bird species strongly.

A13.5 AI-misclassified Dogs images that were correctly rejected by SOD users but not GradCAM, EP, or 3-NN users

In Table A1, what made SOD significantly more effective than other methods on correcting AI-misclassified images of Dogs?

We found that GradCAM and EP often highlighted the entire face of the dogs, which made the heatmaps persuasive to users although the predictions were wrong. Regarding 3-NN, the misclassified category is visually similar to the ground truth (i.e. eskimo dog vs. malamute), which was challenging for lay users to distinguish. We assume that users expected explanations to be as specific and relevant as possible because the differences among breeds are minimal. SOD highlighted the entire body of the dogs (Fig. A23) or even irrelevant areas (Fig. A24). This explains why users with SOD tended to ignore rather than trust the AI. While 3-NN users leveraged the information of nearest neighbors to identify AI’s errors, SOD users rejected predictions because of the heatmaps’ low quality, which unintentionally improved the accuracy on wrong Dogs images. The rejection rates of SOD users were highest in both ImageNet (38.03%) and Dogs (37.34%) as shown in Table A9. Indeed, due to the high rejection rate of SOD users, the accuracy on correct Dogs images was lowest as shown in Table A1.

Figure A23: A misclassified Dogs image of "eskimo dog" that was mislabeled as "malamute". GradCAM and EP often highlighted the entire face of the dog, which makes the heatmaps persuasive to users even though the predictions are wrong. For 3-NN, the mislabeled category is visually similar to the ground truth (e.g., eskimo dog vs. malamute), which is challenging for users to distinguish, so they were inclined to accept the predictions. SOD always highlights the entire body of the dog, explaining why users with SOD tended to ignore rather than trust the AI.
Figure A24: A misclassified Dogs image of "flat-coated retriever" that was mislabeled as "newfoundland". SOD highlighted irrelevant areas, so SOD users were more likely to reject.

A13.6 Adversarial Stanford Dogs images that were correctly rejected only by Confidence users but not GradCAM, EP, and 3-NN users

In Fig. 2, why did visual explanations hurt human-AI team performance on Adversarial Dogs (the hardest task)?

We found no explanation method that benefited participants in this task. While GradCAM and EP mostly concentrated on a body part of the dogs, 3-NN showed images of an almost identical breed (Figs. A25 & A26). Again, the improvement of SOD came from the low quality of its heatmaps.

Figure A25: An adversarial Dogs image of "cocker spaniel" that was misclassified as "bedlington terrier". GradCAM and EP concentrated on the belly of the dog, and 3-NN showed images of an almost identical breed, making users trust the prediction.
Figure A26: An adversarial Dogs image of "collie" that was misclassified as "shetland sheepdog". GradCAM and EP concentrated on the face of the dog, and 3-NN showed images of a very similar dog breed, making users trust the prediction.