Why did a computer vision system suspect that a person had breast cancer Wu et al. (2019), or was a US Capitol rioter fbi, or a shoplifter law? The explanations for such high-stakes predictions made by existing Artificial Intelligence (AI) agents can impact human lives in various aspects, from social Doshi-Velez & Kim (2017), to scientific Nguyen et al. (2015), and legal law; Goodman & Flaxman (2017); Doshi-Velez et al. (2017).
A common medium for explaining an image classifier’s decisions is an attribution map (AM) Bansal et al. (2020b), i.e. a heatmap that highlights the input pixels that are important for or against a predicted label. Attribution maps, a.k.a. “saliency maps”, can be useful in localizing malignant tumors in x-ray images Rajpurkar et al. (2017) or detecting biases in image classifiers Lapuschkin et al. (2016).
Since 2013 Simonyan et al. (2013), hundreds of research papers have either used attribution methods or proposed new ones Covert et al. (2020); Das & Rad (2020). Yet, it remains largely unknown how effective state-of-the-art AMs are in improving the performance of human-AI teams on computer vision tasks. Given that humans are the target users of explanations, answering this question is critical for the community to produce useful methods. However, most attribution methods have only been evaluated on proxy automatic-evaluation metrics such as pointing game Zhang et al. (2018), weakly-supervised localization Zhou et al. (2016), or deletion Petsiuk et al. (2018), which may not necessarily correlate with human-AI team performance on a downstream task.
In this paper, we conducted the first, large-scale user study to shed light on the effectiveness of AMs in assisting humans in single-label, image classification, which is the task that most attribution methods were designed for. We asked 320 lay and 11 expert users to decide whether machine decisions are correct after observing an input image, top-1 classification outputs, and explanations (Fig. 1). We ran experiments on both real as well as adversarial images Szegedy et al. (2013) and on both coarse 1000-class and fine-grained 120-class image classification tasks, i.e. ImageNet Russakovsky et al. (2015) and Stanford Dogs Khosla et al. (2011), respectively. Our main findings include:
- AMs are, surprisingly, not more effective than nearest-neighbors (here, 3-NN) in improving human-AI team performance on both ImageNet and fine-grained dog classification (Sec. 3.1).
- On fine-grained dog classification, a harder task for a human-AI team than 1000-class ImageNet, presenting AMs to humans interestingly does not help but instead hurts the performance of human-AI teams, compared to AI alone without humans (Fig. 2b).
- On adversarial ImageNet images, 3-NN is more effective than all tested attribution methods in helping humans to reject these incorrect AI predictions (Sec. 3.1).
- On adversarial Stanford Dogs images, which are very hard for humans to label, presenting only confidence scores to humans is the most effective method compared to all other explanations tested, including AMs and 3-NN examples (Sec. 3.1).
2.1 Image classification tasks
To evaluate human-AI team performance on image classification, for each image, we presented to a human the following items: (1) an AI’s top-1 predicted label, (2) its confidence score, and, optionally, (3) a visual explanation generated in an attempt to explain the predicted label. The user was then asked to decide whether the AI’s predicted label is correct (see Fig. 1). We performed our experiments on two image datasets of varying difficulty, ImageNet and Stanford Dogs.
We tested human-AI teams on ImageNet Russakovsky et al. (2015), which is an image classification task that most attribution methods were tested on or designed for Zhou et al. (2016); Selvaraju et al. (2017); Agarwal & Nguyen (2020); Petsiuk et al. (2018). ImageNet is a 1.3M-image dataset spanning across 1000 diverse categories from natural to man-made entities Russakovsky et al. (2015).
To test whether our findings generalize to a fine-grained classification task that is harder for human-AI teams than ImageNet, we repeated the experiments on the 120-class Stanford Dogs Khosla et al. (2011) dataset, a subset of ImageNet. This dataset is challenging due to the similarity between dog breeds and the large intra-class variation Khosla et al. (2011). Compared to ImageNet, Stanford Dogs is expected to be harder for human-AI teams because lay users often have significantly less prior knowledge about fine-grained dog breeds than about the wide variety of everyday objects in ImageNet.
2.2 Image classifiers
We took ResNet-34 He et al. (2016) pretrained on ImageNet (73.31% top-1 accuracy) from torchvision Marcel & Rodriguez (2010) as the target classifier (i.e., the AI) for both ImageNet and Stanford Dogs classification tasks because the 1000-class ImageNet that the model was pretrained on includes 120 Stanford Dogs classes. The visual explanations in this paper were generated to explain this model’s predicted labels. We chose ResNet-34 because ResNets were widely used in feature attribution research Selvaraju et al. (2017); Fong et al. (2019); Petsiuk et al. (2018); Agarwal & Nguyen (2020); Lu et al. (2020).
2.3 Visual explanation methods
To understand the causal effect of adding humans in the loop, we compared the performance of a standard AI-only system (i.e. no humans involved) and human-AI teams. In each team, besides an input image and a corresponding top-1 label, humans are also presented with the corresponding confidence score from the classifier and a visual explanation (i.e. one heatmap or three 3-NN images; see examples from our user study in Sec. A12). In total, we evaluated the following six methods.
AI-only
A common way to automate the process of accepting or rejecting a predicted label is via confidence-score thresholding Bendale & Boult (2016). That is, a top-1 predicted label is accepted if its associated confidence score is at least a threshold. We found the optimal confidence threshold that produces the highest accepting accuracy on the entire validation set by sweeping across threshold values at a 0.05 increment. The optimal values are 0.55 for ImageNet and 0.50 for Stanford Dogs (details in Sec. A6).
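The threshold sweep described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the assumption that the sweep covers the range [0, 1], and the toy confidence scores are all hypothetical. A decision counts as right when a correct label is accepted or an incorrect label is rejected.

```python
def accept_accuracy(confidences, is_correct, threshold):
    """Fraction of images where thresholding makes the right accept/reject call."""
    hits = 0
    for conf, correct in zip(confidences, is_correct):
        accepted = conf >= threshold
        # Right call: accept a correct label, or reject an incorrect one.
        if accepted == correct:
            hits += 1
    return hits / len(confidences)


def best_threshold(confidences, is_correct, step=0.05):
    """Sweep thresholds on a grid and return the one with the highest accuracy."""
    # Assumption: the grid spans [0, 1]; the source only states the 0.05 increment.
    n_steps = round(1 / step)
    candidates = [round(i * step, 10) for i in range(n_steps + 1)]
    return max(candidates, key=lambda t: accept_accuracy(confidences, is_correct, t))


# Toy data (made up for illustration): classifier confidences and whether
# the corresponding top-1 label was correct.
confidences = [0.95, 0.80, 0.62, 0.58, 0.40, 0.30, 0.20]
is_correct = [True, True, True, False, False, True, False]

threshold = best_threshold(confidences, is_correct)
print(threshold, accept_accuracy(confidences, is_correct, threshold))
```

When several thresholds tie, `max` returns the smallest one on the grid; a real sweep may prefer a different tie-breaking rule.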
Confidence scores only
To understand the impact of visual explanations on human decisions, we also used a baseline where users are asked to make decisions given no explanations (i.e. given only the input image, its top-1 predicted label, and confidence score). To the best of our knowledge, this baseline has not been studied in computer vision, but has been shown to be useful in other domains for improving human-AI team accuracy or users’ trust Zhang et al. (2020); Bansal et al. (2020a).
GradCAM and Extremal Perturbation (EP)
We chose GradCAM Selvaraju et al. (2017) and Extremal Perturbation (EP) Fong et al. (2019) as two representatives of state-of-the-art attribution methods (Fig. 1). Representing the class of white-box, gradient-based methods Zhou et al. (2016); Chattopadhay et al. (2018); Rebuffi et al. (2020), GradCAM relies on gradients and the activation map at the last conv layer of ResNet-34 to compute a heatmap. GradCAM passed a weight-randomization sanity check Adebayo et al. (2018) and often obtained competitive scores in proxy evaluation metrics (see Table 1 in Fong et al. (2019) and Tables 1 & 2 in Elliott et al. (2019)).
EP searches for a set covering a fixed percentage of input pixels that maximizes the target-class confidence score. We followed the authors’ best hyperparameters: summing over four binary masks generated at different area settings and using a Gaussian smoothing kernel with a standard deviation equal to 9% of the shorter side of the image (see the code EP2). We used the TorchRay package tor to generate GradCAM and EP attribution maps.
Salient object detection (SOD)
To assess the need for heatmaps to explain a specific classifier’s decisions, we also considered a classifier-agnostic heatmap baseline. That is, we used a pre-trained state-of-the-art salient-object detection (SOD) method called PoolNet Liu et al. (2019a), which uses a ResNet-50 backbone pre-trained on ImageNet. PoolNet was trained to output a heatmap that highlights the salient object in an input image (Fig. 3), a process that does not take our image classifier into account. Thus, SOD serves as a strong saliency baseline for GradCAM and EP.
3-NN
To further understand the pros and cons of AMs, we compared them to a representative prototype-based explanation method (e.g. Nguyen et al. (2019); Chen et al. (2019); Nauta et al. (2020)). That is, for a given input image and a predicted label (e.g., “lorikeet”), we show the top-3 images nearest to the input, retrieved from the ImageNet training-set class of the predicted label (Fig. 1). To compute the distance between two images, we used the distance in the feature space of the last conv layer of the ResNet-34 classifier (as named in the PyTorch definition res or in He et al. (2016)). We chose this layer as our pilot study found it to be the most effective among the four main conv layers of ResNet-34.
While k-NN has a wide spectrum of applications in machine learning and computer vision Shakhnarovich et al. (2005), the effectiveness of human-AI collaboration using prototype-based explanations has rarely been studied Rudin et al. (2021). To the best of our knowledge, we provide the first user study that evaluates the effectiveness of prototype-based explanations (here, 3-NN) on human-AI team performance.
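The retrieval step behind the 3-NN explanation can be sketched as below. This is only an illustration under stated assumptions: feature vectors are taken as already extracted, the Euclidean distance is an assumption (the distance type is not recoverable from the text here), and the function name and toy data are hypothetical.

```python
import math


def top_k_neighbors(query_feat, train_feats, train_labels, predicted_label, k=3):
    """Indices of the k training examples of `predicted_label` closest to the query.

    Only training examples from the predicted class are considered, so the
    neighbors show what the classifier's predicted class looks like.
    """
    candidates = [i for i, label in enumerate(train_labels) if label == predicted_label]
    # Assumption: Euclidean distance between feature vectors.
    candidates.sort(key=lambda i: math.dist(query_feat, train_feats[i]))
    return candidates[:k]


# Toy 2-D "features" (made up); real features would come from a conv layer.
train_feats = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (0.1, 0.1), (3.0, 3.0)]
train_labels = ["lorikeet", "lorikeet", "lorikeet", "lorikeet", "bee eater", "lorikeet"]

neighbors = top_k_neighbors((0.2, 0.1), train_feats, train_labels, "lorikeet")
print(neighbors)
```

Note that index 4 is the closest point overall but is skipped because it belongs to a different class than the predicted label.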
2.4 User-study design
Our user-study experiments were designed and hosted on Gorilla Anwyl-Irvine et al. (2020). We recruited lay participants via Prolific Palan & Schitter (2018) (at $10.2/hr), which is known for a high-quality user base Peer et al. (2017). Each Prolific participant self-identified that English is their first language, which is the only demographic filter we used to select users on Prolific. Over the course of two small pilot studies and one main, large-scale study, we collected in total over 466 complete submissions (each per user) after discarding incomplete ones.
In the main study, after filtering out submissions by validation scores (described in Sec. 2.4.2), we had 161 and 159 qualified submissions for our ImageNet and Dogs experiments, respectively. Each of the 5 methods (i.e. all except AI-only) was tested by at least 30 users, and each (image, explanation) pair was seen by at least two users (statistics of users and trials in Sec. A8).
Our Gorilla study contains three main sets of screens: (1) Introduction—where each user is introduced to the study and relevant rules (details in Sec. A10); (2) Training; and (3) Test. Each user is randomly assigned a set of test images and only one explanation method (e.g. GradCAM heatmaps) to work with during the entire study.
To familiarize themselves with the image classification task (either ImageNet or Stanford Dogs), users were given five practice questions. After answering each question, they were shown feedback with groundtruth answers. In each training screen, we also described each component of the screen via annotations (see example screens in Sec. A10.2).
Validation and Test
After training, each user was asked to answer 40 Yes/No questions in total. Before each question, a user was provided with a short WordNet Miller (1995) definition of the predicted label and three random training-set images correctly classified into the predicted class with a confidence score (see an example in Fig. A5). To control the quality of user responses, out of 40 trials, we used 10 trials as validation cases where we carefully chose the input images such that we expected participants who followed our instructions to answer correctly (details in Sec. A10.3). We excluded submissions that scored below 10/10 and 8/10 on the 10 validation trials of the ImageNet and Dogs experiments, respectively. For the remaining (i.e., qualified) submissions, we used the results of their 30 non-validation trials in our study.
We wish to understand the effectiveness of visual explanations when AIs are correct vs. wrong, and when AIs face real vs. adversarial examples Szegedy et al. (2013). For both ImageNet and Dogs experiments, we used the ResNet-34 classifier to sample three types of images: (1) correctly-classified, real images; (2) misclassified, real images; and (3) adversarial images (i.e. also misclassified). In total, we used 3 types × 150 images = 450 images per dataset. Each image was then used to generate model predictions and explanations for comparing the 6 methods described in Sec. 2.3.
Filtering From the 50K-image ImageNet validation set, we sampled images for both ImageNet and Dogs experiments. To minimize the impact of low-quality images on users’ performance, we removed all 900 grayscale images and 897 images that have either width or height smaller than 224 px, leaving 48,203 and 5,881 images available for use in our ImageNet and Dogs experiments, respectively. For Dogs, we further excluded all 71 dog images mislabeled by ResNet-34 into non-dog categories (examples in Sec. A11) because they can trivialize explanations, yielding 5,810 Dogs images available for sampling.
Sampling natural images To understand human-AI team performance at varying levels of difficulty for humans, we randomly selected images from the pool of filtered, real images into three sets: Easy (E), Medium (M), and Hard (H).
Hard images are those correctly labeled by the classifier with a low confidence score and those mislabeled with high confidence. Conversely, the Easy set contains those correctly labeled with high confidence and those mislabeled with low confidence. The Medium set contains both correctly and incorrectly labeled images with a confidence score in between, i.e. when the AI is unsure (see confidence-score distributions in Sec. A7). In each set (E/M/H), we sampled 50 correctly-labeled images and 50 images mislabeled by the model. In sum, per dataset, there are 300 natural images divided evenly into 6 controlled bins (see Fig. A1 for the ratios of these bins in the original datasets).
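The Easy/Medium/Hard assignment above can be sketched as a small function. The confidence cut-offs `low` and `high` are hypothetical parameters (the actual ranges are not recoverable from the text here), as are the function name and the boundary conventions; the values in the usage lines are purely illustrative.

```python
def difficulty_bin(confidence, is_correct, low, high):
    """Map an image to one of the 6 controlled bins.

    `low` and `high` are hypothetical confidence cut-offs: scores below `low`
    count as low confidence, scores at or above `high` as high confidence,
    and everything in between as Medium.
    """
    outcome = "Correct" if is_correct else "Mislabeled"
    if low <= confidence < high:
        return ("Medium", outcome)
    high_confidence = confidence >= high
    # Easy: the confidence agrees with the outcome (correct and confident,
    # or wrong and unsure). Hard: the confidence is misleading.
    if high_confidence == is_correct:
        return ("Easy", outcome)
    return ("Hard", outcome)


# Illustrative cut-offs only.
print(difficulty_bin(0.95, True, low=0.3, high=0.8))
print(difficulty_bin(0.95, False, low=0.3, high=0.8))
```

Sampling 50 images per returned bin would then give the 300 natural images per dataset described above.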
Generating adversarial images After the filtering above, we took the remaining real images to generate adversarial examples via Foolbox Rauber et al. (2017) for the ResNet-34 classifier using the Projected Gradient Descent (PGD) framework Madry et al. (2017) under a norm bound for 40 steps of a fixed step size. We chose this setup because at weaker attack settings, most adversarial images became correctly classified after being saved as a JPEG file Liu et al. (2019b), defeating the purpose of causing AIs to misbehave. Here, for each dataset, we randomly sampled 150 adversarial examples (e.g., the input image in Fig. 1) that are misclassified at the time they are presented to users in the JPEG format and that often contain artifacts so small that we assume they do not bias human decisions. Following the natural-image sampling, we also divided the 150 adversarial images into three sets (E/M/H), each containing 50 images.
2.5 Automatic evaluation metrics for attribution maps
Many attribution methods have been published; however, most AMs were not tested on end-users but instead only assessed via proxy evaluation metrics. We aim to measure the correlation between three common metrics—Pointing Game Zhang et al. (2018), Intersection over Union (IoU) Zhou et al. (2016), and weakly-supervised localization (WSL) Zhou et al. (2016)—with the actual human-AI team performance in our user study.
All three metrics are based on the assumption that an AM for explaining a predicted label should highlight an image region that overlaps with a human-annotated bounding box (BB) for that category. For each of 300 real, correctly-classified images from each dataset (described in Sec. 2.4.3), we obtained its human-annotated BB from ILSVRC 2012 Russakovsky et al. (2015) for use in the three metrics.
Pointing game Zhang et al. (2018) is a common metric often reported in the literature (e.g. Selvaraju et al. (2017); Fong et al. (2019); Rebuffi et al. (2020); Du et al. (2018); Petsiuk et al. (2018); Wang et al. (2020)). For an input image, a generated attribution heatmap is considered a correct explanation (i.e. a hit) if its highest-intensity point lies inside the human-annotated BB. Otherwise, it is a miss. The accuracy for an explanation method is computed by averaging over all images. We used the TorchRay implementation of Pointing Game tor and its default hyperparameters (tolerance = 15).
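The hit/miss rule of the Pointing Game can be sketched in a few lines. This is a simplified illustration rather than the TorchRay implementation: it ignores the tolerance hyperparameter mentioned above, and the function names, the inclusive `(x_min, y_min, x_max, y_max)` box format, and the toy heatmap are assumptions.

```python
def pointing_game_hit(heatmap, bbox):
    """True if the highest-intensity pixel of `heatmap` lies inside `bbox`.

    heatmap: 2-D list indexed as heatmap[row][col].
    bbox: (x_min, y_min, x_max, y_max) in pixel coordinates, inclusive.
    Simplification: no tolerance radius around the maximum point.
    """
    best_value, best_xy = float("-inf"), None
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    x, y = best_xy
    x_min, y_min, x_max, y_max = bbox
    return x_min <= x <= x_max and y_min <= y <= y_max


def pointing_game_accuracy(heatmaps, bboxes):
    """Average hit rate over all (heatmap, bounding box) pairs."""
    hits = [pointing_game_hit(h, b) for h, b in zip(heatmaps, bboxes)]
    return sum(hits) / len(hits)


# Toy heatmap (made up) whose maximum sits at x=1, y=1.
toy_heatmap = [
    [0.1, 0.2, 0.0],
    [0.0, 0.9, 0.3],
    [0.0, 0.1, 0.0],
]
print(pointing_game_accuracy([toy_heatmap, toy_heatmap], [(1, 1, 2, 2), (2, 2, 2, 2)]))
```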
Intersection over Union The idea is to compute the agreement in Intersection over Union (IoU) Zhou et al. (2016) between a human-annotated BB and a binarized attribution heatmap. A heatmap was binarized at the method’s optimal threshold, which was found by sweeping across threshold values.
Weakly-supervised localization (WSL) is based on IoU scores Zhou et al. (2016). WSL counts a heatmap as correct if its binarized version has a BB that overlaps with the human-labeled BB at an IoU above a fixed cutoff. WSL is also commonly used, e.g. in Agarwal & Nguyen (2020); Selvaraju et al. (2017); Du et al. (2018). For both GradCAM and EP, we found the binarization threshold that corresponds to their best WSL scores on 150 correctly-classified images (i.e., excluding mislabeled images because human-annotated BBs are for the groundtruth labels).
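A rough sketch of the box-based IoU and the WSL decision follows. Several pieces are assumptions made for illustration: the box is drawn tightly around all above-threshold pixels (published implementations typically use the largest connected component instead), boxes use inclusive pixel coordinates, and `iou_cutoff` is a hypothetical parameter whose value in the example is illustrative only.

```python
def binarized_bbox(heatmap, threshold):
    """Tight box (x_min, y_min, x_max, y_max) around pixels >= threshold, or None."""
    xs, ys = [], []
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value >= threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def box_area(box):
    # Inclusive pixel coordinates, hence the +1 on each side.
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)


def box_iou(a, b):
    """Intersection over Union of two inclusive pixel boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = inter_w * inter_h
    return inter / (box_area(a) + box_area(b) - inter)


def wsl_correct(heatmap, human_bbox, threshold, iou_cutoff):
    """WSL hit: the heatmap's box overlaps the human box above `iou_cutoff`."""
    box = binarized_bbox(heatmap, threshold)
    return box is not None and box_iou(box, human_bbox) > iou_cutoff


# Toy 4x4 heatmap (made up) with a bright 2x2 blob in the middle.
toy_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.9, 0.0],
    [0.0, 0.7, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(wsl_correct(toy_map, (1, 1, 2, 3), threshold=0.5, iou_cutoff=0.5))
```

Sweeping `threshold` over a grid and keeping the value with the best average WSL score would mirror the threshold selection described above.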
3.1 On natural images, how effective are attribution maps in human-AI team image classification?
We wish to understand the effectiveness of attribution maps (GradCAM and EP) compared to four baselines (AI-only, Confidence, SOD, and 3-NN) in image classification by human-AI teams. We compared 6 methods on 2 natural-image sets: ImageNet and Stanford Dogs.
Because a given image can be mapped to one of the 6 controlled bins as described in Sec. 2.4.3, for each of the 6 methods, we computed its human-AI team accuracy for each of the 6 bins where each bin has exactly 50 images (Sec. A1 reports per-bin accuracy scores). To estimate the overall accuracy of a method on the original ImageNet and Dogs datasets, we computed the weighted sum of its per-bin accuracy scores where the weights are the frequencies of images appearing in each bin in practice. For example, 70.93% of the Dogs images are correctly-classified with a high confidence score (Fig. A1; Easy Correct).
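The re-weighting from balanced bins back to the original data distribution amounts to a weighted sum, sketched below. All numbers in the example are made up for illustration and are not results from the study; the function and bin names are hypothetical.

```python
def weighted_accuracy(per_bin_accuracy, bin_frequency):
    """Weighted sum of per-bin accuracies, weights = bin frequencies in the dataset."""
    total = sum(bin_frequency.values())
    # Dividing by the total keeps the result valid even if the
    # frequencies are given as counts instead of fractions.
    return sum(per_bin_accuracy[b] * bin_frequency[b] for b in bin_frequency) / total


# Illustrative values only (not taken from the paper).
bin_frequency = {
    "easy_correct": 0.70, "medium_correct": 0.05, "hard_correct": 0.05,
    "easy_wrong": 0.05, "medium_wrong": 0.05, "hard_wrong": 0.10,
}
per_bin_accuracy = {
    "easy_correct": 0.90, "medium_correct": 0.80, "hard_correct": 0.60,
    "easy_wrong": 0.70, "medium_wrong": 0.50, "hard_wrong": 0.20,
}
print(weighted_accuracy(per_bin_accuracy, bin_frequency))
```

Because frequent bins dominate the sum, a method that does well only on rare bins gains little in the overall estimate.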
ImageNet results On ImageNet, human-AI teams where humans use heatmaps (GradCAM, EP, or SOD) and confidence scores outperformed AI-only by an absolute gain of 6–8% in accuracy (Fig. 2a; 80.79% vs. 88.77%). That is, when teaming up, humans and AI together can achieve a better performance than AI alone. However, only half of such improvement can be attributed to the heatmap explanations (Fig. 2a; 84.79% vs. 88.77%). That is, users when presented with the input image and the classifier’s top-1 label and confidence score already obtained 4% boost over AI alone (84.79% vs. 80.79%).
Dogs results Interestingly, the trend did not carry over to fine-grained dog classification. On average, humans presented with (1) confidence scores only or (2) confidence scores with heatmaps all underperformed AI-only (Fig. 2b; 81.14% vs. 76.45%). One explanation is that ImageNet consists of 50% man-made entities, which include many everyday objects that users are familiar with. Therefore, human-AI teaming led to a substantial boost on ImageNet. In contrast, most lay users are not dog experts and therefore do not have the prior knowledge necessary to help them in dog identification, resulting in even worse performance than a trained AI (Fig. 2b; 81.14% vs. 76.45%). Interestingly, when users were provided with nearest neighbors, human-AI teams with 3-NN outperformed all other methods (Fig. 2b; 82.88%), including AI-only.
Both datasets On both ImageNet and Dog distributions, 3-NN is among the most effective. On average over 6 controlled bins, 3-NN also outperformed all other methods by a clear margin of 2.71% (Sec. A2 & A4). Interestingly, SOD users tend to reject more often than other users (Sec. A9), inadvertently causing a high human-AI team accuracy on AI-misclassified images (Sec. A13.5).
3.2 On adversarial images, how effective are attribution maps in human-AI team image classification?
As adversarial examples pose a major threat to AI-only systems Papernot et al. (2016), here we test how effective explanations are in improving human-AI team performance over AI-only, which is assumed to be 0% accurate under adversarial attacks. That is, on ImageNet and Dogs, we compare the accuracy of human-AI teams on 150 adversarial examples, all of which caused the classifier to misclassify. Note that a user was given a random mix of natural and adversarial test images and was not informed whether an input image was adversarial.
Adversarial ImageNet On both natural and adversarial ImageNet, AMs are on-par or more effective than showing confidence scores alone (Fig. 2c). Furthermore, the effect of 3-NN is a consistent +4% gain compared to Confidence (Fig. 2a & c; Confidence vs. 3-NN). Aligned with the results on natural ImageNet and Dogs, 3-NN remains the best method on adversarial ImageNet (Fig. 2c; 75.07%). See Sec. A13.4 for adversarial images that were only correctly-rejected by 3-NN users but not others.
Adversarial Dogs Interestingly, on adversarial Dogs, adding an explanation (either 3-NN or a heatmap) tends to cause users to agree with the model’s incorrect decisions (Fig. 2d). That is, heatmaps often only highlighted a coarse body region of a dog without pinpointing an exact feature that might be explanatory. Similarly, 3-NN often shows examples of a breed almost identical to the groundtruth (e.g. mountain dog vs. Gordon setter), which is hard for lay users to tell apart (see qualitative examples in Sec. A13.6).
3.3 On ImageNet, why is 3-NN more effective than attribution maps?
Analyzing the breakdown accuracy of 3-NN in each controlled set of Easy, Medium, and Hard, we found 3-NN to be the most effective method in the Easy and Hard categories of ImageNet (Table A3).
Easy When the AI mislabels with low confidence, 3-NN often presents contrasting evidence showing that the predicted label is incorrect, i.e. the nearest examples from the predicted class are distinct from the input image (Fig. 1; lorikeets have a distinct blue-orange pattern not found in bee eaters). The same explanation is our leading hypothesis for 3-NN’s effectiveness on adversarial Easy ImageNet images. More examples are in Sec. A13.3.
Hard Hard images often contain occlusions (Figs. A9 & A12), unusual instances of common objects (Fig. A13), or could reasonably be in multiple categories (Figs. A11 & A10). When the classifier is correct but with low confidence, we found 3-NN to be helpful in providing extra evidence for users to confirm AI’s decisions (Fig. 3). In contrast, heatmaps often only highlight the main object regardless of whether AI is correct or not (Fig. 1; GradCAM). More examples are in Sec. A13.1.
Medium Interestingly, 3-NN was the best method on the Easy and Hard set, but not on the Medium (Table A3). Upon a closer look, we surprisingly found 63% of misclassified images by 3-NN human-AI teams to have debatable groundtruth ImageNet labels (see Sec. A13.2 for failure cases of 3-NN).
See full-res images in Figs. A9 & A10.
3.4 How do automatic evaluation scores correlate with human-AI team performance?
IoU Zhou et al. (2016), WSL Zhou et al. (2016), and Pointing Game Zhang et al. (2018) are three attribution-map evaluation metrics commonly used in the literature, e.g. in Petsiuk et al. (2018); Fong et al. (2019); Fong & Vedaldi (2017); Agarwal & Nguyen (2020); Selvaraju et al. (2017). Here, we measure the correlation between these scores and the actual human-AI team accuracy to assess how well high performance on such benchmarks translates into real performance on downstream tasks.
Experiment We took the EP and GradCAM heatmaps of the 150 real images that were correctly-classified (see Sec. 2.4.3) from each dataset and computed their IoU, WSL, and Pointing Game scores (see Sec. 2.5). Note that WSL and Pointing Game scores are binary while IoU scores are real-valued. We computed the Pearson correlation between these scores and the human-AI team accuracy obtained in the previous human study (Sec. 2.4).
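The Pearson correlation used in this experiment can be computed as sketched below. The sketch assumes one metric score and one team-accuracy value per image; that pairing, the function name, and the toy numbers are assumptions made for illustration, not details confirmed by the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


# Toy per-image values (made up): a real-valued metric score such as IoU and
# the fraction of users who judged the AI's label correctly on that image.
iou_scores = [0.2, 0.5, 0.8, 0.4]
team_accuracy = [0.5, 1.0, 0.5, 1.0]
print(pearson_r(iou_scores, team_accuracy))
```

For binary scores such as WSL or Pointing Game hits, the same formula applies with 0/1 values in place of the real-valued scores.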
Results While EP and GradCAM are state-of-the-art methods under Pointing Game Fong et al. (2019) or WSL Fong & Vedaldi (2017), we surprisingly found the accuracy of users when using these AMs to correlate poorly with the IoU, WSL, and Pointing Game scores of heatmaps. Only in the case of GradCAM heatmaps for ImageNet (Fig. 3(a)) did the evaluation metrics show a small positive correlation with human-AI team accuracy. In all other cases, i.e. GradCAM on Dogs and EP on ImageNet and Dogs (Fig. 3(b)–c), the correlation is negligible. That is, the pointing or localization accuracy of feature attribution methods does not necessarily reflect their explanation effectiveness in helping users make correct decisions in downstream tasks.
3.5 Machine learning experts found Nearest-Neighbors more effective than GradCAM
We found in our previous study on lay users that AMs can be useful to human-AI teams, but not more than a simple 3-NN in most cases (Sec. 3.1–Sec. 3.3). Here, we aim to evaluate whether such conclusions carry over to the case of machine learning (ML) expert users who are familiar with feature attribution methods, nearest-neighbors, and ImageNet. Because ML experts have a deeper understanding of these algorithms than lay users, our expert study aims to measure the utility of these two visual explanation techniques to ML researchers and practitioners.
As GradCAM was the best attribution method in the lay-user ImageNet study (Fig. 2), we chose GradCAM as the representative for attribution maps and compare it with 3-NN on human-AI team image classification on ImageNet.
Experiment We repeated the same lay-user study but on a small set of 11 expert users who are Ph.D. students and postdoctoral researchers in the field of ML and computer vision from a range of academic institutions. We recruited five GradCAM users and six 3-NN users who are very familiar with feature attribution and nearest-neighbor algorithms, respectively. Similar to the lay-user study, each expert was presented with a set of 30 randomly chosen images that include both natural and adversarial images.
Results The experts working with 3-NN performed substantially better than the GradCAM experts (Table 1; mean accuracy of 76.67% vs. 68.00%). 3-NN is consistently more effective on both natural and adversarial image subsets. Interestingly, the performance of GradCAM users also varied about 3× more (standard deviation of 8.69% vs. 2.98% in accuracy). Aligned with the lay-user study, here we found 3-NN to be more effective than AMs in improving human-AI team performance where the users are domain experts familiar with the mechanics of how explanations were generated.
4 Related Work
Evaluating confidence scores on humans
AI confidence scores have been found to improve users’ trust in an AI’s decisions Zhang et al. (2020) and to be effective in improving human-AI team prediction accuracy on several NLP tasks Bansal et al. (2020a). In this work, we do not measure user trust but only the complementary human-AI team performance. To the best of our knowledge, our work is the first to perform such a human evaluation of AI confidence scores for image classification.
Evaluating explanations on humans In NLP, using attribution methods as a word-highlighter has been found to improve user performance in question answering Feng & Boyd-Graber (2019). Bansal et al. (2020a) found that such human-AI team performance improvement was because the explanations tend to make users more likely to accept the AI’s decisions regardless of whether the AI is correct or not. We found consistent results that users with explanations tend to accept the AI’s predicted labels more (see Sec. A9), with the exception of SOD users who reject more.
In image classification, Chu et al. (2020) found that presenting Integrated-Gradient heatmaps to users did not significantly improve human accuracy on predicting age from facial photos. Different from Chu et al. (2020), our study tested multiple attribution methods (GradCAM, EP, SOD) on the ImageNet classification task, which most attribution methods were evaluated on under proxy metrics.
Shen & Huang (2020) showed users GradCAM Selvaraju et al. (2017), EP Fong et al. (2019), and SmoothGrad Smilkov et al. (2017) heatmaps and measured user performance in harnessing the heatmaps to identify a label that an AI incorrectly predicts. While they measured the effect of showing all three heatmaps to users, we compared each method separately (GradCAM vs. EP vs. SOD) in a larger study. Similar to Shen & Huang (2020), Alqaraawi et al. (2020) tested the capability of LRP Binder et al. (2016) attribution maps in helping users understand the decision-making process of an AI. Our work differs from the above two papers Shen & Huang (2020); Alqaraawi et al. (2020) in that we did not design our own task to measure human understanding of AIs, but instead measured human-AI team performance on a standard downstream task of ImageNet.
5 Discussion and Conclusion
Limitations An inherent limitation of our study is that it is not possible to control the amount of prior knowledge that a participant has before entering the study. For example, a human with strong dog expertise may perform better at fine-grained dog classification. In that case, the utility of explanations is unknown in our study. We attempted to estimate the effect of prior knowledge on human-AI team accuracy by asking each user whether they know a class before each trial. We found prior knowledge to account for 1-6% in accuracy (Sec. A5). Due to COVID and the large scale, our study was conducted online, which made it infeasible for us to control various physical factors (e.g. users performing other activities during the experiment) compared to a physical in-lab study.
To our knowledge, our work is the first to (1) evaluate human-AI team performance on the common ImageNet classification; (2) assess explanations on adversarial examples; and (3) reveal the weak correlation between automatic evaluation metrics (Pointing Game, IoU Zhou et al. (2016), and WSL Zhou et al. (2016)) and the actual team performance. Such poor correlation encourages future interpretability research to include humans in their evaluation and to rethink the current automatic metrics. We also showed the first evidence in the literature that a simple 3-NN can outperform existing attribution maps, suggesting that a combination of the two explanation types might be useful for future work. The superiority of 3-NN also suggests that prototypical Chen et al. (2018); Goyal et al. (2019) and visually-grounded explanations Hendricks et al. (2016, 2018) may be more effective than heatmaps.
- (1) Torchray/attribution_benchmark.py at master · facebookresearch/torchray. https://github.com/facebookresearch/TorchRay/blob/master/examples/attribution_benchmark.py. (Accessed on 04/28/2021).
- (2) The fbi’s capitol riot investigation used surveillance technology that advocates say threatens civil liberties - the washington post. https://www.washingtonpost.com/technology/2021/04/02/capitol-siege-arrests-technology-fbi-privacy/. (Accessed on 04/11/2021).
- (3) The new lawsuit that shows facial recognition is officially a civil rights issue | mit technology review. https://www.technologyreview.com/2021/04/14/1022676/robert-williams-facial-recognition-lawsuit-aclu-detroit-police/. (Accessed on 04/15/2021).
- (4) vision/resnet.py at master · pytorch/vision. https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py#L183. (Accessed on 04/28/2021).
- (5) facebookresearch/torchray: Understanding deep networks via extremal perturbations and smooth masks. https://github.com/facebookresearch/TorchRay. (Accessed on 04/28/2021).
- Adebayo et al. (2018) Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. arXiv preprint arXiv:1810.03292, 2018.
- Agarwal & Nguyen (2020) Agarwal, C. and Nguyen, A. Explaining image classifiers by removing input features using generative models. In Proceedings of the Asian Conference on Computer Vision, 2020.
- Alqaraawi et al. (2020) Alqaraawi, A., Schuessler, M., Weiß, P., Costanza, E., and Berthouze, N. Evaluating saliency map explanations for convolutional neural networks: a user study. In Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 275–285, 2020.
- Anwyl-Irvine et al. (2020) Anwyl-Irvine, A. L., Massonnié, J., Flitton, A., Kirkham, N., and Evershed, J. K. Gorilla in our midst: An online behavioral experiment builder. Behavior research methods, 52(1):388–407, 2020.
- Bansal et al. (2020a) Bansal, G., Wu, T., Zhu, J., Fok, R., Nushi, B., Kamar, E., Ribeiro, M. T., and Weld, D. S. Does the whole exceed its parts? the effect of ai explanations on complementary team performance. arXiv preprint arXiv:2006.14779, 2020a.
- Bansal et al. (2020b) Bansal, N., Agarwal, C., and Nguyen, A. Sam: The sensitivity of attribution methods to hyperparameters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8673–8683, 2020b.
- Bendale & Boult (2016) Bendale, A. and Boult, T. E. Towards open set deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1563–1572, 2016.
- Binder et al. (2016) Binder, A., Bach, S., Montavon, G., Müller, K.-R., and Samek, W. Layer-wise relevance propagation for deep neural network architectures. In Information science and applications (ICISA) 2016, pp. 913–922. Springer, 2016.
- Chattopadhay et al. (2018) Chattopadhay, A., Sarkar, A., Howlader, P., and Balasubramanian, V. N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847. IEEE, 2018.
- Chen et al. (2018) Chen, C., Li, O., Tao, C., Barnett, A. J., Su, J., and Rudin, C. This looks like that: deep learning for interpretable image recognition. arXiv preprint arXiv:1806.10574, 2018.
- Chen et al. (2019) Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., and Su, J. K. This looks like that: Deep learning for interpretable image recognition. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/adf7ee2dcf142b0e11888e72b43fcb75-Paper.pdf.
- Chu et al. (2020) Chu, E., Roy, D., and Andreas, J. Are visual explanations useful? a case study in model-in-the-loop prediction. arXiv preprint arXiv:2007.12248, 2020.
- Covert et al. (2020) Covert, I., Lundberg, S., and Lee, S.-I. Feature removal is a unifying principle for model explanation methods. arXiv preprint arXiv:2011.03623, 2020.
- Das & Rad (2020) Das, A. and Rad, P. Opportunities and challenges in explainable artificial intelligence (xai): A survey. arXiv preprint arXiv:2006.11371, 2020.
- Doshi-Velez & Kim (2017) Doshi-Velez, F. and Kim, B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
- Doshi-Velez et al. (2017) Doshi-Velez, F., Kortz, M., Budish, R., Bavitz, C., Gershman, S., O’Brien, D., Schieber, S., Waldo, J., Weinberger, D., and Wood, A. Accountability of ai under the law: The role of explanation. arXiv preprint arXiv:1711.01134, 2017.
- Du et al. (2018) Du, M., Liu, N., Song, Q., and Hu, X. Towards explanation of dnn-based prediction with guided feature inversion. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1358–1367, 2018.
- Elliott et al. (2019) Elliott, A., Law, S., and Russell, C. Perturbations on the perceptual ball. arXiv preprint arXiv:1912.09405, 2019.
- Feng & Boyd-Graber (2019) Feng, S. and Boyd-Graber, J. What can ai do for me? evaluating machine learning interpretations in cooperative play. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pp. 229–239, 2019.
- Fong et al. (2019) Fong, R., Patrick, M., and Vedaldi, A. Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2950–2958, 2019.
- Fong & Vedaldi (2017) Fong, R. C. and Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437, 2017.
- Goodman & Flaxman (2017) Goodman, B. and Flaxman, S. European union regulations on algorithmic decision-making and a “right to explanation”. AI Magazine, 38(3):50–57, 2017.
- Goyal et al. (2019) Goyal, Y., Wu, Z., Ernst, J., Batra, D., Parikh, D., and Lee, S. Counterfactual visual explanations. In International Conference on Machine Learning, pp. 2376–2384. PMLR, 2019.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Hendricks et al. (2016) Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., and Darrell, T. Generating visual explanations. In European Conference on Computer Vision, pp. 3–19. Springer, 2016.
- Hendricks et al. (2018) Hendricks, L. A., Hu, R., Darrell, T., and Akata, Z. Generating counterfactual explanations with natural language. arXiv preprint arXiv:1806.09809, 2018.
- Khosla et al. (2011) Khosla, A., Jayadevaprakash, N., Yao, B., and Fei-Fei, L. Novel dataset for fine-grained image categorization: Stanford dogs. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
- Lapuschkin et al. (2016) Lapuschkin, S., Binder, A., Montavon, G., Muller, K.-R., and Samek, W. Analyzing classifiers: Fisher vectors and deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2912–2920, 2016.
- Liu et al. (2019a) Liu, J.-J., Hou, Q., Cheng, M.-M., Feng, J., and Jiang, J. A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3917–3926, 2019a.
- Liu et al. (2019b) Liu, Z., Liu, Q., Liu, T., Xu, N., Lin, X., Wang, Y., and Wen, W. Feature distillation: Dnn-oriented jpeg compression against adversarial examples. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 860–868. IEEE, 2019b.
- Lu et al. (2020) Lu, Y., Guo, W., Xing, X., and Stafford Noble, W. Robust decoy-enhanced saliency maps. arXiv e-prints, pp. arXiv–2002, 2020.
- Lundberg & Lee (2017) Lundberg, S. and Lee, S.-I. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874, 2017.
- Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Marcel & Rodriguez (2010) Marcel, S. and Rodriguez, Y. Torchvision the machine-vision package of torch. In Proceedings of the 18th ACM international conference on Multimedia, pp. 1485–1488, 2010.
- Miller (1995) Miller, G. A. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
- Nauta et al. (2020) Nauta, M., Jutte, A., Provoost, J., and Seifert, C. This looks like that, because… explaining prototypes for interpretable image recognition. arXiv preprint arXiv:2011.02863, 2020.
- Nguyen et al. (2015) Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR ’15, pp. 427–436. IEEE, June 2015. doi: 10.1109/CVPR.2015.7298640.
- Nguyen et al. (2019) Nguyen, A., Yosinski, J., and Clune, J. Understanding neural networks via feature visualization: A survey. In Explainable AI: interpreting, explaining and visualizing deep learning, pp. 55–76. Springer, 2019.
- Palan & Schitter (2018) Palan, S. and Schitter, C. Prolific. ac—a subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17:22–27, 2018.
- Papernot et al. (2016) Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., and Swami, A. The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P), pp. 372–387. IEEE, 2016.
- Peer et al. (2017) Peer, E., Brandimarte, L., Samat, S., and Acquisti, A. Beyond the turk: Alternative platforms for crowdsourcing behavioral research. Journal of Experimental Social Psychology, 70:153–163, 2017.
- Petsiuk et al. (2018) Petsiuk, V., Das, A., and Saenko, K. Rise: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421, 2018.
- Rajpurkar et al. (2017) Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C., Shpanskaya, K., et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
- Rauber et al. (2017) Rauber, J., Brendel, W., and Bethge, M. Foolbox: A python toolbox to benchmark the robustness of machine learning models. In Reliable Machine Learning in the Wild Workshop, 34th International Conference on Machine Learning, 2017. URL http://arxiv.org/abs/1707.04131.
- Rebuffi et al. (2020) Rebuffi, S.-A., Fong, R., Ji, X., and Vedaldi, A. There and back again: Revisiting backpropagation saliency methods. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8839–8848, 2020.
- Rudin et al. (2021) Rudin, C., Chen, C., Chen, Z., Huang, H., Semenova, L., and Zhong, C. Interpretable machine learning: Fundamental principles and 10 grand challenges. arXiv preprint arXiv:2103.11251, 2021.
- Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
- Selvaraju et al. (2017) Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626, 2017.
- Shakhnarovich et al. (2005) Shakhnarovich, G., Darrell, T., and Indyk, P. Nearest-neighbor methods in learning and vision. In Neural Information Processing, 2005.
- Shen & Huang (2020) Shen, H. and Huang, T.-H. How useful are the machine-generated interpretations to general users? a human evaluation on guessing the incorrectly predicted labels. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 8, pp. 168–172, 2020.
- Simonyan et al. (2013) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Smilkov et al. (2017) Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
- Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Wang et al. (2020) Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., Mardziel, P., and Hu, X. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 24–25, 2020.
- Wu et al. (2019) Wu, N., Phang, J., Park, J., Shen, Y., Huang, Z., Zorin, M., Jastrzębski, S., Févry, T., Katsnelson, J., Kim, E., Wolfson, S., Parikh, U., Gaddam, S., Lin, L. L. Y., Ho, K., Weinstein, J. D., Reig, B., Gao, Y., Pysarenko, H. T. K., Lewin, A., Lee, J., Airola, K., Mema, E., Chung, S., Hwang, E., Samreen, N., Kim, S. G., Heacock, L., Moy, L., Cho, K., and Geras, K. J. Deep neural networks improve radiologists’ performance in breast cancer screening. IEEE Transactions on Medical Imaging, pp. 1–1, 2019. doi: 10.1109/TMI.2019.2945514.
- Zhang et al. (2018) Zhang, J., Bargal, S. A., Lin, Z., Brandt, J., Shen, X., and Sclaroff, S. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.
- Zhang et al. (2020) Zhang, Y., Liao, Q. V., and Bellamy, R. K. Effect of confidence and explanation on accuracy and trust calibration in ai-assisted decision making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 295–305, 2020.
- Zhou et al. (2016) Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929, 2016.
Appendix A1 Breakdown of human-AI team accuracy on image subsets
Controlled-to-original accuracy conversion
While our experiments were not conducted on the real distributions of ImageNet and Dogs, we can still estimate the human accuracy with each explanation method on the original datasets using the real ratios of subsets presented in Fig. A1.
The estimated accuracy is reported in Fig. 2. Note that the above conversion was applied to natural images only: adversarial images were synthetically generated, so no original distribution exists for them.
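The conversion itself is not spelled out in this appendix. One natural reading is a ratio-weighted average of the per-subset accuracies, and the sketch below illustrates that reading only. The function name, the subset keys, and all numbers are hypothetical and are not taken from the study.

```python
def estimate_original_accuracy(subset_accuracy, subset_ratio):
    """Weight each subset's controlled accuracy by that subset's share of the
    original dataset (an assumed reading of the conversion, not the paper's code)."""
    if set(subset_accuracy) != set(subset_ratio):
        raise ValueError("accuracy and ratio must cover the same subsets")
    total = sum(subset_ratio.values())
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    # Dividing by the total lets the ratios be given as counts or as fractions.
    weighted = sum(subset_accuracy[name] * subset_ratio[name]
                   for name in subset_accuracy)
    return weighted / total


# Hypothetical per-subset accuracies and image counts, for illustration only.
accuracy = {"easy_correct": 0.9, "hard_wrong": 0.6}
counts = {"easy_correct": 300, "hard_wrong": 100}
estimate = estimate_original_accuracy(accuracy, counts)
```

Under this reading, subsets that are rare in the original dataset contribute little to the estimate even if they make up a full bin of the controlled distribution.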
Appendix A2 How effective are attribution maps in improving human-AI team accuracy on the controlled distribution?
On average, over the controlled distribution (i.e. 6 bins, each with 50 images), we found that 3-NN was the best method for helping end-users categorize ImageNet images. 3-NN outperformed the other explanation methods by at least 2.71% (Table A2; 76.59%). However, in fine-grained dog classification, explanations seemed to have low utility: all methods scored nearly the same and close to the random baseline (i.e. 50% accuracy).
Appendix A3 Are attribution maps more effective than 3-NN in improving human accuracy on Easy, Medium, and Hard images of AI on the controlled distribution?
In Table A3, we found that 3-NN outperformed the attribution methods and the other baselines by a wide margin on easy and hard images of the ImageNet classification task. It surpassed the second-best method in each image set (EP on E and Confidence on H) by 3.13% on average.
Yet, 3-NN provided very little information for users in the Dogs experiment. None of the attribution methods performed well on easy and hard images of dogs. Because the differences among dog classes are minimal, only explanations that correctly point out the discriminative features could help users.
When AI classifies an image with a confidence score around 0.5 (medium images), GradCAM was the best explanation method on both ImageNet and Dogs data, achieving 75.24% and 58.08%, respectively. However, the usefulness of explanation methods on dog images was insignificant since the net improvements over the random-choice baseline were minimal.
Appendix A4 Are attribution maps more effective than 3-NN in improving human accuracy on correct/wrong images of the controlled distribution?
Explanations increase the chance that humans agree with AI’s predictions in sentiment classification and question answering Bansal et al. (2020a). While we examined the human accuracy on images that AI classified correctly and wrongly, it has been shown that explanations also encouraged humans to accept AI’s predictions rather than reject them in image classification.
In Table A4, regarding ImageNet images, 3-NN was the most useful method for AI error identification, at 63.33%. In addition, the explanations from 3-NN and GradCAM seemed to benefit humans equally when AI classified correctly (89.28% and 89.14%, respectively).
EP and 3-NN did the best in explaining the network’s correct predictions on dog images (79.03% and 79.56%). Surprisingly, no method outperformed the random-choice baseline (50%) in identifying misclassifications of ResNet-34 on Dogs images.
Appendix A5 How does prior knowledge affect human accuracy?
Although we tried to mitigate the effect of prior knowledge by giving users the definition and sample images of categories, we still would like to examine how differently users score when never-seen-before vs. already-known objects are presented.
For each method, we measure the gap between the accuracy on known images and the accuracy on unknown images. Looking into Table A5, SOD was affected the most by prior knowledge compared to the other explanation methods, with a gap of 6.38% in ImageNet and 2.21% in Dogs. A possible explanation for this observation is that because SOD is model-agnostic, it provides the least classifier-relevant information, and therefore humans would have to rely on their prior knowledge the most to make decisions.
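As an illustration of how such a gap can be computed from per-trial records, here is a minimal sketch. It assumes each trial is stored as a pair (user said they knew the class, user's decision was correct) and that the gap is taken as known minus unknown. Both the data layout and the sign convention are assumptions, and the sample trials are made up.

```python
def prior_knowledge_gap(trials):
    """Return accuracy on known-class trials minus accuracy on unknown-class trials.

    `trials` is an iterable of (knows_class, decision_correct) boolean pairs.
    This layout is hypothetical; the study's actual records may differ.
    """
    known = [correct for knows, correct in trials if knows]
    unknown = [correct for knows, correct in trials if not knows]
    if not known or not unknown:
        raise ValueError("need at least one known and one unknown trial")
    # Booleans sum as 0/1, so sum/len is the fraction of correct decisions.
    return sum(known) / len(known) - sum(unknown) / len(unknown)


# Made-up trials for one method: 3 of 4 known correct, 1 of 2 unknown correct.
example_trials = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, False),
]
gap = prior_knowledge_gap(example_trials)
```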
Appendix A6 AI-only thresholds
Using the confidence score alone to automate the task, we found the optimal thresholds to be 0.55 and 0.50 for ImageNet and Dogs, respectively (see Table A6). These threshold values were tuned on around 50K ImageNet and 6K Stanford Dogs images.
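A threshold of this kind can be tuned with a simple sweep. The sketch below assumes the AI-only baseline accepts a prediction whenever its confidence is at least the threshold, and picks the threshold that maximizes agreement with the ground-truth correctness of the prediction. The candidate grid (steps of 0.05) and the sample data are hypothetical, not the settings used in the study.

```python
def best_confidence_threshold(confidences, is_correct, candidates=None):
    """Sweep candidate thresholds and return (threshold, accuracy) for the best one.

    A prediction is accepted when confidence >= threshold; a decision counts as
    right when acceptance matches whether the prediction was actually correct.
    """
    if len(confidences) != len(is_correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    if candidates is None:
        candidates = [i / 20 for i in range(21)]  # assumed grid: 0.00 to 1.00
    best_t, best_acc = None, -1.0
    for t in candidates:
        hits = sum((c >= t) == y for c, y in zip(confidences, is_correct))
        acc = hits / len(confidences)
        if acc > best_acc:  # strict '>' keeps the lowest threshold among ties
            best_t, best_acc = t, acc
    return best_t, best_acc


# Made-up confidence scores and correctness labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
labels = [True, True, True, False, False, False]
threshold, accuracy = best_confidence_threshold(scores, labels)
```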
Appendix A7 Original distributions of ImageNet and Stanford Dogs
The accompanying figure shows the original distributions of ImageNet and Dogs by confidence intervals. Although we skipped images having confidence in [0.3, 0.4) and [0.6, 0.8), the human accuracy on the remaining intervals can still represent that of the original distributions because the skipped intervals contain relatively few images.
Table A7 shows the distributions of easy, medium, and hard images of the ImageNet and Dogs datasets. Note that because we ignored a few confidence intervals, the total numbers of images are not 50K and 6K.
|Images||Percentage (%)||Images||Percentage (%)|
Appendix A8 Participants’ statistics
Table A8 shows the numbers of users, trials, and users per image in ImageNet and Dogs experiment for each method.
|Users||Trials||Users per image||Users||Trials||Users per image|
Appendix A9 Participants’ acceptance-rejection rate
Table A9 shows the ratios of acceptance and rejection across methods in ImageNet and Dogs (natural examples only). Consistent with Bansal et al. (2020a), we found that explanations, except for SOD, increase the chance that humans accept AI’s predictions. As mentioned in Sec. A13.5, SOD users were the most likely to reject AI’s labels because this baseline gave users low-quality heatmaps.
Appendix A10 Sample experiment screens
Here we show our experiment user interface screen by screen.
We introduced Sam, the AI, and explained the tasks to participants. We then explicitly restricted the devices that could be used for the experiments to ensure that display resolution would not affect human decisions.
We gave users 5 trials, each followed by a feedback screen on the user’s decision. For GradCAM, EP, and SOD, as their heatmaps have a similar format, the same interpretation was provided, as shown in Fig. 3(a). 3-NN displays to users three correct images of the predicted class in Fig. 3(b), while Confidence only shows the input image, the predicted label, and the confidence score.
A10.3 Validation and Test
While we had 10 trials in validation and 30 trials in test, we did not tell participants about the validation phase so that they would not treat those trials differently. These 40 trials were shown consecutively, with the definition and sample images of the predicted class given beforehand, as in Fig. A5. No feedback was given in validation and test.
As the purpose of validation was to filter out users not paying enough attention to the experiments (e.g. random clicking), we carefully chose images that ResNet-34 clearly misclassified or clearly classified correctly (Fig. A6) to check users’ attention. As in the test trials, the definition and sample images of the predicted category were given in advance.
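The filtering step can be pictured as a simple attention check over the validation trials. The sketch below is only an illustration: the passing criterion `min_correct`, the data layout, and the user IDs are all hypothetical, since the exact cutoff is not stated in this section.

```python
def keep_attentive_users(decisions_by_user, ai_is_correct, min_correct):
    """Return the IDs of users who pass the attention check.

    `decisions_by_user` maps a user ID to that user's accept (True) / reject
    (False) decisions on the validation trials; `ai_is_correct` says whether
    the AI's label was right on each trial. `min_correct` is a hypothetical
    cutoff on the number of validation trials answered correctly.
    """
    kept = []
    for user, decisions in decisions_by_user.items():
        if len(decisions) != len(ai_is_correct):
            raise ValueError(f"user {user!r} has the wrong number of trials")
        # Accepting a correct label or rejecting a wrong one is a right decision.
        n_right = sum(d == truth for d, truth in zip(decisions, ai_is_correct))
        if n_right >= min_correct:
            kept.append(user)
    return kept


# Made-up example with three validation trials and two users.
validation_truth = [True, False, True]
user_decisions = {"u1": [True, False, True], "u2": [False, True, False]}
attentive = keep_attentive_users(user_decisions, validation_truth, min_correct=2)
```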
Appendix A11 Images mislabeled into non-dog categories (MIND samples)
In the Dogs experiment, images misclassified into non-dog categories (MIND) were discarded because users can often instantly reject those images without any explanation. As shown in Fig. A7, almost all MIND samples contain more than one object.
Appendix A12 Example explanations displayed to users
Below are examples of explanations taken from the screens displayed to users during our user study. Here, we show the differences between GradCAM, EP, SOD, and 3-NN explanations for the same input image predicted as “american coot” (Fig. A8).
Appendix A13 Qualitative examples supporting our findings
A13.1 Hard, real ImageNet images that were correctly labeled by 3-NN users but not GradCAM, EP, or SOD users
Regarding hard images, we observed that images correctly labeled by 3-NN users but not by attribution-map or SOD users often contain multiple concepts (Fig. A11), are of low quality (Fig. A9), show look-alike objects (Fig. A10 and A14), show only a part of the main object (Fig. A12), or show objects with unusual appearances (Fig. A13). On these images, because heatmaps did not highlight the discriminative features and the confidence score was low, users tended to reject AI’s labels. In contrast, 3-NN helped users gain confidence that AI is correct when there are multiple plausible labels.
A13.2 Medium, AI-misclassified, real ImageNet images that were incorrectly accepted by 3-NN users
A13.3 Easy, AI-misclassified, ImageNet images that were correctly rejected by 3-NN users but not GradCAM and SOD users
A13.4 Adversarial ImageNet images that were correctly rejected by 3-NN users but not GradCAM, EP, or SOD users
In Fig. 2, what made 3-NN more effective than attribution maps on Adversarial ImageNet?
As the adversarial attacks fooled AI with small perturbations, the misclassified labels are not far from the ground truth (e.g. bee eater to lorikeet in Fig. A22 or collie to shetland sheepdog in Fig. A26). The heatmap highlights focused on parts of the main object, which made the explanations compelling to users. 3-NN helped users differentiate the two categories by looking at the contrastive images of the predicted label and the ground truth (Fig. A21 and Fig. A22).
A13.5 AI-misclassified Dogs images that are correctly rejected by SOD users but not GradCAM, EP, or 3-NN users
In Table A1, what made SOD significantly more effective than other methods on correcting AI-misclassified images of Dogs?
We found that GradCAM and EP often highlighted the entire face of the dogs, which made the heatmaps persuasive to users although the predictions were wrong. Regarding 3-NN, the misclassified category is visually similar to the ground truth (i.e. eskimo dog vs. malamute), which was challenging for lay users to distinguish. We assume that users expected explanations to be as specific and relevant as possible because the differences among breeds are minimal. SOD highlighted the entire body of the dogs (Fig. A23) or even irrelevant areas (Fig. A24). This explains why users with SOD tended to ignore rather than trust the AI. While 3-NN users leveraged the information of nearest neighbors to identify AI’s errors, SOD users rejected predictions because of the heatmaps’ low quality, which unintentionally improved the accuracy on wrong Dogs images. The rejection rates of SOD users were highest in both ImageNet (38.03%) and Dogs (37.34%) as shown in Table A9. Indeed, due to the high rejection rate of SOD users, the accuracy on correct Dogs images was lowest as shown in Table A1.