With recent and continuing advancements in robust deep neural networks, machine learning and artificial intelligence models are growing in prominence for automated decision-making support, especially in critical areas such as financial analysis, medical management systems, military planning, and autonomous systems. In such cases, human experts, operators, and decision makers can take advantage of new machine learning techniques to assist in taking real-world actions. To do so, however, these people need to be able to trust and understand the machine's outputs, predictions, and recommendations. Unlike shallow machine learning models, which can be interpretable and easier to understand in terms of classification logic, deep learning models are significantly more complex and are often considered black-box models due to their poor transparency and understandability.
Thus, for machine-assisted decision-making using new machine learning technology, advancements are needed in achieving explainability and supporting human understanding. This is the primary goal of recent research thrusts in explainable artificial intelligence (XAI). For a system to effectively serve human users, people need to be able to understand the reasoning behind the machine's decisions and actions. Numerous researchers have recently been working to advance explainability through methods such as visualization [10, 4, 11] or model simplification [9, 16]. Different types of interpretability and explainability are possible. For human explainability, for instance, local explanations can be used to explain the connection between a single input instance and the resulting machine output, while global explanations aim to provide a more holistic presentation of how the system works as a whole or for collections of instances. While a multi-faceted topic, the ultimate goal is for people to understand machine models, and it is therefore important to involve human feedback and reasoning as a requisite component for evaluating the explainability or understandability of XAI methods and models. However, since the majority of research in the area of XAI is led by experts in machine learning and artificial intelligence, relatively little work has involved human evaluation.
In this paper, we describe a novel evaluation methodology for assessing the relevance and appropriateness of local explanations of machine output. We present a human-grounded evaluation benchmark for evaluating instance explanations of images and textual data. The benchmark consists of human-annotated samples of images and text articles that approximate the regions most important for human understanding and classification. By comparing the explanation results from classification models to the benchmark's annotation meta-data, it is possible to evaluate the quality and appropriateness of XAI local explanations. To demonstrate the utility of such a benchmark, we perform a quantitative evaluation of explanations generated from a recent machine learning algorithm. We have also made the benchmark publicly available online for research purposes.
Researchers have argued the importance of interpretable machine learning and how the demand for it arises from the incompleteness of problem formalization (e.g., ). For instance, in many cases, a user might lose trust in a system when doubting whether the machine has taken all necessary factors into account. In this situation, an interpretable model can assist the user by generating explanations. Lipton  states that interpretable machine learning is needed when there is a mismatch between machine objectives and real-world scenarios, which means a transparent machine learning model should share information and decision-making details with a user to prevent problems of mismatched objectives. The goals of XAI naturally motivate a merger between the human-computer interaction (HCI) and artificial intelligence (AI) disciplines for the creation and evaluation of solutions that are interpretable and explainable for users. It is important that these communities work together to achieve useful and meaningful explanations of machine learning technology.
2.1 Explanation Strategies
Interpretable models such as tree-based models  and rule lists  have been proposed as examples that can be directly explained or summarized using relatively simple or common visualization methods. For more complex black-box models such as deep neural networks (DNNs), other methods have been explored to generate local explanations for individual instances as well as global explanations of the entire model. Local explanations in the form of saliency maps are a popular way to generate explanations for DNNs. This approach presents the features with the greatest contribution to the classification. For example, Simonyan et al.  used the output gradient to generate a mask of the pixels the model relies on for the classification task. In other work, Ribeiro et al.  presented a model-agnostic algorithm that generates local explanations for any classifier in different data domains. As another example, Ross et al.  proposed an iterative approach using input gradients that can improve explanations by constraining them with a loss function.
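To make the gradient-saliency idea above concrete, the following is a minimal sketch for the special case of a linear classifier, where the gradient of a class score with respect to the input is simply that class's weight vector; the function and variable names are illustrative, not from any of the cited works.

```python
import numpy as np

# Sketch of gradient saliency (in the spirit of Simonyan et al.) for a
# linear classifier: score_c(x) = w_c . x + b_c, so the gradient of the
# class score w.r.t. each input pixel is just the weight vector w_c.
def gradient_saliency(weights, target_class, image_shape):
    """Per-pixel importance: |d score_c / d pixel| for a linear model."""
    grad = weights[target_class]           # gradient equals the class weights
    return np.abs(grad).reshape(image_shape)

# Hypothetical 2-class model over 4x4 grayscale images (16 pixels).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 16))
saliency = gradient_saliency(W, target_class=1, image_shape=(4, 4))
```

For a deep network, the same recipe applies except that the gradient is obtained by backpropagating the class score to the input rather than reading it off the weights.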
Dimensionality-reduction techniques such as t-SNE  generate a 2D mapping of high-dimensional data to visualize spatial relations of data clusters. Visual analytics tools such as ActiVis  and Squares  take advantage of a 2D mapping of data points along with feature-cluster and instance-cluster views to help users with performance analysis and with understanding classification logic.
2.2 Evaluating Explanations
In considering evaluation approaches for XAI, Doshi-Velez and Kim  proposed three categories: application-grounded, human-grounded, and functionality-grounded evaluations. These categories vary in evaluation cost and inclusiveness. In this taxonomy, functionality-grounded evaluation uses formal definitions of interpretability as a proxy for qualifying explanations, and no human subjects are involved. Application-grounded evaluation is done with expert users reviewing the model and explanations in real tasks. Analytics tools like ActiVis , developed with a participatory design procedure and evaluated through case studies, showed satisfactory results from expert users in the machine learning field. Krause et al.  also proposed a visual analytics tool for the medical domain to debug binary classifiers with instance-level explanations. They worked closely with the medical team and hospital management to optimize processing times in the emergency department of the hospital.
In contrast, human-grounded evaluations are generally performed with non-expert users and simplified tasks. To date, there are few research studies involving human subjects to assess XAI. Ribeiro et al.  presented an experiment to study whether users can identify the best classifier using its explanations. In their study, participants reviewed explanations generated for two image classifiers. They also performed a small study where the researchers intentionally trained a classifier incorrectly with biased data to study whether participants could identify the connection between the incorrect features and the resulting erroneous classifications. Also studying interpretability for people, Lakkaraju et al.  conducted research with interpretable decision sets, which are groups of independent if-then rules. They evaluated interpretability through a user study where participants looked at the decision-set rules and answered a set of questions to measure their understanding of the model. The authors reported that both accuracy and average time spent understanding the decisions improved with their interpretable decision sets compared to a baseline with Bayesian decision lists.
3 Human Evaluation Approaches
We discuss two main classes of approaches for human evaluation of interpretability, differing in whether users have prior knowledge of or access to sample explanations. In one approach, users review existing explanations and provide specific feedback on those explanations. The other option is to capture users' thoughts and opinions about the most relevant features based on the input and output, without review of example explanations.
The explanations could take any form, such as verbal or local explanations, and apply to any data, such as image, text, or tabular data. The following subsections further describe each type of human-grounded evaluation of local explanations for machine learning.
3.1 Evaluating with Explanation Review and Feedback
For the purposes of evaluating existing known explanations, it is possible to collect user feedback about the quality of an explanation given the original input and the resulting output. For example, users could review several options and choose the best machine-generated explanation for a straightforward comparison.
User decisions are made with knowledge of the input, the explanations, and the output label. We would expect users to generally pick explanations that most closely match their logic and background knowledge. One advantage of this method is the ability to clearly compare multiple interpretable machine learning algorithms. Another means of capturing user feedback would be letting a user interactively refine machine-generated explanations. This method has more flexibility in allowing rejection of wrong features and addition of new features to the explanations. Quantifying the difference between an initial given explanation and a user-edited explanation could give a clear measure of quality for the initial machine-generated explanations. The disadvantage of this method is that human review is always a comparison relative to an existing explanation, which means (1) some form of explanation must already exist, (2) the evaluation is specific to the particular explanations reviewed, and (3) reviewing the existing explanation might bias a user's perception.
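One way to quantify the difference between an initial machine-generated explanation and its user-edited version is to treat both as feature sets and score their overlap. The sketch below uses Jaccard similarity as an illustrative choice; the measure, names, and example features are assumptions, not prescribed by this paper.

```python
# Illustrative sketch: score agreement between a machine-generated
# explanation and a user-edited one, each represented as a set of features.
def explanation_agreement(initial_features, edited_features):
    """Jaccard similarity: 1.0 if the user kept the explanation unchanged,
    lower as they reject original features or add new ones."""
    a, b = set(initial_features), set(edited_features)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical example: the user rejected "grass" and added "tail".
score = explanation_agreement({"whiskers", "ears", "grass"},
                              {"whiskers", "ears", "tail"})
```

A lower score flags explanations that users had to edit heavily, which is exactly the signal this feedback method aims to capture.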
3.2 Evaluating with Input Review and Feedforward
Another option for human evaluation is to collect feedback about the features that would best contribute to an explanation for a given output, where users provide this information without seeing example explanations. For example, explanations could be obtained by presenting the user with the input and output label and then asking them to find the relevant features corresponding to the label. If the data is a text article about a "computer science" topic, for instance, the user would find and annotate words and phrases related to the topic. User choices are made with knowledge of the input along with the output label. Increasing the number of users captures a wider spectrum of user explanations for each input. In this method, explanations are features weighted by multiple user opinions. Figure 3 shows examples of text and image heatmaps generated by this approach for our benchmark.
This can be thought of as a feedforward approach, as the information from reviewers would be independent of any particular explanation. Consequently, this approach can result in a reusable benchmark that can apply to various explanations.
4 Evaluation Benchmark
Because our goal was a benchmark that could be used for evaluation of known inputs and classifications, we captured explanations in a feedforward approach where users were asked to annotate the regions in images and words in text articles most related to the topic or subject. The preliminary deployment of this benchmark consists of a subset of 100 sample images and text articles from the well-known ImageNet  and 20 Newsgroups  data sets. The initial version of this benchmark is available online at https://github.com/SinaMohseni/ML-Interpretability-Evaluation-Benchmark for research purposes.
4.1 Annotated Image Examples
All image samples were collected from the ImageNet data set across 20 general categories (example categories include animals, plants, humans, indoor objects, and outdoor objects). Our preliminary benchmark includes 5 images per category for a total of 100 images. In a review-board-approved user study, 10 participants viewed images on a tablet and used a stylus to annotate key regions of each image. We asked them to draw a contour around the area of the image most important to recognizing the object, that is, the portion whose removal would make the object unrecognizable. None of the participants were experts in any of the image categories. Each participant annotated all images in a random order.
All user annotations are accumulated to create a weighted explanation mask (see Figure 4a) over the image. Figure 3a shows heatmap views of user-annotated explanations over two sample images, where "hot" colors (red) show more commonly highlighted regions and "cooler" colors (blue) show areas that were highlighted less frequently. We also masked all user annotations with exact contour shapes to reduce the impact of user imprecision or hand jitter.
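The accumulation step described above can be sketched as follows: each participant's contour is rasterized to a binary mask, and the weighted explanation mask is the per-pixel fraction of participants who highlighted that pixel. The array shapes and names here are illustrative assumptions, not the benchmark's actual implementation.

```python
import numpy as np

# Sketch: combine multiple users' binary annotation masks into one
# weighted explanation mask in [0, 1].
def weighted_mask(annotations):
    """annotations: list of HxW binary arrays, one per participant.
    Returns an HxW map where each pixel is the fraction of users
    who highlighted it ("hot" = near 1, "cool" = near 0)."""
    stack = np.stack([a.astype(float) for a in annotations])
    return stack.mean(axis=0)

# Toy 2x2 example with two participants.
user_a = np.array([[1, 1], [0, 0]])
user_b = np.array([[1, 0], [0, 0]])
heat = weighted_mask([user_a, user_b])  # pixel (0,0) agreed on by both users
```

Rendering `heat` with a red-to-blue colormap yields exactly the kind of heatmap view shown in Figure 3a.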
4.2 Annotated Text Examples
All text articles were collected from two categories (medical, or sci.med, and electronics, or sci.elect) of the 20 Newsgroups data set. For each category, expert reviewers highlighted the words most relevant to the given topic name (i.e., medical or electronics). Reviewers were instructed to highlight words whose removal would make the main topic of the article unrecognizable. Two electrical engineers and two physicians volunteered as experts to annotate 100 documents from each topic. Figure 3(b) shows a single-tone heatmap view of user-annotated explanations over a partial sample text article.
4.3 Use Case
To demonstrate the utility of our benchmark, we present a use case in evaluating local explanations from the well-known LIME explainer . Similar to the previously published research on LIME , we used the pre-trained Google Inception v3 model  for image classification.
Next, we compared the machine-generated explanations with our evaluation benchmark. The comparison is done pixel-wise for each image sample: we compared our weighted masks (see Figure 4a) to the LIME results for all 100 images in our benchmark set. We calculated true positive, false positive, and false negative pixels with bit-wise operations, and precision and recall for the set were 0.39 and 0.58, respectively. The low precision indicates that the LIME algorithm highlighted extraneous, irrelevant regions of the images in its explanations. Figure 4b shows an example of an image explanation from the LIME algorithm where two of the red highlighted patches show regions that do not correspond to the cat in the image. Using this evaluation method, we would hope to see algorithms produce local explanations with closer alignment to user annotations.
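The pixel-wise comparison above can be sketched as follows: both the benchmark's weighted mask and the explainer's mask are binarized, TP/FP/FN counts come from bit-wise operations on the boolean arrays, and precision and recall follow from those counts. The threshold and names are illustrative assumptions; the paper does not specify its binarization details.

```python
import numpy as np

# Sketch of the pixel-wise precision/recall computation between a
# human-annotated benchmark mask and an explainer's output mask.
def mask_precision_recall(benchmark_mask, explainer_mask, threshold=0.5):
    human = benchmark_mask >= threshold      # pixels users highlighted
    machine = explainer_mask >= threshold    # pixels the explainer highlighted
    tp = np.sum(human & machine)             # agreed-upon pixels
    fp = np.sum(~human & machine)            # explainer-only pixels
    fn = np.sum(human & ~machine)            # missed human-highlighted pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy 2x2 example: the explainer catches one human pixel, misses one,
# and highlights one irrelevant pixel.
bench = np.array([[1.0, 1.0], [0.0, 0.0]])
lime_out = np.array([[1.0, 0.0], [1.0, 0.0]])
p, r = mask_precision_recall(bench, lime_out)
```

Low precision under this measure corresponds directly to the extraneous highlighted regions discussed above, while low recall would indicate the explainer missing regions users considered essential.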
This research is based on work supported by the DARPA XAI program under Grant #N66001-17-2-4031.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
-  Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. (2017).
-  Minsuk Kahng, Pierre Y Andrews, Aditya Kalro, and Duen Horng Polo Chau. 2018. ActiVis: Visual Exploration of Industry-Scale Deep Neural Network Models. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2018), 88–97.
-  Josua Krause, Aritra Dasgupta, Jordan Swartz, Yindalon Aphinyanaphongs, and Enrico Bertini. 2017. A Workflow for Visual Diagnostics of Binary Classifiers using Instance-Level Explanations. arXiv preprint arXiv:1705.01968 (2017).
-  Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. 2016. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1675–1684.
-  Ken Lang. 1995. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning. 331–339.
-  Zachary C Lipton. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490 (2016).
-  Yin Lou, Rich Caruana, and Johannes Gehrke. 2012. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 150–158.
-  Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
-  Donghao Ren, Saleema Amershi, Bongshin Lee, Jina Suh, and Jason D Williams. 2017. Squares: Supporting interactive performance analysis for multiclass classifiers. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 61–70.
-  Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1135–1144.
-  Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. 2017. Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. arXiv preprint arXiv:1703.03717 (2017).
-  Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).
-  Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.
-  Fulton Wang and Cynthia Rudin. 2015. Falling rule lists. In Artificial Intelligence and Statistics. 1013–1022.