AI systems have matured and are poised to become an integral part of real-world applications that span our entire society. The performance of such AI systems is mostly validated in terms of accuracy against a labeled ground-truth dataset. Even if this is often appropriate, it poses the challenge that such validation frameworks cannot be transferred directly to AI systems that provide solutions consisting of a prediction and an explanation, or that exceed human performance. How to validate explainability methods is vividly discussed and investigated, and has led to diverse frameworks. For instance, the concepts of the meta-predictor (Fel et al., 2021) and simulatability (Doshi-Velez & Kim, 2017) are only proxies that cannot measure an AI system's performance in comparison to a human expert.
We describe a generic framework to assess AI systems in a blind experiment in which three domain experts interact in a collaborative environment. One is a human lead expert, who picks the tasks to be solved and accepts or rejects the provided solutions. Each task is solved by a second domain expert, either a human or an AI system, and the lead expert knows neither who solved the task nor that an AI system is involved at all. Our framework assesses the performance of the AI system compared to the human expert by estimating the chances that the lead expert accepts a solution provided by either the human or the AI system.
Consider, for example, the assessment of medical laboratories: a leading laboratory (perhaps commissioned by some authority) sends test specimens (the tasks) anonymously to another laboratory. After analyzing the specimens, that laboratory returns the results (solutions). The leading laboratory evaluates the results, knowing the sent specimens, and reports the acceptance rate of the assessment. What the leading laboratory does not know is that the specimens are analyzed either by a human expert or fully automatically by a machine, so that the acceptance rate refers to either the human or the system. By comparing the acceptance rate of the human with that of the system, the quality of the system is assessed. This setup allows an unbiased validation of whether or not it is acceptable to have a machine perform the analysis in place of a human.
In the following, we will describe the proposed assessment framework in detail. Next, to demonstrate the generalizability of this framework, we show how the ordinary measure of classification accuracy emerges from a specific instantiation of the framework and allows us to measure label uncertainty. Additionally, we describe an instantiation to assess the usefulness of AI explainability methods by designing a setup where the lead expert requires an explanation to make a proper assessment in a short amount of time; this setup thus measures the usability of an explanation method.
The outline of the paper is as follows: The next section defines and discusses the assessment framework and introduces two instantiations as examples. Then, we discuss related work and finish with a conclusion and an outlook.
2 Assessment Framework
The proposed assessment framework can measure how well an AI system performs a task compared to a human expert. First, we give a formal definition followed by a discussion. Second, we outline two instantiations of the framework.
2.1 Formal Definition
Consider the situation in Figure 1. Our assessment framework consists of three domain experts (or groups of experts): a Lead expert (L), an Expert (E), and an AI System (S). The lead expert L assigns a task via a well-defined communication channel to one of the colleagues (either E or S). The assignment is made at random, and L knows neither who will solve the task nor that different solvers are involved. After the assigned colleague solves the task, the solution is returned to L via a well-defined communication channel. Then, L decides whether to accept or reject the solution based on specified approval guidelines. (Acceptance means conformity with the approval guidelines; a rejection therefore does not imply that the individual parts of a solution, for instance the prediction and the explanation, are incorrect.) Thus, L assesses (evaluates) the solution for the given task, which does not necessarily imply that L has to solve the task again. To compute the acceptance rate, the decision whether the solution is accepted is mapped to the colleague who solved the task (the solver does not learn whether their solution was accepted). By repeating the test for several tasks of the domain, we can estimate the acceptance rates for E and S.
For a system S, an expert E, and a lead expert L, the assessment consists of determining the empirical probabilities that solutions for tasks $t \in T$ that are randomly drawn by the lead expert L and are randomly solved by the system S or the expert E are accepted by the lead expert L:

$p_S = \hat{P}(L \text{ accepts} \mid S \text{ solves},\, t \in T_S), \qquad p_E = \hat{P}(L \text{ accepts} \mid E \text{ solves},\, t \in T_E),$

where the individual task sets $T_S$ and $T_E$ are subsets of the task set $T$, and $p_S$ is the empirical probability that a solution provided by the system S will be accepted by the lead expert L (with the analogous interpretation for $p_E$).
The following outcomes are possible: (1) the AI system performs worse than the expert if $p_S < p_E$; (2) the AI system behaves like the expert if $p_S = p_E$; (3) the AI system exhibits superhuman abilities if $p_S > p_E$.
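As a minimal sketch of how these acceptance rates could be estimated in practice, consider the following simulation of the blind routing protocol. All function names and the record layout are our own hypothetical illustration; the `evaluate` callback stands in for L's approval guidelines:

```python
import random

def run_assessment(tasks, solve_expert, solve_system, evaluate, seed=0):
    """Simulate one assessment round: each task is routed at random to the
    human expert E or the AI system S; the lead expert L sees only the
    task-solution pair, never the identity of the solver."""
    rng = random.Random(seed)
    counts = {"E": [0, 0], "S": [0, 0]}  # [accepted, total] per solver
    for task in tasks:
        solver = rng.choice(["E", "S"])
        solution = solve_expert(task) if solver == "E" else solve_system(task)
        accepted = evaluate(task, solution)  # L's approval guidelines
        counts[solver][0] += int(accepted)
        counts[solver][1] += 1
    # empirical acceptance rates p_E and p_S
    return {k: a / t if t else 0.0 for k, (a, t) in counts.items()}
```

Comparing the returned rate for "S" against the rate for "E" then yields one of the three outcomes above.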
Note that the assessment of a medical lab mentioned in Section 1 maps directly to this definition. Moreover, the framework is unbiased and human-centric: unbiased in the sense that the lead expert does not know that an AI is involved and thus evaluates solutions from a human-centric perspective; human-centric also because involving both a human and an AI for task solving requires defining how to solve a task and how to communicate with L, which makes the task description and solution communication suitable for humans. For explainable AI, this requirement immediately disqualifies explanation methods that produce explanations unsuited for human interpretation. Therefore, with a common acceptance of our framework, future explainable AI research can consider how human-centric an explanation method is during its early conception. This is desirable, as explanations are generated for the sole purpose of being useful to humans. Finally, because the framework always provides a human baseline performance through E, it can quantify superhuman performance.
2.2 Assumptions, Remarks, and Discussion
Domain, language, tasks, and solutions:
The test is fixed to a certain domain with experts, and the communication is limited to understanding tasks and solutions. This requires the task and solution languages to be well-defined so that all three parties can understand tasks and solutions, and so that E and S can formulate solutions in an unimpeded manner. That is, E and S can communicate with L using the same languages, and L cannot determine which party is providing a solution based on the language used. At the same time, this ensures that a human can understand the explanations produced by S.
Additionally, for each domain, the task must be well-defined so that the criteria for its completion are unambiguous. In other words, it is obvious what has to be done. For example, in object recognition, annotation guidelines clearly specify what an object is, how to annotate it, and, thus, what solutions are expected. Task definition becomes especially important in the context of explainable AI, when the solvers have to return an explanation alongside the prediction, because it requires defining the expected explanation (e.g., what should be highlighted by a saliency map). Moreover, these definitions set the rules for how E should solve a task, which controls human subjectiveness. Finally, note that the solution language might contain a word for “no solution derived” to ensure that a solution is always returned, even if the AI system encounters errors or the expert cannot provide a solution.
The test requires that the lead expert is interested in assessing the solvers by evaluating the solutions following the approval guidelines. If this is not the case, the lead expert could accept any solution, which would lead to the logical consequence that S and E perform equally well because no domain-specific task-solving abilities are required to provide acceptable solutions.
Importantly, it is not required that L can solve tasks (in contrast to E and S). However, L must be able to evaluate task-solution pairs, even if doing so is time-consuming; otherwise, the assessment (or validation) of any system is impossible. Consider AlphaFold (Jumper et al., 2021): protein structures predicted by the model must be evaluated by experiments to confirm their correctness. Though time-consuming, such evaluation is possible and was used to validate the model's outstanding performance.
The approval guidelines are of utmost importance for the evaluation of solutions. Similar to the precise task description (which is related to annotation guidelines), the approval guidelines must specify as precisely as possible how a solution must be evaluated. Every undefined aspect will be impacted by the subjectiveness of the domain lead expert, which can lead to intended or unintended biased evaluations. (Tasks with known solutions can be injected into the assessment framework to check L's compliance with the approval guidelines and E's compliance with the task-solving rules.)
2.3 Assessment of Classification Accuracy
This example instantiation shows the generalizability of the framework: it can measure the classification accuracy (with label uncertainty) of an AI system S on a given test set $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is an input annotated with the class label $y_i$. In the context of the framework, the inputs $x_i$ represent the task set $T$, and the possible class labels form the solution set, so that the framework assesses the class labels provided for the inputs. Additionally, since each $x_i$ was annotated by a human expert, it is reasonable to assume that the corresponding label represents the solution of the expert E: $E(x_i) = y_i$. Now, we can define the classification accuracy of a system S with respect to the lead expert L by

$\mathrm{acc}_L(S) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[\, L \text{ accepts } S(x_i) \text{ for task } x_i \,\right].$
If we further assume that the lead expert L accepts the solution $S(x_i)$ for a task $x_i$ if and only if $S(x_i) = y_i$, then the probability of accepting solutions provided by the expert E becomes $p_E = 1$, and the classification accuracy with respect to the lead expert becomes the canonical classification accuracy used to assess the performance of a system S:

$\mathrm{acc}(S) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[\, S(x_i) = y_i \,\right].$
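Under this assumption (L accepts exactly when the prediction matches the annotated label), the acceptance rate of S collapses to canonical accuracy. A small sketch with hypothetical names illustrates the equivalence:

```python
def acceptance_accuracy(predict, dataset):
    """Acceptance rate of system S when the lead expert accepts a solution
    iff it equals the expert label y_i; this equals canonical accuracy."""
    accepted = sum(int(predict(x) == y) for x, y in dataset)
    return accepted / len(dataset)
```

Since the expert E returns the annotated label itself, every solution by E is accepted under this criterion, so $p_E = 1$ by construction.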
If the lead expert's acceptance criterion were not agreement with the test label but a genuine acceptance evaluation by a human expert, then labels on which human experts disagree would naturally be identified, so that label uncertainty can be assessed.
2.4 Assessment of the Usefulness of Image Classification Explanations
Several researchers have investigated the usefulness of explanations in different experimental settings (see Section 3). To validate whether explanations are useful and help users assess the correctness of a prediction, we propose an experiment based on the assessment framework in which the lead experts have a slight color vision deficit, so that they need explanations to assess, in a short amount of time, the predictions for colorblind images derived from MNIST (LeCun et al., 1998); see Figure 2. Here, the controlled independent variable is whether an explanation is presented. The dependent variable is the acceptance rate for a given amount of approval time. We determine the usefulness of human-understandable explanations by computing the change in the acceptance rate between the assessments with and without an explanation. This experiment is a suitable instantiation of the framework, as it only requires that the experts know the Arabic numerals, and it aptly uses the color perception abilities of humans to assess the usefulness of explainability methods with a reduced experimental bias.
In this instantiation, the AI system S is a neural network with an explainer (e.g., an occlusion map; Zeiler & Fergus, 2014) that classifies the MNIST colorblind images. Like S, the expert E has to provide a prediction and an explanation that highlights where in the image the numeral can be found. To fulfill this task, E must have normal color vision. In contrast, the lead expert L must have a slight color deficit such that it is difficult for L to see the numeral in a short amount of time: Ishihara (1972) specified that humans with normal color vision must see the numeral within 3 s, whereas humans with a slight color deficit need long exposures to see it. The approval criterion is that L must only accept a solution if L can see the predicted numeral in the input, which L can evaluate because L is chosen to have only a slight color deficit.
In the first run of the experiment, solutions are presented without explanations. Because L has a color deficit, the acceptance rates for a short decision time will be low for both E and S. (Given a decision time, each accepted solution for which the decision took longer is internally counted as a rejection.) In the second run, each solution includes an explanation. If the explanation is human-understandable, it will help L see the numeral, so that the acceptance rates for a short decision time will increase. Therefore, the usefulness of an explanation can be assessed by computing the difference between the runs with and without explanations for a short decision time, because without explanations, L needs a longer time to evaluate task-solution pairs (L cannot circumvent the need for explanations to achieve short decision times, since L needs long exposures to solve the tasks). Moreover, by comparing the acceptance rates, the explanation quality of S can be compared with that of E, and, by repeating the experiment with different explanation methods, the quality of the explanation methods can be quantified.
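The bookkeeping for the two runs can be sketched as follows. The record format (accepted flag, decision time in seconds) and the function names are our own illustration, not part of the framework definition; the timing and acceptance data would come from the actual study:

```python
def acceptance_rate(records, budget):
    """records: (accepted, decision_seconds) per task-solution pair.
    An acceptance whose decision exceeded the time budget counts as a
    rejection, mirroring the internal counting rule of the experiment."""
    ok = sum(1 for accepted, secs in records if accepted and secs <= budget)
    return ok / len(records)

def explanation_usefulness(no_expl, with_expl, budget):
    """Difference in acceptance rate between the run with explanations and
    the run without, at a fixed decision-time budget. Positive values
    indicate that the explanations helped the lead expert decide faster."""
    return acceptance_rate(with_expl, budget) - acceptance_rate(no_expl, budget)
```

Running `explanation_usefulness` once per explanation method, on otherwise identical task sets, would give the comparison across methods described above.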
3 Related Work
The assessment framework we propose builds on the idea of the Feigenbaum test (Feigenbaum, 2003), which is a refinement of the Turing test (Turing, 1950), where the test is set up as a game that is played between experts of a particular (narrow) domain. In this game, a judging domain expert poses, for instance, problems, questions, or theories, which are passed on via two channels to either a computer or another domain expert. The judging domain expert does not know which channel connects to the computer. Depending on the channel, either the computer or the other domain expert replies with an answer. The test asks the following question: by evaluating the received answer, can the domain expert determine which channel connects to the computer? Similar to the Turing test, the Feigenbaum test is a behavioral test that tries to “test the facet of quality of reasoning” (Feigenbaum, 2003, p. 36). For a computer program to pass the test, it must be able to simulate human intelligent behavior, which is why the test is sometimes inappropriately taken as a test of human intelligence. We follow the idea of performing an experiment between experts of a certain domain but modify it by proposing a framework where the chances of accepting a solution (answer) from the machine and the human expert are measured. Consequently, the proposed framework is not a test that can be passed, but rather an assessment of solutions for domain-specific tasks so that a computer’s performance can be quantified in comparison with human performance.
To quantify whether or not explanations are human-like, Biessmann and Treu (2021) created a Turing test for transparency to evaluate whether humans can identify who generated an explanation (an AI or a human). Since they draw inspiration from the Turing test, this concept is similar to our framework. However, our goal is to assess any performance of an AI system in comparison with a human expert, not only how human-like explanations are. Furthermore, their framework requires the interrogator to be informed about the presence of an AI and a human, so that the interrogator may be biased against the AI (Dietvorst et al., 2015). Our proposal avoids this potential bias.
Other concepts for evaluating explanations are simulatability (Doshi-Velez & Kim, 2017: given the input and the corresponding explanation, the model output has to be predicted) and the meta-predictor (Fel et al., 2021: after a training phase, humans have to predict the model output only by seeing the input). Hase and Bansal (2020) performed controlled experiments to measure simulatability, which are conceptually similar to the work of Fel et al. (2021). Based on the results of both experiments, the authors concluded that some explainability methods help users. Similar to our proposed framework, both concepts require two trials (with and without explanation) to measure the usefulness of an explainability method. However, with both concepts it is not possible to analyze whether a model is judged to be bad due to superhuman model capabilities, since the concepts are limited by the mental abilities of the human subjects who have to simulate the model behavior.
Alufaisan et al. (2021) also performed an experiment to evaluate whether explanations help users make predictions and concluded that explanations do not positively impact the prediction accuracy of humans. However, this result could be affected by uncontrolled confounders, such as asking the users for a prediction and giving them the freedom to ignore the AI outputs; both are resolved in our framework.
4 Conclusion and Outlook
The growing field of explainable AI still has no unified evaluation framework for explainability methods. Based on the contributions of several other frameworks and their experiments, we proposed an assessment framework that combines several of these approaches and addresses their weaknesses. Notably, the proposed framework is human-centric and able to identify models with superhuman performance because it always compares the AI performance with a human baseline performance. To demonstrate the generalizability of the framework, we have described two instantiations: the first measures classification accuracy, and the second measures the usefulness of human-understandable explanations. The next step will be to implement the second experiment.
- Alufaisan et al. (2021) Alufaisan, Y., Marusich, L. R., Bakdash, J. Z., Zhou, Y., and Kantarcioglu, M. Does explainable artificial intelligence improve human decision-making? Proceedings of the AAAI Conference on Artificial Intelligence – AAAI 2021, 35(8):6618–6626, 2021.
- Biessmann & Treu (2021) Biessmann, F. and Treu, V. A Turing test for transparency. arXiv preprint arXiv:2106.11394, 2021.
- Dietvorst et al. (2015) Dietvorst, B. J., Simmons, J. P., and Massey, C. Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1):114–126, 2015. doi: 10.1037/xge0000033.
- Doshi-Velez & Kim (2017) Doshi-Velez, F. and Kim, B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
- Feigenbaum (2003) Feigenbaum, E. A. Some challenges and grand challenges for computational intelligence. Journal of the ACM, 50(1):32–40, 2003. doi: 10.1145/602382.602400.
- Fel et al. (2021) Fel, T., Colin, J., Cadene, R., and Serre, T. What I cannot predict, I do not understand: A human-centered evaluation framework for explainability methods. arXiv preprint arXiv:2112.04417, 2021.
- Hase & Bansal (2020) Hase, P. and Bansal, M. Evaluating explainable AI: Which algorithmic explanations help users predict model behavior? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5540–5552, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.491.
- Ishihara (1972) Ishihara, S. Test for colour-blindness. Kanehara Shuppan, 1972.
- Jumper et al. (2021) Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P., and Hassabis, D. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, jul 2021. doi: 10.1038/s41586-021-03819-2.
- LeCun et al. (1998) LeCun, Y., Cortes, C., and Burges, C. J. The MNIST database of handwritten digits. 1998. http://yann.lecun.com/exdb/mnist/.
- Turing (1950) Turing, A. M. Computing machinery and intelligence. Mind, LIX(236):433–460, 1950. doi: 10.1093/mind/lix.236.433.
- Zeiler & Fergus (2014) Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In Fleet, D. J., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Proceedings of the 13th European Conference on Computer Vision – ECCV 2014, volume 8689 of Lecture Notes in Computer Science, pp. 818–833, Zurich, Switzerland, 2014. Springer, Cham. doi: 10.1007/978-3-319-10590-1_53.