Log In Sign Up

Leveraging Explanations in Interactive Machine Learning: An Overview

Explanations have gained an increasing level of interest in the AI and Machine Learning (ML) communities in order to improve model transparency and allow users to form a mental model of a trained ML model. However, explanations can go beyond this one way communication as a mechanism to elicit user control, because once users understand, they can then provide feedback. The goal of this paper is to present an overview of research where explanations are combined with interactive capabilities as a mean to learn new models from scratch and to edit and debug existing ones. To this end, we draw a conceptual map of the state-of-the-art, grouping relevant approaches based on their intended purpose and on how they structure the interaction, highlighting similarities and differences between them. We also discuss open research issues and outline possible directions forward, with the hope of spurring further research on this blooming research topic.


page 1

page 2

page 3

page 4


Counterfactual Explanations for Machine Learning: Challenges Revisited

Counterfactual explanations (CFEs) are an emerging technique under the u...

On Validating, Repairing and Refining Heuristic ML Explanations

Recent years have witnessed a fast-growing interest in computing explana...

The Explanatory Gap in Algorithmic News Curation

Considering the large amount of available content, social media platform...

A Typology to Explore and Guide Explanatory Interactive Machine Learning

Recently, more and more eXplanatory Interactive machine Learning (XIL) m...

User Driven Model Adjustment via Boolean Rule Explanations

AI solutions are heavily dependant on the quality and accuracy of the in...

Intuitively Assessing ML Model Reliability through Example-Based Explanations and Editing Model Inputs

Interpretability methods aim to help users build trust in and understand...

1 Introduction

The fields of eXplainable Artificial Intelligence (XAI) and Interactive Machine Learning (IML) have traditionally been explored separately. On the one hand, XAI aims at making AI and Machine Learning (ML) systems more transparent and understandable, chiefly by equipping them with algorithms for explaining their own decisions 

[guidotti2018survey, ras2022explainable]. Such explanations are instrumental for enabling stakeholders to inspect the system’s knowledge and reasoning patterns, however stakeholders only participate as passive observers and have no control over the system or its behavior. On the other hand, IML focuses primarily on communication between machines and humans, and it is specifically concerned with eliciting and incorporating human feedback into the training process via intelligent user interfaces [fails2003interactive, amershi2014power, michael2020interactive, WARE2001281, HE20169, WinNT]. IML covers a broad range of techniques for in-the-loop interaction between humans and machines, however, most research does not explicitly consider explanations.

Recently, a number of works have sought integrating techniques from XAI within the IML loop. The core observation behind this line of research is that, interacting through explanations is an elegant and human-centric solution to the problem of acquiring rich human feedback, and therefore leads to higher-quality AI and ML systems, in a manner that is effective and transparent for both users and machines. In order to accomplish this vision, these works leverage either machine explanations obtained using techniques from XAI, human explanations provided as feedback by sufficiently expert annotators, or both, to define and implement a suitable interaction protocol.

These two types of explanations play different roles. By observing the machine’s explanations, users have the opportunity of building a better understanding of the machine’s overall logic, which not only facilitates trust calibration [amershi2014power], but also supports and amplifies our natural capacity of providing appropriate feedback [kulesza2015principles]. Machine explanations are also key for identifying imperfections and bugs affecting ML models, such as reliance on confounded features that are not causally related with the desired outcome [lapuschkin2019unmasking, geirhos2020shortcut, schramowski2020making]. At the same time, human explanations are a very rich source of supervision for models [camburu2018snli] and are also very natural for us to provide. In fact, explanations tap directly into our innate learning and teaching abilities [lombrozo2006structure, mac2018teaching] and are often spontaneously provided by human annotators when given the chance [stumpf2007toward]. Machine explanations and human feedback can also be combined to build interactive model editing and debugging facilities, because – once aware of limitations or bugs in the model – users can indicate effective improvements and supply corrective feedback [kulesza2015principles, teso2019explanatory].

The goal of this paper is to provide a general overview and perspective of research on leveraging explanations in IML. Our contribution is two-fold. First, we survey a number of relevant approaches, grouping them based on their intended purpose and how they structure the interaction. Our second contribution is a discussion of open research issues, on both the algorithmic and human sides of the problem, and possible directions forward. This paper is not meant as an exhaustive commentary of all related work on explainability or on interactivity. Rather, we aim to offer a conceptual guide for practitioners to understand core principles and key approaches in this flourishing research area. In contrast to existing overviews on using explanations to guide learning algorithms [hase2021can], debug ML models [lertvittayakumjorn2021explanation], and transparently recommend products [zhang2020explainable], we specifically focus on human-in-the-loop scenarios and consider a broader range of applications, highlighting the variety of mechanisms and protocols for producing and eliciting explanations in IML.


This paper is structured as follows. In Section 2 we discuss a general recipe for integrating explanations into interactive ML, recall core principles for designing successful interaction strategies, and introduce a classification of existing approaches. Then we discuss key approaches in more detail, organizing them based on their intended purpose and the kind of machine explanations they leverage. Specifically, we survey methods for debugging ML models using saliency maps and other local explanations in Section 3, for editing models using global explanations (e.g., rules) in Section 4, and for learning and debugging ML models with concept-level explanations in Section 5. Finally, we outline remaining open problems in Section 6 and related topics in Section 7.

2 Explanations in Interactive Machine Learning

Job: ConsultantSalary: kAge:


Input Attribution Example Attribution Rule Concept Attribution
Importance of Similarity to Approve if Importance of
   Job:    Job: Consultant    (Salary > k AND    High Earner:
   Salary:    Salary: k      Age > ) OR    Professional:
   Age:    Age:    (Age > AND    Sporty:
   Approve      Job Consultant)    
Figure 1: Illustration of how various kinds of explanations support human understanding in the context of loan requests. From left to right: input attributions and example attributions (discussed in Section 3), rules (discussed in Section 4), and concept attributions (discussed in Section 5).

Interactive Machine Learning (IML) stands for the design and the implementation of algorithms and user interfaces in order to engage users to actually build ML models. This stands in contrast to standard procedures, in which building a model is a fully automated process and domain experts have little control beyond data preparation [WARE2001281]. Research in IML explores ways to learn and manipulate models through an intuitive human-computer interface [michael2020interactive]

and encompasses a variety of learning and interaction strategies. Perhaps the most well-known IML framework is active learning 

[settles2012active, herde2021survey], which tackles learning high-performance predictors in settings in which supervision is expensive. To this end, active learning algorithms interleave acquiring labels of carefully selected unlabeled instances from an annotator and model updates. As another brief example, consider a recommender solution in a domain like movies, videos, or music, where the user can provide explicit feedback through rating the recommended items [HE20169]. These ratings can then be infused to the recommendation model’s decision making process so as to tailor the recommendations towards end users interests.

In IML, users are encouraged to shape the decision making process of the model, so it is important for users to build a correct mental model of the “intelligent agent”, which will then allow them to seamlessly interact with it. Conversely, since user feedback and involvement are so central, uninformed feedback risks wasting annotation effort and ultimately compromising model quality. Not all approaches to IML are equally concerned with user understanding. For instance, in active learning the machine is essentially a black-box, with no information being disclosed about what knowledge it has acquired and what effect feedback has on it [teso2019explanatory]. Strategies for interactive customization of ML models, like the one proposed by fails2003interactive, are less opaque, in that users can explore the impact of their changes and tune their feedback accordingly. Yet, the model’s logic can only be (approximately) reconstructed from changes in behavior, making it hard to anticipate what information should be provided to guide the model in a desirable direction [kulesza2015principles]. Following kulesza2015principles, we argue that proper interaction requires transparency and an understanding of the underlying model’s logic. And it is exactly here that explanations can be used to facilitate this process.

2.1 A General Approach for Leveraging Explanations in Interaction

In order to appreciate the potential roles played by explanations, it is instructive to look at explanatory debugging [kulesza2010explanatory, kulesza2015principles], the first framework to explicitly leverage them in IML, and specifically at EluciDebug [kulesza2015principles], its proof-of-concept implementation.

EluciDebug is tailored for interactive customization of Naïve Bayes classifiers in the context of email categorization. To this end, it presents users with explanations that illustrate the relative contributions of the model’s prior and likelihood towards the class probabilities. In particular, its explanations convey what

words the model uses to distinguish work emails from personal emails and how much they impact its decisions. In a second step, the user has the option of increasing or decreasing the relevance of certain input variables toward the choice of certain classes by directly adjusting the model weights. Continuing with our email example, the user is free to specify relevant words that the system is currently ignoring and irrelevant words that the system is wrongly relying on. This very precise form of feedback contrasts with traditional label-based strategies, in which the user might, e.g., tell the system that a message from her colleague about baseball is a personal (rather than work-related) communication, but has no direct control over what inputs the system decides to use. The responsibility of choosing what examples (e.g., wrongly classified emails) to provide feedback on is left to the user, and the interaction continues until she is happy with the system’s behavior. EluciDebug was shown to help users to better understand the system’s logic and to more quickly tailor it toward their needs [kulesza2015principles].

EluciDebug highlights how explanations contribute to both understanding and control, two key elements that will reoccur in all approaches we survey. We briefly unpack them in the following.


By observing the machine’s explanations, users get the opportunity of building a better understanding of the machine’s overall logic. This is instrumental in uncovering limitations and flaws in the model [VILONE202189]. As a brief example, consider a recommender solution in a domain like movies, videos, or music, where the users are presented with explanations in the form of a list of features that are found to be most relevant to the users’ previous choices [Tintarev2007]. Upon observing this information, users can see the assumptions the underlying recommender has made for their interests and preferences. It might be the case that the model made incorrect assumptions for the users’ preferences possibly due to some changes of interests which is not explicitly available in the data. In such a scenario, explanations provide a perfect ground for understanding the underlying model’s behavior.

Explanations are also instrumental for identifying models that rely on confounds in the training data, such as watermarks in images, that happen to correlate with the desired outcome but that are not causal for it [lapuschkin2019unmasking, geirhos2020shortcut]. Despite achieving high accuracy during training, these models generalize poorly to real-world data where the confound is absent. Such buggy behavior can affect high-stakes applications like COVID-19 diagnosis [degrave2021ai] and scientific analysis [schramowski2020making], and cannot be easily identified using standard evaluation procedures without explanations. Ideally, the users would develop a structural mental model that gives them a deep understanding of how the model operates, however a functional understanding is often enough for them to be able to interact [sokol2020one]. The ability of disclosing issues with the model, in turn, facilitates trust calibration [amershi2014power]. This is especially true in interactive settings as here the user can witness how the model evolves over time, another important factor that contributes to trust [waytz2014mind, wang2016trust].


Understanding supports and amplifies our natural capacity of providing appropriate feedback [kulesza2015principles]: once bugs and limitations are identified, interaction with the model enables end-users to modify the algorithm in order to correct those flaws. Bi-directional communication between users and machines together with transparency enables directability, that is, the ability to rapidly assert control or influence when things go astray [hoffman2013trust]. Clearly, control is fundamental if one plans to take actions based on a model’s prediction, or to deploy a new model in the first place [ribeiro2016should]. At the same time, directability also contributes to trust allocation [hoffman2013trust]. The increased level of control can also help to achieve significant gains in the end user’s satisfaction.

Human feedback can come in many forms, and one of these forms is explanations, either from scratch or by using the machine’s explanations as a starting point [teso2019explanatory]. This type of supervision is very informative: a handful of explanations are oftentimes worth many labels, substantially reducing the sample complexity of learning (well-behaved) models [camburu2018snli]. Importantly, it is also very natural for human to provide as explanations lie at the heart of human communication and tap directly into our innate learning and teaching abilities [lombrozo2006structure, mac2018teaching]. In fact, stumpf2007toward showed that when given the chance to provide free-form annotations, users had no difficulty providing generous amounts of feedback.


To ground the exchange of explanations between an end user and a model, kulesza2010explanatory presented a set of key principles around explainability and correctability. Although the principles are discussed in the context of explanatory debugging, they apply to all the approaches that are presented in this paper. These include: () Being iterative, so as to enable end-users to build a reasonable and informed model of the machine’s behavior. () Presenting sound, faithful explanations that do not over-simplify the model’s reasoning process. () Providing as complete a picture of the model as possible, without omitting elements that play an important role in its decision process. () Avoiding to overwhelming the user, as this complicates understanding and feedback construction. () Ensuring that explanations are actionable, making them engaging for users to attend to, thus encouraging understanding while enabling users to adjust them to their expertise. () Making user changes easily reversible. () Always honoring feedback, because when feedback is disregarded users may stop bothering to interact with the system. () Making sure to effectively communicate what effects feedback has on the model. Clearly there is a tension behind these principles, but they nonetheless are useful in guiding the design of explanation-based interaction protocols. As we will discuss in Section 6, although existing approaches attempt to satisfy one or more of these desiderata, no general method yet exists that satisfies all of them.

Goal Explanations Feedback Incorporation Method
Learning Local, CA Adjust Feature Association Update model w/ auxiliary loss lage2020learning
Adjust Encodings Update model w/ auxiliary loss stammer2022interactive
Debugging Local, IA Adjust Parameters Update model w/ improved parameters EluciDebug [kulesza2015principles]
Adjust Attributions Update data CAIPI [teso2019explanatory]
Update model w/ auxiliary loss RRR [schramowski2020making, ross2017right]
Update model w/ auxiliary loss teso2019toward
Local, EA Additional Features Update model w/ additional classifiers ALICE [liang2020alice]
Adjust Attributes Update data Biswas2013
Example Similarity Update data HILDIF [zylberajch2021hildif]
Counter Examples Update data CINCER [teso2021interactive]
Local, CA Adjust Attributions Update model w/ auxiliary loss RRC [stammer2021right]
Update model w/ auxiliary loss bontempelli2021toward
Update model w/ auxiliary loss ProtoPDebug [bontempelli2022concept]
Sample Pairing Update model w/ auxiliary loss Shao et al. [shao2022right]
Global, Rules Adjust Attributions Update model w/ hard constraint FIND [lertvittayakumjorn2020find]
Update model w/ auxiliary loss REMOTE [yao2021refining]
Counter Examples Update data XGL [popordanoska2020machine]
Editing Global, Rules Rule Editing Update data FROTE [alkan2022frote]
Post-processing Overlay [daly2021user]
Post-processing XIML [Lijie2022]
Adjust Feature Association Update model w/ auxiliary loss Antognini2021InteractingWE
Post-processing Alkan2019, IRF_CSCW_2021
Table 1: Table of methods covered in this overview. We here differentiate the various methods in their algorithmic goals (Goal), the type of explanations they consume (Explanations), the type of feedback provided by the user (Feedback), and, finally, the strategy of incorporating the explanatory user feedback (Incorporation). Abbreviations: CA concept attribution, EA example attribution, IA input attribution.

2.2 Dimensions of Explanations in Interactive Machine Learning

Identifying interaction and explainability as two key capabilities of a well performing and trust-worthy ML system, motivates us to layout this overview on leveraging explanations in interactive ML. The methods we survey tackle different applications using a wide variety of strategies. In order to identify common themes and highlight differences, we organize them along four dimensions:

Algorithmic goal: We identify three high-level scenarios. One is that of using explanation-based feedback, optionally accompanied by other forms of supervision, to learn an ML model from scratch. Here, the machine is typically in charge of asking appropriate questions, feedback may be imperfect, and the model is updated incrementally as feedback is received. Another scenario is model editing, in which domain experts are in charge of inspecting the internals of a (partially) trained model (either directly if the model is white-box or indirectly through its explanations) and can manipulate them to improve and expand the model. Here feedback is typically assumed high-quality and used to constrain the model’s behavior. The last scenario is debugging, where the focus is on fixing issues of misbehaving (typically black-box) models and the machine’s explanations are used to both spot bugs and elicit corrective feedback. Naturally, there is some overlap between goals. Still, we opt to keep them separate as they are often tackled using different algorithmic and interaction strategies.

Type of machine explanations: The approaches we survey integrate four kinds of machine explanations: input attributions (IAs), example attributions (EAs), rules, and concept attributions (CAs). IAs identify those input variables that are responsible for a given decision, and as such they are local in nature. EAs and CAs are also local, but justify decisions in terms of relevant training examples and high-level concepts, respectively. At the other end of the spectrum, rules are global explanations in that they aim to summarize, in an interpretable manner, the logic of a whole model. These four types of explanations are illustrated in Fig. 1 and described in more detail in the next sections

Type of human feedback and incorporation strategy: Algorithmic goal and choice of machine explanations act as a determiner for the types of interactions that can happen, in turn affecting two other important dimensions, namely the type of feedback that can be collected and the way the machine can consume this feedback [narayanan2018humans]. Feedback ranges from updated parameter values, as in EluciDebug, to additional data points, to gold standard explanations supplied by domain experts. Incorporation strategies go hand-in-hand, and range from updating the model’s parameters as instructed to (incrementally) retraining the model, perhaps including additional loss terms to incorporate explanatory feedback. All details are given in the following sections.

The methods we survey are listed in Table 1. Notice that the two most critical dimensions, namely algorithmic goal and type of machine explanations, are tightly correlated: learning approaches tend to rely on local explanations and editing approaches on rules, while debugging approaches employ both. For this reason, we chose to structure the next three sections by explanation type. One final remark before proceeding. Some of the approaches we cover rely on choosing specific instances or examples to be presented to the annotator. Among them, some rely on machine-initiated interaction, in the sense that they leave this choice to the machine (for instance, methods grounded on active learning tend to pick specific instances that the model is uncertain about [settles2012active]), while others rely on human-initiated interaction and expect the user to pick instances of interest from a (larger) set of options. We do not group approaches based on this distinction, so as to keep our categorization manageable. The specific type of interaction used will be made clear in the following sections on a per-method basis.

3 Interacting via Local Explanations

In this section, we discuss IML approaches that rely on local explanations to carry out interactive model debugging. Despite sharing some aspects with EluciDebug [kulesza2015principles], these approaches exploit modern XAI techniques to support state-of-the-art ML models and explore alternative interaction protocols. Before reviewing them, we briefly summarize those types of local explanations that they build on.

3.1 Input Attributions and Example Attributions

Given a classifier and a target decision, local explanations identify a subset of “explanatory variables” that are most responsible for the observed outcome. Different types of local explanations differ in what variables they consider and in how they define responsibility.

Input attributions, also known as saliency maps, convey information about relevant vs. irrelevant input variables. For instance, in loan request approval an input attribution might report the relative importance of variables like Job, Salary and Age of the applicant, as illustrated in Fig. 1, and in image tagging that of subsets of pixels, as shown in Fig. 2 (left). A variety of attribution algorithms have been developed. Gradient-based approaches like Input Gradients (IGs) [baehrens2010explain, simonyan14deep], GradCAM [selvaraju2017grad], and Integrated Gradients [sundararajan2017axiomatic] construct a saliency map by measuring how sensitive the score or probability assigned by the classifer to the decision is to perturbations of individual input variables. This information is read off from the gradient of the model’s output with respect to its input, and as such it is only meaningful if the model is differentiable and the inputs are continuous (e.g., images). Sampling-based approaches like LIME [ribeiro2016should] and SHAP [vstrumbelj2014explaining, lundberg2017unified] capture similar information [garreau2020explaining], but they rely on sampling techniques that are applicable also to non-differentiable models and to categorical and tabular data. For instance, LIME distills an interpretable surrogate model using random samples labeled by the black-box predictor and then extracts input relevance information from the surrogate. Despite their wide applicability, these approaches tend to be more computationally expensive than gradient-based alternatives [van2021tractability]

and, due to variance inherent in the sampling step, in some cases their explanations may portray an imprecise picture of the model’s decision making process 

[zhang2019should, teso2019toward]. One last group of approaches focus on identifying the smallest subset of input variables whose value, once fixed, ensures that the model’s output remains the same regardless of the value taken by the remaining variables [shih2018symbolic, wang2020towards, camburu2020struggles], and typically come with faithfulness guarantees. Although principled, these approaches however have not yet been used in explanatory interaction.

Example attributions, on the other hand, explain a target decisions in terms of those training examples that most contributed to it, and are especially natural in settings, like medical diagnosis, in which domain experts are trained to perform case-based reasoning [bien2011prototype, kim2014bayesian, chen2019looks]. For instance, in Fig. 1 a loan application is approved by the machine because it is similar to an previously approved application and in Fig. 2 (right) a mislabeled training image fools the model into mispredicting a husky dog as a wolf. For same classes of models, like nearest neighbor classifiers and prototype-based predictors, example relevance can be easily obtained. For all other models, it can in principle be evaluated by removing the example under examination from the training set and checking how this changes the models’ prediction upon retraining. This naïve solution however scales poorly, especially for larger models for which retraining is time consuming. A more convenient alternative are Influence Functions (IFs), which offer an efficient strategy for approximating example relevance without retraining [koh2017understanding], yielding a substantial speed-up. Evaluating IFs is however non-trivial, as it involves inverting the Hessian of the model, and can be sensitive to factors such as model complexity [basu2020influence] and noise [teso2021interactive]. This has prompted researchers to develop more scalable and robust algorithms for IFs [guo2020fastif] as well as alternative strategies to approximate example relevance [yeh2018representer, khanna2019interpreting].

Figure 2: Illustration of two explanation strategies based on local explanations. Left: The machine explains its predictions by highlighting relevant input variables – in this case, relevant pixels – and the user replies with an improved attribution map. Right: The machine justifies its predictions in terms of training examples that support them, and the user either corrects the associated label. In both cases, the data and model are aligned to the user’s feedback.

3.2 Interacting via Input Attributions

One line of work on integrating input attributions and interaction is eXplanatory Interactive Learning (XIL) [teso2019explanatory, schramowski2020making]. In the simplest case, XIL follows the standard active learning loop [settles2012active], except that whenever the machine queries the label of a query instance, it also presents a prediction for that instance and a local explanation

for the prediction. Contrary to EluciDebug, which is designed for interpretable classifiers, in XIL the model is usually a black-box, e.g., a support vector machine or a deep neural network, and explanations are extracted using input attribution methods like LIME 

[teso2019explanatory] or GradCAM [schramowski2020making]. At this point, the annotator supplies a label for the query instance – as in regular active learning – and, optionally, corrective feedback on the explanation. The user can, for instance, indicate what input variables the machine is wrongly relying on for making its prediction. Consider Fig. 2 (left): here the query instance depicts a husky dog, but the model wrongly classifies it as a wolf based on the presence of snow in the background (in blue), which happens to correlate with wolf images in the training data. To correct for this, the user indicates that the snow should not be used for prediction (in red).

XIL then aligns the model based on this corrective feedback. CAIPI [teso2019explanatory], the original implementation of XIL, achieves this using data augmentation. Specifically, CAIPI makes a few copies of the target instance (e.g., the husky image in Fig. 2) and then randomizes the input variables indicated as irrelevant by the user while leaving the label unchanged, yielding a small set of synthetic examples. These are then added to the training set and the model is retrained. Essentially, this teaches the model to classify the image correctly

without relying on the randomized variables

. Data augmentation proved effective at debugging shallow models for text classification and other tasks [teso2019explanatory, slany2022caipi] and at reducing labeling effort [slany2022caipi], at the cost of requiring extra space to store the synthetic examples.

A more refined version of CAIPI [schramowski2020making] solves this issue by introducing two improvements: LIME is replaced with GradCAM [selvaraju2017grad], thus avoiding sampling altogether, and the model is aligned using a generalization of the right for the right reasons (RRR) [ross2017right] modified to work with GradCAM. Essentially, the RRR loss penalizes the model proportionally to the relevance that its explanations assign to inputs that have been indicated as irrelevant by the user. Combining it with a regular loss for classification (e.g., the categorical cross-entropy loss) yields an end-to-end differentiable training pipeline that can be optimized using regular back-propagation. This in turn leads to shorter training times, especially for larger models, without the need for extra storage space. This approach was empirically shown to successfully avoid Clever Hans behavior in deep neural networks used for hyperspectral analysis of plant phenotyping data [schramowski2020making].

Several alternatives to the RRR loss have been developed. The Contextual Decomposition Explanation Penalization (CDEP) [rieger2020interpretations] follows the same recipe, but it builds on Contextual Decomposition [singh2018hierarchical], an attribution technique that takes relationships between input variables into account. The approach of yao2021refining enables annotators to dynamically explore the space of feature interactions and fine-tune the model accordingly. The Right for Better Reasons (RBR) loss [shao2021right] improves on gradient-based attributions by replacing input gradients with their influence function [koh2017understanding], and it was shown to be more robust to changes in the model and speed up convergence. Human Importance-aware Network Tuning (HINT) [selvaraju2019taking] takes a different route in that it rewards the model for activating on inputs deemed relevant by the annotator. Finally, teso2019toward introduced a ranking loss designed to work with partial and possibly sub-optimal explanation corrections. These methods have recently been compared in the context of XIL by friedrich2022typology. There, the authors introduce a set of benchmarking metrics and tasks and conclude that the “no free lunch” theorem [wolpert1997no] also holds for XIL, i.e., no method exceeds on all evaluations.

ALICE [liang2020alice] also augments active learning, but it relies on contrastive explanations. In each interaction round, the machine selects a handful of class pairs, focusing on classes that the model cannot discriminate well, and for each of them asks an annotator to provide a natural language description of what features enable them to distinguish between the two classes. It then uses semantic parsing to convert the feedback into rules and integrates it by “morphing” the model architecture accordingly. parkash2012attributes, Biswas2013, on the other hand, enable users to specify what attributes make an instance a negative, and use them to acquire negative examples using pre-trained attribute classifiers.

FIND is an alternative approach for interactively debugging models for natural language processing tasks 

[lertvittayakumjorn2020find]. What sets it apart is that interaction is framed in terms of sets of local explanations. FIND builds on the observation that, by construction, local explanations fail to capture how the model behaves in regions far away from the instances for which the user receives explanations. This, in turn, complicates acquiring high-quality supervision and allocating trust [wu2019local, popordanoska2020machine]. FIND addresses this issue by extracting those words and -grams that best characterize each latent feature acquired by the model, and then visualizing the relationship between words and features using a word cloud. The characteristic words are obtained using layer-wise relevance propagation [bach2015pixel], a technique akin to input gradients, to all examples in the training set. Based on this information, the user can turn off those latent features that are not relevant for the predictive task. For instance, if a model has learned a latent feature that strongly correlates with a polar word like “love” and uses it to categorize documents into non-polar classes such as spam and non-spam, FIND allows to instruct the model to no longer rely on this feature. This is achieved by introducing a hard constraint directly into the prediction process. Other strategies for overcoming the limits of explanation locality are discussed in Section 4.

3.3 Interacting via Example Attributions

Example attributions also have a role to play in interactive debugging. Existing strategies aim to uncover and correct cases where a model relies on “bad” training examples for its predictions, but target different types bugs and elicit different types of feedback. HILDIF [zylberajch2021hildif] uses a fast approximation of influence functions [guo2020fastif] to identify examples in support of a target prediction and and then asks a human-in-the-loop to verify whether the their level of influence is justified. It then calibrates the influence of these examples on the model via data augmentation. Specifically, HILDIF tackles NLP tasks and it augments those training examples that are most relevant, as determined by the user, by replacing words by synonyms, effectively boosting their relative influence compared to the others. In this sense, the general idea is reminiscent of XIL, although viewed from the lens of example influence rather than attribute relevance.

A different kind of bug occurs when a model relies on mislabeled examples, which are frequently encounteded in applications, especially when interacting with (even expert) human annotators [frenay2014classification]. CINCER [teso2021interactive] offers a direct way to deal with this issue in sequential learning tasks. To this end, in CINCER the machine monitors incoming examples and asks a user to double-check and re-label those examples that look suspicious [zeni2019fixing, bontempelli2020learning]. A major issue is that the machine’s skepticism may be wrong due to – again – presence of mislabeled training examples. This begs the question: is the machine skeptical for the right reasons? CINCER solves this issue by identifying those training examples that most support the model’s skepticism using IFs, giving the option of relabeling the suspicious incoming example, the influential examples, or both. This process is complementary to HILDIF, as it enables stakeholders to gradually improve the quality of the training data itself, and – as a beneficial side-effect – also to calibrate its influence on the model.

3.4 Benefits and Limitations

Some of the model alignment strategies employed by the approaches overviewed in this section were born for offline alignment task [ross2017right, ghaeini2019saliency, selvaraju2019taking, hase2021can]. A major issue of this setting is that it is not clear where the ground-truth supervision comes from, as most existing data sets do not come with dense (e.g., pixel- or word-level) relevance annotations. In interactive the settings we consider, this information is naturally provided by a human annotator. Another important advantage is that in interactive settings users are only required to supply feedback tailored for those buggy behaviors exhibited by the model, dramatically reducing the annotation burden. The benefits and potential drawbacks of explanatory interaction are studied in detail by ghai2021explainable.

4 Interacting via Global Explanations

All explanations discussed so far are local, in the sense that they explain a given decision of a specific input . Local explanations enable the user to build a mental model of individual predictions, however bringing together the information gleaned by examining a number of individual predictions may prove challenging to get an overall understanding of a model. Lifting this restriction immediately gives regional explanations, which are guaranteed to be valid in a larger, well-defined region around

. For instance, the explanations output by decision trees are regional in this sense. Global explanations aim to describe the behavior of

across all of its domain [guidotti2018survey] providing an approximate overview of the model [craven1995extracting, deng2019interpreting]. One approach is to train a directly interpretable model such as a decision tree or a rule set, using the same training data optimised for the interpretable model to behave similarly to the original model. This provides a surrogate white-box model of which can then be used as an approximate global map of  [guidotti2018survey, lundberg2020local]. Other approaches start with local or regional explanations and merge them to provide a global explainer [lundberg2020local, setzu2021glocalx].

Figure 3: Illustration of rule-based explanation feedback. Left: Rule-based explanations shown to the user to support the prediction. Right: The input instance does not satisfy any rules to support approve, so the clauses the input instance violates can be shown so the user can validate whether those reasons should be upheld.

4.1 Rule-based Explanations

An example rule-based explanation on the loan request approval could be “Approve if AND ”, as shown in Fig. 1 and Fig. 3. Rule-based explanations decompose the models predictions into simple atomic elements either via decision trees, rule lists or rule sets. While decision trees are similar to rule lists and sets in terms of how they logically represent the decision processes, it is not always the case that decision trees are interpretable from a user perspective, particularly when the number of nodes are large [narayanan2018humans]. As a result, for explanations to facilitate user interaction, the size and complexity of the number of rules and clauses need to be considered.

LORE [guidotti2018local]

is an example of a local rule-based explainer, which builds an interpretable predictor by first generating a balanced set of neighbour instances of a given data instance through an ad-hoc genetic algorithm, and then building a decision tree classifier out of this set. From this classifier, a local explanation is then extracted, where the local explanation is a pair of logical rules, corresponding to the path in the tree and a set of counterfactual rules, explaining which conditions need to be changed for the data instance to be assigned the inverted class label by the underlying ML model. Both LIME 

[ribeiro2016should], and Anchors [ribeiro2018anchors] are other examples which uses a similar intuition such that, even if the decision boundary for the black box ML model can be arbitrarily complex over the whole data space for a given data instance, there is a high chance that the decision boundary is clear and simple in the neighborhood of a data point which can be captured by an interpretable model. Nanfack et al. learn a global decision rules explainer using a co-learning approach to distill the knowledge of a blackbox model into the decision rules [nanfack21a].

Rule-based surrogates have the advantage of being interpretable however, in order to achieve coverage, the model must add rules to cover increasingly narrow slices which can in turn negatively impact interpretability. BRCG [dash2018boolean] and GLocalX [setzu2021glocalx] trade-off fidelity with interpretability along with compactness of the rules to produce a rule-based representation of the model, which makes them more appropriate as the basis for user feedback.

4.2 Interacting via Rule-based Explanations

Rule-based explanations are logical representations of a machine learning model’s decision making process which have the benefit of being interpretable to the user. This logical representation can then be modified by end users or domain experts, and these edits bring the advantage that the expected behaviour is more predictable to the end user. Another key advantage of rule-based surrogate models is that, they provide a global representation of the underlying model rather than local explanations which make it more difficult for a user to build up an overall mental model of the system. Additionally, if the local explanations presented to the user to learn and understand the machine learning model are not well distributed, they may create a biased view of the model, leading users to trust a model that for some regions has inaccurate logic.


produce decision sets where the decision set effectively becomes the predictive model. Popordanoska et al. supports user feedback by generating a rule-based system from data that the user may modify which is then used as a rule-based executable model


daly2021user present an algorithm which brings together a rule-based surrogate derived by dash2018boolean and an underlying machine learning model to support user provided modifications to existing rules in order to incorporate user feedback in a post-processing approach. Similar to CAIPI [teso2019explanatory, schramowski2020making] predictions and explanations are presented to experts, however the explanations are rule-based. The user can then specify changes to the labels and rules by adjusting the clauses of the explanation. For example, the user can modify a value such as change to , remove a clause such as or add a clause such as an additional feature requirement such as . The modified labels and clauses are stored together with transformations that map between original and feedback rules as an Overlay or post-processing approach to the existing model. In a counterfactual style approach, the transformation function is applied to relevant new instances presented to the system in order to understand if the user adjustment influences the final prediction. One advantage of the combined approach is that the rule-based surrogate does not need to encode the full complexity of the model as the underlying prediction still comes from the ML model. The rule-based explanations are used as the source of feedback to provide corrections to key variables. Results showed this approach can support updates and edits to an ML model without retraining as a ’band-aid’, however once the intended behaviour diverges too much from the underlying model, results deteriorate. alkan2022frote similarly use rules as the unit for feedback where the input training data is pre-processed in order to produce an ML model that aligns with the user provided feedback rules [alkan2022frote]. The FROTE algorithm generates synthetic instances that reflect both the feedback rules as well as the existing data and has the advantage of encoding the user feedback into the model. As with the previous solution, the advantage is the rules do not need to reflect the entire model, but can focus on the regions where feedback or correction is needed. The results showed that the solution can support modifications to the ML model even when they diverge quite significantly from that of the existing model. ratner2017snorkel combines data-sources to produce weak labels and one source of labels they consider are user provided rules or labelling functions which can then be tuned. A recent work [Lijie2022] explored an explanation-driven interactive machine learning (XIML) system with the Tic-Tac-Toe game as a use case to understand how an XIML mechanism effects users’ satisfaction with the underlying ML solution. A rule based explainer, BRCG [dash2018boolean], is used for generating explanations, and authors experimented on different modalities to support user feedback through visual or rule-based corrections. They conducted a user study in order to test the effect of allowing users to interact with the rule-based system through creating new rules or editing or removing existing rules. The results of the user study demonstrated that allowing interactivity within the designed XIML system leads to increased satisfaction for the end users.

4.3 Benefits and limitations

An important advantage of leveraging rules to enable user feedback is that, editing a clause can impact many different data points which can aid in reducing the cognitive load for the end user and making the changes more predictable. When modifying a Boolean clause, the feedback and the intended consequences are clearer in comparison to alternative techniques such as re-weighting feature importance or a training data point. However, one challenge with rule based methods is that, feedback can be conflicting in nature, therefore some form of conflict resolution is needed to be implemented in order to allow experts to resolve collisions [alkan2022frote]. Additionally, rules can be probabilistic and eliciting such probabilities from experts can be difficult.

An additional challenge is that, most of the existing rule-based solutions only build upon original features in the data. However, recent work has started to consider this direction for example [alaa2019demystifying] produces a symbolic equation as a white-box model and [santurkar2021editing] supports editing concept based rules for image classification. In the next section, let us therefore continue with a branch of work that potentially allows for interactions on a local and global level via concept-based explanations.

5 Interacting Using Concept-Based Explanations

An advantage of input attributions is that they can be extracted from any black-box model without the need for retraining, thus leaving performance untouched. Critics of these approaches, however, have raised a number of important issues [rudin2019stop]. Perhaps the most fundamental ones, particularly from an interaction perspective, are the potential lack of faithfulness of post-hoc explanations and that input-based explanations are insufficient to accurately capture the reasons behind a model’s decision, particularly when these are abstract [stammer2021right, viviano2021saliency]. Consider an input attribution highlighting a red sports car: does the model’s prediction depend on the fact that a car is present, that it is a sports car, or that it is red? This lack of precision can severely complicate interaction and revision.

A possible solution to this issue is to make use of white-box models, which – by design – admit inspecting their whole reasoning process. Models in this category include, e.g., shallow decision trees [angelino17corel, rudin2019stop] and sparse linear models [ustun2016supersparse] based on human-understandable features. These models, however, do not generally support representation learning and struggle with sub-symbolic data.

A more recent solution are concept-based models (CBMs), which combine ideas from white and black-box models to achieve partial, selective interpretability. Since it is difficult – and impractical – to make every step in a decision process fully understandable, CBMs generally break down this task into two levels of processing: a bottom level, where one module (typically black-box) is used for extracting higher-level concept vectors

, with , from raw inputs, and a more transparent, top level module in which a decision is made based on the concepts alone. Most often, the top layer prediction is obtained by performing a weighted aggregation of the extracted concepts. Fig. 1 provides an example of such concept vectors extracted from the raw data in the context of loan requests. Such concept vectors are often of binary form, e.g. a person applying for a loan is either considered a professional or not. The explanation finally corresponds to importance values on these concept vectors.

CBMs combine two key properties. First, the prediction is (roughly) independent from the inputs given the concept vectors. Second, the concepts are chosen to be as human understandable as possible, either by designing them manually or through concept learning, potentially aided by additional concept-level supervision. Taken together, these properties make it possible to faithfully explain a CBMs predictions based on the concept representation alone, thus facilitating interpretability without giving up on representation learning. Another useful feature of CBMs is that they allow for test-time interventions to introspect and revise a model’s decision based on the individual concept activations [koh2020concept].

Research on CBMs has explored different representations for the higher-level concepts, including (i) autoencoder-based concepts obtained in an unsupervised fashion and possibly constrained to be independent and dissimilar from each other 

[alvarez2018towards]. (ii) prototype representations that encode concrete training examples or parts thereof [chen2019looks, hase2019interpretable, rymarczyk2020protopshare, nauta2021neural, barnett2021iaia], (iii) concepts that are explicitly aligned to concept-level supervision provided upfront [koh2020concept, chen2020concept], (iv) white-box concepts obtained by interactively eliciting feature-level dependencies from domain experts [lage2020learning].

The idea and potential of symbolic, concept-based representations is also found in neuro-symbolic models, although this branch of research was developed from a different standpoint than interpretability alone. Specifically, neuro-symbolic models have recently gained increased interest in the AI community [garcez2012neural, GarcezGLSST19, Yi0G0KT18, wagner2021neural] due to their advantages in terms of performance, flexibility and interpretability. The key idea is that these approaches combine handling of sub-symbolic representations with human-understandable, symbolic latent representations. Although it is still an open debate on whether neuro-symbolic approaches are ultimately preferable over purely subsymbolic or symbolic approaches, several recent works have focused on the improvements and richness of neuro-symbolic explanations in the context of understandability and the possibilities of interaction that go beyond the approaches of the previous sections. Notably, the distinction between CBMs and neuro-symbolic models can be quite fuzzy, with CBMs possibly considered as one category of neuro-symbolic AI [SarkerZEH21, YehKR21].

Although research on concept-level and neuro-symbolic explainability is a flourishing branch of research, it remains quite recent and only selected works have incorporated explanations in an interactive setting. In this section we wish to provide details on these works, but also mention noteworthy works that show potential for leveraging explanations in human machine interactions.

5.1 Interacting with Concept-based Models

Figure 4: Illustration of providing feedback on concept learning and concept aggregation strategies. Left: In concept learning a model learns a set of basic concepts that are present in the dataset. Hereby it learns to grounding specific features of the dataset on symbolic concept labels. Right: Given a set of known basic concepts a model could falsely aggregate the concept activations leading to a false final prediction. Users can provide feedback on the concept explanations.

Like all machine learning models, CBMs are also prone to erroneous behaviour and may require revision and debugging by human experts [bahadori2021debiasing]. The special structure of CBMs poses challenges that are specific to this setting. A major issue that arises in the context of CBMs is that not only the weights used in the aggregation of the concept activations can be faulty and require adjustment, but the concepts themselves – depending on how they are defined or acquired – can be insufficient, incorrect, or uninterpretable [lage2020learning, kambhampati21linguafranca].

These two steps have mostly been tackled separately. Fig. 4 gives a brief sketch of this where a human user can guide a model in learning the basic set of concepts from raw images (left), but also provide feedback on the concept aggregations, e.g. in case relevant concepts are ignored for the final class prediction (right).

Several works have tackled interactively learning concepts. For instance, lage2020learning focus on human-machine concept alignment, and propose an interactive loop in which a human expert directly guides concept learning. Specifically, the machine elicits potential dependencies between inputs and concepts by asking questions like, e.g., “does the input variable lorazepam contribute to the concept depression?”. This is reminiscent of FIND [lertvittayakumjorn2020find], but the queried dependencies are chosen by the machine so to be as informative as possible. In their recent work, stammer2022interactive propose to use prototype representations for interactively learning concepts and thereby grounding features of image data to discrete (symbolic) concept representations. Via these introspectable encodings a human user can guide the concept learning by directly giving feedback on the prototype activations or by providing paired samples that possess specific concepts which the model should learn. The sample pairing feedback is reminiscent of Shao et al. [shao2022right] who proposed debiasing generative models via weak supervision. bontempelli2022concept on the other hand propose debugging part-prototype networks via interactive supervision on the learned part-prototypes using a simple interface in which the user can determine signal that concept is valid or not valid for a particular precision using a single click. They also remark that this kind of concept-level feedback on concepts generalizes is very rich, in that it naturally generalizes to all instances that feature those concepts, facilitating debugging from few interaction rounds.

Other works focus on the concept aggregation step. For instance, teso2019toward applied explanatory interactive learning to self-explainable neural networks, enabling end-users to fine-tune the aggregation weights [alvarez2018towards]. An interesting connection to causal learning is made by bahadori2021debiasing, who present an alternative and principled approach for debiasing CBMs from confounders. These two strategies, however, assume the concepts to be given and fixed, which is often not the case.

Finally, bontempelli2021toward outline a unifying framework for debugging CBMs that clearly distinguishes between bugs in the concept set and bugs in the aggregation step, and advocate a multi-step procedure that encompasses determining where the source of the error lies and providing corrective supervision accordingly.

5.2 Interacting with Neuro-symbolic Models

Neuro-symbolic models, similar to CBMs, also support communicating with users in terms of symbolic, higher-level concepts as well as more general forms of constraints and knowledge. There are different ways in which these symbols can be presented and the interaction can be structured.

Ciravegna et al. [ciravegna2020human]

, propose to integrate deep learning with expressive human-driven first-order logic (FOL) explanations. Specifically, a neural network maps to a symbolic concept space and is followed by an additional network that learns to create FOL explanations from the latent concept space. Thanks to this direct translation into FOL explanations, it is in principle easy to integrate prior knowledge from expert users as constraints onto the model.

On the other hand Stammer et al. [stammer2021right] focus more on human-in-the-loop learning. With their work, the authors show that receiving explanations on the level of the raw input – as done by standard approaches presented in the previous sections – can be insufficient for removing Clever Hans behavior, and show how this problem can be solved by integrating and interacting with rich, neuro-symbolic explanations. Specifically, the authors incorporate a right for the right reason loss [ross2017right] on a neural reasoning module which receives multi-object concept activations as input. The approach allows for concept-level explanations, and user feedback is cast in terms of relational logic statements.

The benefits of symbolic explanations and feedback naturally extend from supervised learning and computer vision domains, as in the previous works, to settings like planning 


and, most recently, reinforcement learning (RL). In particular, in the context of deep RL, Guan et al. 

[guan2021widening] provide coarse symbolic feedback in the form of object-centric image regions to accompany binary feedback on an agent’s proposed actions. Another interesting use of symbolic explanations in RL is that of Zha et al. [zha21ambdemonstrations], in which an RL agent learns to better understand human demonstrations by grounding these in human-aligned symbolic representations.

5.3 Benefits and Limitations

The common underlying motivation of all these approaches is the potential of symbolic communication to improve both precision and bandwidth of explanatory interaction between (partially sub-symbolic) AI models and human stakeholders. A recent paper by Kambhampati et al. [kambhampati21linguafranca] provides an excellent motivation on the importance of this property as well as important remaining challenges. The main challenge of concept-based and neuro-symbolic models lies in identifying a set of basic concepts or symbols [YehKR21] and grounding a model’s latent representations on these. Though recent works, e.g. lage2020learning and stammer2022interactive, have started to tackle this it remains an important issue to solve particularly for real world data.

6 Open Problems

Despite recent progress on integrating explanations and interaction, many unresolved problems remain. In this section, we outline a selection of urgent open issues and, wherever possible, suggest possible directions forward, with the hope of spurring further research on this important topic.

6.1 Handling Human Factors

Machine explanations are only effective insomuch as they are understood by the person at the receiving end. Simply ensuring algorithmic transparency, which is perhaps the major focus in current XAI research, gives few guarantees, because understanding strongly depends on human factors such as mental skills and familiarity with the application domain [sokol2020one, liao2021human]. As a matter of fact, factual but ill-designed machine guidance can actively harm understanding [ai2021beneficial].

Perhaps the most critical element for successful explanation-based interaction requires is that the user and the machine to agree on the semantics of the explanations they exchange. This is however problematic, partly because conveying this information to users is non-trivial, and partly because said semantics are often unclear or brittle to begin with, as discussed in Sections 6.3 and 6.2. The literature does provide some guidance on what classes of explanations may be better suited for IML. Existing studies suggest that people find it easier to handle explanations that express concrete cases [kim2014bayesian] and that have a counterfactual flavour [wachter2017counterfactual], and that breaking larger computations into modules facilitates simulation [lage2019human], but more work is needed to implement and evaluate these suggestions in the context of explanation-based IML. Moreover, settings in which users have also to manipulate or correct the machine’s explanations, like interactive debugging, impose additional requirements. Another key issue is that of cognitive biases. For instance, human subjects tend to pay more attention to affirmative aspects of counterfactuals [byrne2019counterfactuals], while AIs have no such bias. Coping with these human factors requires to design appropriate interaction and incorporation strategies, and it is necessary for correct and robust operation of explanation-based IML.

We also remark that different stakeholders inevitably need different kinds of explanations [miller2019explanation, liao2021human]. A promising direction of research is to enable users to customize the machine’s explanations to their needs [sokol2020one, finzel2021explanation]. Challenges on this path include developing strategies for eliciting the necessary feedback and assisting users in exploring the space of alternatives.

6.2 Semantics and Faithfulness

Not all kinds of machine explanations are equally intelligible and not all XAI algorithms are equally reliable. For instance, some gradient-based attribution techniques fail to satisfy intuitive properties (like implementation invariance [sundararajan2017axiomatic]) or ignore information at the top layers of neural networks [adebayo2018sanity], while sampling-based alternatives may suffer from high variance [zhang2019should, teso2019toward]. A number of other issues have been identified in the literature [hooker19roar, kindermans2019reliability, adebayo20debugging, sixt2020explanations, kumar2020problems]. The semantics of transparent models is also not always well-defined. For instance, the coefficients of linear models are often viewed as capturing feature importance in an additive manner [ribeiro2016should], but this interpretation is only valid as long as the input features are independent, which is seldom the case in practice. Decision tree-based explanations have also received substantial scrutiny [izza2020explaining].

Another critical element is faithfulness. The reason is that bugs identified by unfaithful explanations may be artifacts in the explanation rather than actual issues with the model, meaning that asking users to correct them is not only deceptive, but also uninformative for the machine and ultimately wasteful [teso2019explanatory]. The distribution of machine explanations is equally important for ensuring faithfulness: individually faithful local explanations that fail to cover the whole range of machine behaviors may end up conveying an incomplete [lertvittayakumjorn2020find] and deceptively optimistic [popordanoska2020machine] picture of the model’s logic to stakeholders and annotators.

Still, some degree of unfaithfulness is inevitable, for both computational and cognitive reasons. On the one hand, interaction should not overwhelm the user [kulesza2015principles]. This entails presenting a necessarily simplified picture of the (potentially complex) inference process carried out by the machine. On the other hand, extracting faithful explanations often comes at a substantial computational cost [van2021tractability], which is especially problematic in interactive settings where excessive repeated delays in the interaction can estrange the end-user. Alas, more light-weight XAI strategies tend to rely on approximations and leverage less well-defined explanation classes.

6.3 Abstraction and Explanation Requirements

Many attribution methods are restricted to measuring relevance of individual input variables or training examples. In stark contrast, explanations used in human–human communication convey information at a more abstract, conceptual level, and as such enjoy improved expressive power. An important open research question is how to enable machines to communicate using such higher-level concepts, especially in the context of the approaches discussed in Section 5.

This immediately yields the issue of obtaining a relevant, user-understandable symbolic concept space [kambhampati21linguafranca, rudin2022challenges]. This is highly non-trivial. In many cases it might not be obvious what the relevant higher-level concepts should be, and more generally it is not clear what properties should be enforced on these concepts – when learned from data – so to encourage interpretability. Existing candidates include similarity to concrete examples [chen2019looks], usage of generative models [karras19stylegan], and enforcing disentanglement among concepts [scholkopf2021toward, stammer2022interactive]. One critical challenge is that imperfections in the learned concepts may compromise the predictive power of a model as well as the the semantics of explanations while being hard to spot [nauta2020looks, hoffmann2021looks, kraft2021sparrow, mahinpei2021promises, margeloiu2021concept], calling for the development of concept-level debugging strategies [bontempelli2021toward]. Additionally, assuming a basic set of concepts has been identified, it seems likely that this set will not be sufficient and should allow for expanding [kambhampati21linguafranca].

On the broader topic of explanation requirements, liao2021human discuss many different aspects that XAI brings, such as the diverse explainability needs of stakeholders due to the no “one-fits-all” solutions from the collection of XAI algorithms, and pitfalls of XAI in the sense that there can be a gap between algorithmic explanations that several XAI works provide and the actionable understanding that these solutions can facilitate. One important statement that is highlighted in [liao2021human] is that, closing the gap between technicality of XAI and the user’s engagement with the explanations requires considering possible user interactions with XAI, and operationalizing human-centered perspectives in XAI algorithms.

This latter point also requires developing evaluation methods that better consider the actual user needs in the downstream usage contexts. Tis further raises the question of whether “good” explanations actually exist and how one can quantify these. One interesting direction forward is to consider explanation approaches in which a user can further query an initial model’s explanation similar to how humans provide additional (detailed) queries in case the initial explanation is confusing or insufficient.

6.4 Modulating and Manipulating Trust

In light of the ethical concern of deploying ML models in more real-world applications, the field of trustworthy ML has grown, which studies and pursues desirable qualities such as fairness, explainability, transparency, privacy and robustness [Varshney2019]. As discussed by liao2021human, explainability has moved beyond providing details to comprehend the ML models being developed, and it has rather become an essential requirement for people to trust and adopt AI and ML solutions.

However, the relationship between (high-quality) explanations and trust is not straightforward. One reason is that explanations are not the only contributing factor [wang2018my]. However, while user studies support the idea that explanations enable stakeholders to reject trust in misbehaving models, the oft stated claim that explanations help to establish trust into deserving models enjoys less empirical support. This is related to a trend, observed in some user studies, that participants may put too much trust in AI models to begin with [schramowski2020making], thus making the effects of additional explanations less obvious. Understanding also plays a role. Failure to understand an explanation may drive users to immediately and unjustifiably distrust the system [honeycutt2020soliciting] and, rather surprisingly, in some cases the mere fact of being exposed to the machine’s internal reasoning may induce a loss of trust [honeycutt2020soliciting]. More generally, the link between interaction and trust is under-explored.

Another important issue that explanations can be intentionally manipulated by malicious parties so to persuade stakeholders into trusting unfair or buggy systems [dombrowski2019explanations, heo2019fooling, anders2020fairwashing]. This is a serious concern for the entire enterprise of explainable AI and therefore for explanatory interaction. Despite initial efforts on making explanations robust against such attacks, more work is still needed to guarantee safety for models used in practice.

6.5 Annotation Effort and Quality

Explanatory supervision comes with its own set of challenges. First and foremost, just like other kinds of annotations, corrections elicited from human supervisors are bound to be affected by noise, particularly when interacting with non-experts [frenay2014classification]. Since explanations potentially convey much more information than simple labels, the impact of noise is manifold. One option to facilitate ensuring high-quality supervision is that of providing human annotators with guidance, cf. [cakmak2014eliciting], but this cannot entirely prevent annotation noise. In order to cope with this, it is critical to develop learning algorithms that are robust to noisy explanatory supervision. An alternative is to leverage interactive strategies for enabling users to identify and rectify mislabeled examples identified by the machine [zeni2019fixing, teso2021interactive].

Some forms of explanatory supervision – for instance, pixel-level relevance annotations – require higher effort on the annotator’s end. This extra cost is often justified by the larger impact that explanatory supervision has on the model compared to pure label information, but in most practical application effort is constrained by a tight budget and must be kept under control. Analogously to what is normally done in interactive learning [settles2012active], doing so involves developing querying strategies that only eliciting explanatory supervision when strictly necessary (e.g., when label information is not enough) and to efficiently identify the most informative queries. To the best of our knowledge, this problem space is completely unexplored.

6.6 Benchmarking and Evaluation

Evaluating and benchmarking current and novel approaches that integrate explanations and interaction is particularly challenging. There are several reasons for this. One reason is the effort in providing extensive user studies, where many recent studies tend to focus on simulated user interactions for evaluations. The difficulties for this are not just the participant organization and proper study design itself (which can lead to many pitfalls if not properly devised), but also from the engineering perspective a swift user feedback integration is not immediate in all studies. Thus, in many cases additional engineering and an extensive user study remain necessary before real-world deployment.

Further reasons are the individual use case of a method, but also the possibly high variance in user’s feedback which make it challenging to assess a methods properties with one task and study alone. A very important branch for future research is thus to develop more standardized benchmarking tasks and metrics for evaluating such methods, where friedrich2022typology

provide an initial set of important evaluation metrics and tasks for future research on XIL.

7 Related Topics

Research on integrating interaction and explanations builds on and is related to a number of different topics. The main source of inspiration is the vast body of knowledge on explainable AI, which has been summarized in several overviews and surveys [guidotti2018survey, ras2022explainable, liao2021human, gilpin2018explaining, montavon2018methods, adadi2018peeking, carvalho2019machine, sokol20factsheets, belle2021principles]. A major difference to this literature is that in XAI the communication between machine and user stops after receiving the machine’s explanations.

Recommender systems are another area where explanations have found wide applicability, with a large number of approaches being proposed and applied in real-world systems in recent years. Compatibly with our discussion, explanations has been shown to improve transparency [Rashmi2002, Nina2015] and trustworthiness [Zhang2018] of recommendations, to help users make better decisions, and to enable users to provide feedback to the system by correcting any incorrect assumptions that the recommender has made for their interests [Alkan2019]. With critiquing-based recommenders, users can critique the presented suggestions and provide their preferences [Antognini2021InteractingWE, Wu2019critique, chen2012critiquing]. Here we focus on parallel advancements made in interactive ML, and refer the reader to zhang2020explainable for a comprehensive review of explainable recommender systems.

The idea of using explanations as supervision in ML can be traced back to explanation-based learning [mitchell1986explanation, dejong1986explanation], where the goal is to extract logical concepts by generalizing from symbolic explanations, although the overall framework is not restricted to purely logical data.

Another major source of inspiration are approaches for offline explanation-based debugging, a topic has recently received substantial attention, especially in natural language processing community [lertvittayakumjorn2021explanation]. This topic encompasses, for instance, learning from annotator rationales [zaidan2007using], i.e., from snippets of text appearning in a target sentence that support a particular decision, and that effectively the same role as input attributions. Recent works have extended this setup to complex natural language inference tasks [camburu2018snli]. Another closely related topic is learning from feature-level supervision, which has been used to complement standard label supervision for learning higher quality predictors [decoste2002training, raghavan2006active, raghavan2007interactive, druck2008learning, druck2009active, attenberg2010unified, small2011constrained, settles2011closing]. These earlier approaches assume (typically complete) supervision about attribute-level relevance to be provided in advance, independently from the model’s explanations, and focus on shallow models. More generally, human explanations can be interpreted as encoding prior knowledge and thus used guide the learning process towards better aligned models faster [hase2021can]. This perspective overlaps with and represents a special case of informed ML [vonruede21informed, beckh2021explainable].

Two other areas where explanations have been used to inform the learning process are explanation-based distillation and regularization. The aim of distillation is to compress a larger model into a smaller one for the purpose of improving efficiency and storage costs. It was shown that paying attention to the explanations of the source model can tremendously (and provably) improve the sample complexity of distillation [milli2019model], hinting at the potential of explanations as a rich source of supervision. Regularization instead aims at encouraging models to produce more simulatable [wu2018beyond, wu2020regional] or faithful explanations [plumb2020regularizing] by introducing an additional penalty into the learning process. Here, however, no explanatory supervision is involved.

Two types of explanations that we have not delved into are counterfactuals and attention. Counterfactuals identify changes necessary for achieving alternative and possibly more desirable outcomes [wachter2017counterfactual], for instance what should be changed in a loan application in order for it to be approved. They have become popular in XAI as a mean to help stakeholders to form actionable plans and control the system’s behavior [lim2012improving, karimi2021algorithmic], but have recently shown promise as a mean to design novel interaction strategies [kaushik2019learning, wu2021polyjuice, de2022generating]. Attention mechanisms [bahdanau2014neural, vaswani2017attention] also offer insight into the decision process of neural networks and, although their interpretation is somewhat controversial [bastings2020elephant], they are a viable alternative to gradient-based attributions for integrating explanatory feedback into the model in an end-to-end manner [mitsuhara2019embedding, heo2020cost].

Finally, another important aspect that we had to omit is causality [pearl2009causality], which is perhaps the most solid foundation for imbuing cause-effect relationships within ML models [chattopadhyay2019neural, xu2021causality], and conversely the ability to identify causal relationships between inputs and predictions constitutes a fundamental step towards explaining model predictions [Holzinger2019, geiger2021causal]. Work on causality in interactive ML is however sparse at best.

8 Conclusion

This overview provides a conceptual guide on current research on integrating explanations into interactive machine learning for the purpose of establishing a rich bi-directional communication loop between machine learning models and human stakeholders in a way that is beneficial to all parties involved. Explanations make it possible for users to better understand the machine’s behavior, spot possible limitations and bugs in its reasoning patterns, establish control over the machine, and modulate trust. At the same time, the machine obtains high-quality, informed feedback in exchange. We categorized existing approaches along four dimensions, namely algorithmic goal, type of machine explanations involved, human feedback received, and incorporation strategy, facilitating the identification of links between different approaches as well as respective strengths and limitations. In addition, we identified a number of open problems impacting the human and machine sides of explanatory interaction and highlighted noteworthy paths of future, with the goal of spurring further research into this novel and promising approach, helping to bridge the gap towards human-centric machine learning and AI.