Aligning Faithful Interpretations with their Social Attribution

06/01/2020 ∙ by Alon Jacovi, et al. ∙ 6

We find that the requirement of model interpretations to be faithful is vague and incomplete. Indeed, recent work refers to interpretations as unfaithful despite adhering to the available definition. Similarly, we identify several critical failures with the notion of textual highlights as faithful interpretations, although they adhere to the faithfulness definition. With textual highlights as a case-study, and borrowing concepts from social science, we identify that the problem is a misalignment between the causal chain of decisions (causal attribution) and social attribution of human behavior to the interpretation. We re-formulate faithfulness as an accurate attribution of causality to the model, and introduce the concept of "aligned faithfulness": faithful causal chains that are aligned with their expected social behavior. The two steps of causal attribution and social attribution *together* complete the process of explaining behavior, making the alignment of faithful interpretations a requirement. With this formalization, we characterize the observed failures of misaligned faithful highlight interpretations, and propose an alternative causal chain to remedy the issues. Finally, we implement highlight explanations of the proposed causal format using contrastive explanations.




1 Introduction

In formalizing the desired properties of a quality interpretation, the NLP community has settled on the key property of faithfulness lipton2016mythos; DBLP:journals/corr/abs-1711-07414; wiegreffe2019attentionisnotnot, or how “accurately” the interpretation represents the true reasoning process of the model.

Curiously, recent work in this area describes faithful interpretations of models as unfaithful when they fail to describe behavior that is “expected” of them sanjay2020, despite the interpretations seemingly complying with faithfulness definitions. To resolve this contradiction, we are interested in properly characterizing faithfulness, and in exploring what a quality explanation of a model decision must satisfy beyond faithfulness.

Several methods have been proposed to train and utilize models that faithfully “explain” their decisions with highlights (the guiding use-case in this work), otherwise known as extractive rationales [1], which aim to clarify what part of the input was important to the decision. The proposed models are modular in nature, composed of two stages of (1) selecting the highlighted text, and (2) predicting based on the highlighted text (select-predict, described in Section 2).

[1] In the scope of this work, we use the term “highlights” for this type of explanation. Although the term “rationale” lei16 is more commonly used for this format in the NLP community, it is controversial, as it has been used ambiguously for different things in NLP literature historically (e.g., DBLP:conf/naacl/ZaidanEP07; DBLP:conf/emnlp/BaoCYB18; eraser2019), it is a non-descriptive term and unintuitive without prior knowledge, and is seldom known or used outside of NLP disciplines. Most importantly, “rationalization” attributes human-like social behavior which is not necessarily compatible with the artificial explainer’s incentive, unless modeled explicitly. We elaborate on this justification further in Section 8.

We take an extensive and critical look at the formalization of faithfulness and of explanations, with textual highlights as an example use-case. In particular, the select-predict models with faithful highlights provide us with more questions than answers: we describe a variety of curious failure cases of such models in Section 4, and experimentally validate that the failure cases are indeed possible and do occur in practice. Concretely, the behavior of the selector and predictor in these models does not necessarily line up with the expectations of people viewing the highlight. Current literature in ML and NLP interpretability fails to provide a theoretical foundation to characterize these issues (Sections 4, 5).

In order to solve this problem, we turn to literature on the science of social explanations and how they are utilized and perceived by humans (Section 6). The social and cognitive sciences find that human explanations are composed of two equally important parts: the attribution of a causal chain to the decision process (causal attribution), and the attribution of social or human-like behavior to the causal chain (social attribution) miller2017social.

We reformalize faithfulness as the (accurate) attribution of a causal chain of reasoning steps to the model decision process, which we find is the true nature behind the vague definition provided for this term until now. Fatally, the second key component of human explanations, the social attribution of behavior, has been missing from current formalizations of the desiderata of artificial intelligence explanations in NLP. In Section 7 we define a faithful interpretation (a causal chain of decisions) as aligned with human expectations if it is adequately constrained by the social behavior attributed to it by human observers.

Armed with this knowledge, we are able to characterize the mysterious failures of select-predict models described in Section 4: In Section 8 we delve into the social attribution of highlights as explanation, outlining two possible attributions. We find that select-predict does not guarantee either. In Section 8.2 we propose an alternative causal chain of the form of predict-select-verify. We note that the social attribution of this causal chain is of highlights to serve as evidence towards the predictor’s decision, and we formalize the constraints necessary to enforce this effect.

Finally, in Section 9 we discuss an implementation of predict-select-verify, i.e., designing the components in the roles of predictor and selector. Designing the selector is non-trivial, as there are many possible options to select highlights that evidence the predictor’s decision, and we are only interested in selecting ones that are meaningful for the user to understand the decision. We leverage advancements from cognitive research on the internal structure of (human-given) explanations, dictating that explanations must be contrastive to hold tangible meaning to humans. We propose a classification predict-select-verify model which provides contrastive highlights (to our knowledge, a first in NLP), and qualitatively exemplify and showcase the solution.


We redefine “faithfulness” as the interpretation’s ability to represent the causal chain of model decisions, and formalize “aligned faithfulness” as the degree to which the causal chain is aligned with the social attribution of behavior that humans perceive from it. The new formalization allows us to identify issues with current models that derive faithful highlight interpretations, and design a new model pipeline that circumvents those issues. Finally, we propose an implementation of the new system with contrastive explanations, which are more intuitive and useful.

When designing interpretable models, we must formalize the social attribution of the interpretation—the set of constraints on model behavior to resemble human reasoning—in order to guarantee that the interpretation is aligned with expectations of human intent, in addition to being faithful.

2 Highlights as Faithful Interpretations

Highlights, also known as extractive rationales, are binary masks over a given input which imply some kind of behavioral interpretation of a particular model’s decision process to arrive at a decision on the input.

Formally, given input $x$ and model $m$, a highlight interpretation $h$ is a binary mask over $x$ which attaches a meaning to $m(x)$, where the portion of $x$ highlighted by $h$ was important to the decision.

This functionality of $h$ was interpreted by lei16 as an implication of a behavioral process of $m$, where the decision process is a modular composition of two unfolding stages:

  1. Selector component $m_s$ selects a binary highlight $h$ over $x$.

  2. Predictor component $m_p$ makes a prediction on the input $h \odot x$.

The final prediction of the system at inference is $m(x) = m_p(m_s(x) \odot x)$. We refer to $h = m_s(x)$ as the highlight and $h \odot x$ as the highlighted text.
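The two-stage composition can be illustrated with a minimal sketch. Everything named here (`apply_highlight`, `select_predict`, the `<mask>` placeholder, and the toy lexicon-based selector and predictor) is a hypothetical stand-in chosen for illustration, not one of the trained models discussed in this work; the only property it is meant to show is that the predictor sees nothing but the highlighted text.

```python
from typing import Callable, List, Sequence, Tuple

PAD = "<mask>"  # assumed placeholder for tokens hidden by the highlight


def apply_highlight(tokens: Sequence[str], mask: Sequence[int]) -> List[str]:
    """Return the highlighted text: tokens with mask 0 are replaced by PAD."""
    if len(tokens) != len(mask):
        raise ValueError("mask and input must have the same length")
    return [tok if keep else PAD for tok, keep in zip(tokens, mask)]


def select_predict(
    tokens: Sequence[str],
    selector: Callable[[Sequence[str]], List[int]],
    predictor: Callable[[Sequence[str]], str],
) -> Tuple[str, List[int]]:
    """Run the selector, then feed only the highlighted text to the predictor."""
    mask = selector(tokens)
    return predictor(apply_highlight(tokens, mask)), mask


# Toy components (hypothetical): a sentiment lexicon stands in for learned models.
POSITIVE = {"nice", "sturdy", "strong"}
NEGATIVE = {"expensive", "flimsy"}


def toy_selector(tokens: Sequence[str]) -> List[int]:
    return [1 if tok in (POSITIVE | NEGATIVE) else 0 for tok in tokens]


def toy_predictor(highlighted: Sequence[str]) -> str:
    score = sum(t in POSITIVE for t in highlighted) - sum(t in NEGATIVE for t in highlighted)
    return "positive" if score >= 0 else "negative"


label, mask = select_predict(
    "it is nice and sturdy but expensive".split(), toy_selector, toy_predictor
)
```

Because the returned mask is exactly what the selector produced and exactly what the predictor consumed, it satisfies the structural condition that the literature accepts as faithfulness, regardless of how sensible the selection is.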

A highlight interpretation can be faithful or unfaithful to a model. Literature accepts a highlight interpretation as faithful if the highlight was the output of the selector component, and the input to the predictor component, which outputs the final prediction. These claims are, of course, verifiable only in select-predict models (notwithstanding advances in the understanding of other opaque models), and thus faithful highlight interpretations are limited to models of this structure.


Various methods have been proposed to build modular systems with the ability to derive faithful highlights in this way. lei16 have proposed to train the selector and predictor end-to-end via the REINFORCE williams1992simple algorithm. bastings-etal-2019-interpretable have proposed to formalize the highlight as a latent variable sequence, replacing REINFORCE with the reparameterization trick kingma-vae. Citing significant difficulty in training these solutions, jain2020 have proposed to train the selector and predictor separately, by first training a separate model on the end task, and using an unfaithful highlight interpretation method on this model to select a highlight, then training the predictor on highlighted examples. We refer to this system by its given name as FRESH. Outside of NLP, DBLP:conf/icml/ChenSWJ18 describe select-predict models for feature-vector inputs trained to maximize mutual information between the selected features and the response variable.

3 Utility of Highlights

We have described multiple possible approaches to faithful highlight interpretations. Is it enough, however, for highlights to be faithful—adhering to the select-predict composition—in order to be useful as indicators of the model’s decision process?

To answer this question, we must first discuss potential uses of the technology. Throughout this work, we will refer to the following use-cases:

  1. Dispute: A user may want to dispute a model’s decision, e.g. in a legal setting. A user can dispute either the selector or the predictor: a user may refer to words that were not selected in the highlight, and say: “the model ignored information A, but it shouldn’t have.” The user can also refer to the words that were fed to the predictor, and say: “based on this highlighted text, I would have expected a different outcome.”

  2. Debug: Highlights allow a designation of model errors into two categories: did the model focus on the wrong part of the input, or did the model make the wrong prediction based on the correct part of the input? Based on the category, specific action may be taken to alleviate the specific error: influence the model to focus on the correct part of the input, or influence the model to make better judgements when the focus is correct.

  3. Advice: When the user does not know the correct decision and is advised by a model, the user can be advised in two ways: (1) assuming that the user trusts the model, the highlight provides feedback on which part is important to make a decision; (2) if the user does not entirely trust the model, we assume that the user has a prior notion on attributes of the highlight selection process. Depending on whether the model highlight is aligned with this prior, the user can opt to trust or not trust the model’s decision. For example, the user may distrust a decision whose highlight is focused on punctuation and stop words, if the user believes that the highlight should include content words.

4 Limitations of Highlights as (Faithful) Interpretations

We detail a variety of strange and surprising failures where select-predict models are uninformative to the above use-cases, as evidence of an apparent weakness in the formalization of faithfulness as a property of explanations. The failures are a possible risk in all current solutions of highlight interpretations.

4.1 Trojan Explanations

Task information can manifest in the highlighted text in exceedingly unintuitive ways, where the highlight is faithful, but functionally useless to the intended use-cases. For example, consider the following case of a faithfully highlighted decision process:

  1. The selector makes a prediction $\hat{y}$, then selects an arbitrary highlight $h$ that encodes $\hat{y}$.

  2. The predictor then extracts $h$ from the highlighted text $h \odot x$ and decodes $\hat{y}$ from $h$.

It is easy to see why the highlight becomes useless: the meaning that the highlight holds to the user is completely misaligned with the true role of the highlight in the decision process. E.g., in the advice use-case, the seemingly random highlighted text will cause the user to distrust the prediction of the model, despite the model quite realistically making informed and correct decisions based on well-generalizing reasoning.
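A toy version of this decision process is sketched below, under stated assumptions. The encoding scheme (the position of the single highlighted token equals the label index), the `<mask>` placeholder, and the `hidden_classifier` stand-in for the selector's internal decision are all invented for illustration; trained models would arrive at far less legible encodings.

```python
from typing import List, Sequence

PAD = "<mask>"  # assumed placeholder for hidden tokens
LABELS = ["negative", "positive"]


def hidden_classifier(tokens: Sequence[str]) -> str:
    """Stand-in for the selector's internal decision (hypothetical rule)."""
    return "positive" if "nice" in tokens else "negative"


def trojan_selector(tokens: Sequence[str]) -> List[int]:
    """Decide first, then highlight only the token whose position is the label index."""
    label_index = LABELS.index(hidden_classifier(tokens))
    return [1 if i == label_index else 0 for i in range(len(tokens))]


def apply_highlight(tokens: Sequence[str], mask: Sequence[int]) -> List[str]:
    return [tok if keep else PAD for tok, keep in zip(tokens, mask)]


def trojan_predictor(highlighted: Sequence[str]) -> str:
    """Ignore the words entirely; decode the label from where the highlight sits."""
    position = next(i for i, tok in enumerate(highlighted) if tok != PAD)
    return LABELS[position]


review = "this holder is nice".split()
highlighted = apply_highlight(review, trojan_selector(review))
prediction = trojan_predictor(highlighted)
```

Here the user sees a highlight over the word "holder", which says nothing about sentiment, even though the pipeline is faithful in the select-predict sense and its decision follows the hidden classifier exactly.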

Though this case may appear incredibly unnatural, it is nevertheless not explicitly avoided by faithful highlights of select-predict models or by the solutions available today. In other words, a highlight can still be regarded as perfectly faithful to a model that works in this way. As a result, there is no guarantee against highlights that encode task-relevant signal by themselves, without usage of the text itself. After all, no such guarantee was ever demanded.

It should not be particularly surprising, then, that this is actually the case: the above “unintentional” exploit of the modular process is a perfectly valid trajectory in the training process of the highlight interpretation methods available today. We are able to verify this by attempting to predict the model’s output based on the mask $h$ alone via another model (Table 1): although this experiment does not “prove” that the predictor utilizes this information, it shows that there is no guarantee that it doesn’t.

Model                             SST-2  AGNews  IMDB   Ev.Inf.  20News  Elec
Random baseline                   50.0   25.0    50.0   33.33    5.0     50.0
lei16                             59.7   41.4    69.11  33.45    -       60.75
bastings-etal-2019-interpretable  62.8   42.4    -      -        9.45    -
FRESH                             52.22  35.35   54.23  38.88    11.11   58.38
Table 1: The performance of an RNN classifier using the mask $h$ alone as input, in comparison to the random baseline. Missing cells (-) denote cases where we were unable to converge training.
Model                             AGNews  IMDB   20News  Elec
Full text baseline                41.39   51.22  8.91    56.41
lei16                             46.83   57.83  -       60.4
bastings-etal-2019-interpretable  47.69   -      9.66    -
FRESH                             43.29   52.46  12.53   57.7
Table 2: The performance of a classifier using quantities of the following tokens in the highlighted text $h \odot x$: comma, period, dash, escape, ampersand, brackets, and star; as well as the quantity of capital letters and [symbol lost in extraction]. Missing cells (-) are cases where we were unable to converge training.


The above is an example of a phenomenon we term the Trojan explanation: the explanation (in our case, the highlighted text $h \odot x$) carries information which is encoded in ways that are unintuitive, difficult to discover or otherwise unintended by the user who observes the interpretation as an explanation of model behavior. In the above case, this information is the prediction label, and the “unintuitive” encoding was manifested using the mask $h$ alone, but of course the issue is not limited to this encoding. The information encoded in the highlighted text can be anything that is useful for prediction and that the user will find hard to comprehend.

Model Text and Highlight Prediction
(a) i really don’t have much to say about this book holder, not that it’s just a book holder. it’s a nice one. it does it’s job . it’s a little too expensive for just a piece of plastic. it’s strong, sturdy, and it’s big enough, even for those massive heavy textbooks, like the calculus ones. although, i would not recommend putting a dictionary or reference that’s like 6” thick (even though it still may hold). it’s got little clamps at the bottom to prevent the page from flipping all over the place, although those tend to fall off when you move them. but that’s no big deal. just put them back on. this book holder is kind of big, and i would not put it on a small desk in the middle of a classroom, but it’s not too big. you should be able to put it almost anywhere when studying on your own time. Positive
(b) i really don’t have much to say about this book holder, not that it’s just a book holder. it’s a nice one. it does it’s job . it’s a little too expensive for just a piece of plastic. it’s strong, sturdy, and it’s big enough, even for those massive heavy textbooks, like the calculus ones. although, i would not recommend putting a dictionary or reference that’s like 6” thick (even though it still may hold). it’s got little clamps at the bottom to prevent the page from flipping all over the place, although those tend to fall off when you move them. but that’s no big deal. just put them back on. this book holder is kind of big, and i would not put it on a small desk in the middle of a classroom, but it’s not too big. you should be able to put it almost anywhere when studying on your own time. Positive
Table 3: Highlights faithfully attributed to two fictional select-predict models on an elaborate Amazon Reviews sentiment classification example. Although highlight (a) is easier to understand, it is also far less useful, as the selector clearly made hidden decisions.

Below are cases of Trojans which are reasonably general to multiple tasks and use-cases:

  1. Highlight signal: The label is encoded in the mask alone, requiring no information from the original text it is purported to focus on.

  2. Arbitrary token mapping: The label is encoded via some mapping from highlighted tokens to labels which is considered arbitrary to the user, such as commas for a particular class, and periods for another.

  3. Arbitrary statistics mapping: The label is encoded in arbitrary statistics of the highlighted text, such as quantity of capital letters, distance between commas, and so on.

  4. The default class: In a classification case, a class can be predicted by precluding the ability to predict all other classes and selecting it by default. As a result, the selector may decide that the absence of class features in itself defines one of the classes.

In Table 2 we report on an attempt to predict the decision (by training an MLP classifier) of select-predict models from quantities of seemingly irrelevant characters, such as commas and dashes, of the highlighted texts generated by the models. The results are compared against a baseline of attempting to predict the decisions based on the same statistics on the full text. Surprisingly, all models show an increased ability to predict their decisions on some level compared to the full text, showing that the highlights selected by these models carry additional information—which was not in the original input—that can be leveraged by the predictor to make decisions.
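The feature extraction behind such a probe might look like the sketch below. It is an assumption-laden illustration rather than the experimental code: it counts characters rather than tokens, reads "escape" as a backslash, and treats round, square and curly brackets as "brackets"; the function name `surface_statistics` is hypothetical. The resulting counts would then be fed to a classifier (an MLP in the experiment above), which is omitted here.

```python
from collections import Counter
from typing import Dict

# Assumed mapping from tracked characters to feature names.
TRACKED = {
    ",": "comma",
    ".": "period",
    "-": "dash",
    "\\": "escape",  # assumption: "escape" refers to the backslash character
    "&": "ampersand",
    "*": "star",
}
BRACKETS = set("()[]{}")  # assumption: all three bracket families are counted


def surface_statistics(highlighted_text: str) -> Dict[str, int]:
    """Count seemingly irrelevant surface characters in a highlighted text."""
    counts = Counter(highlighted_text)
    features = {name: counts[ch] for ch, name in TRACKED.items()}
    features["brackets"] = sum(counts[ch] for ch in BRACKETS)
    features["capitals"] = sum(1 for ch in highlighted_text if ch.isupper())
    return features


features = surface_statistics("Hello, World. (a-b) & c*")
```

Computing the same statistics on the full text gives the baseline features; the comparison of interest is whether the highlighted-text features predict the model's decision better than the full-text ones.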

We stress that these Trojan explanations are not merely possible, but just as reasonable to the model as any other option, and difficult to counter explicitly, as we have not yet truly formalized the extent of what is regarded as a Trojan, or why it is undesirable at all [2], and possibilities of Trojans are conceptually limitless. Indeed, Trojan explanations have been observed in practice [3].

[2] It can be argued that for some tasks, it is reasonable for the mask $h$ to encode label information, e.g., when particular labels are more often attributed by words in the beginning of the text, compared to others. Similar conclusions apply to other types of Trojans. How could we clarify when a property is or is not a Trojan for a given task?

[3] By us and others. Although presented in a very different context, sanjay2020 do in fact show cases of Trojan explanations in practice for a different class of compositional models.

A note about FRESH.

The FRESH jain2020 solution to faithful highlights, via training the selector and predictor separately, may seem to be a natural solution to Trojan explanations, since the two models are unable to communicate during training. We note that even in the FRESH composition, the selector was trained on the end task to make decisions before deriving a highlight. As such, it is quite possible for the selector to disguise a Trojan “enemy” in the highlighted text.

4.2 The Dominant Selector

SST-2 SST-3 SST-5 LGD AG News IMDB Ev. Inf. MultiRC Movies Beer
lei16 22.65 7.09 9.85 33.33 22.23 36.59 31.43 160.0 37.93
bastings-etal-2019-interpretable 3.31 0 2.97 201.02 199.02 12.63 85.19 75.0 13.64
FRESH 90.0 17.82 13.45 418.69 50.0 14.66 9.76 0.0 20.0
Table 4: The percentage increase in error of selector-predictor highlight methods compared to an equivalent architecture model which was trained to classify complete text. E.g., 10% means that the error is 1.1 times (110% of) that of the full-text equivalent of the same architecture. We prioritize the numbers reported in previous work whenever possible, and preferably the original papers of each methodology (italics means our results). Architectures are not necessarily consistent across different cells in the table, and thus the numbers do not imply superiority in absolute performance relative to the “best” architectures. The highlight lengths for each experiment follow precedent whenever possible, and are otherwise set to 20%, following jain2020.
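As a worked reading of the metric in Table 4, the sketch below computes the percentage increase in error from two accuracies. It assumes that error is measured as one minus accuracy, which may not hold for every dataset in the table; the function name `error_increase_percent` is hypothetical.

```python
def error_increase_percent(full_text_accuracy: float, highlight_accuracy: float) -> float:
    """Percentage increase in error of a highlight model over its full-text equivalent.

    Assumes error = 1 - accuracy (an assumption made for this illustration).
    """
    full_error = 1.0 - full_text_accuracy
    highlight_error = 1.0 - highlight_accuracy
    if full_error <= 0.0:
        raise ValueError("the full-text model must have non-zero error")
    return 100.0 * (highlight_error - full_error) / full_error


# A full-text model at 90% accuracy and a highlight model at 89% accuracy:
# the error grows from 0.10 to 0.11, i.e. an increase of about 10%.
increase = error_increase_percent(0.90, 0.89)
```

Under this reading, a value of 0 means the highlight model matches the full-text model, and a value above 100 means its error more than doubled.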

The Trojan explanation example serves to illustrate that the select-predict process is not always representative of the expectation that a human will have when given a highlight interpretation of this process. Unfortunately, our troubles are not limited to Trojan explanations, for it is possible to represent unintuitive decision processes through faithful highlights in other ways.

To illustrate this point, consider the case of the binary sentiment analysis task, where the model predicts the polarity of a particular snippet of text. Assume that two fictional selector-predictor models attempt to classify a complex, mixed-polarity review of a product. Table 3 describes two possible highlights faithfully attributed to these two models, on an example selected from the Amazon Reviews sentiment classification dataset.

Although the models made the same decision, the implied reasoning process is wildly different thanks to the different highlights chosen by the selector. In model (a), the selector clearly performed the entirety of the decision process. Had selector (a) chosen a negative-sentiment word such as “expensive”, the predictor would reasonably predict a negative sentiment for this example. In other words, for model (a), the selector dictates the entire decision process, even if the highlight is not a “Trojan” by any means.

Comparatively, the selector of model (b) performed a far simpler job, highlighting a selective summary of the review, with both positive and negative sentiments. The predictor then made a decision based on the information. How the predictor made the decision based on this highlighted text is unclear, but the division of roles between the selector and predictor is more intuitive to the observer than in the case of model (a).

Let us focus on the dispute use-case. Given a claim such as “the sturdiness of the product was not important to the final decision”, the claim appears safer in the case of model (b) than it is in the case of model (a), since in model (a), it is difficult to say how the selector arrived at the selected highlight—and thus, difficult to dispute a decision following the claim. There is an implicit understanding of the highlight of model (b) as a descriptor of a decision process that supports this claim.

To clarify, this failure case is not limited to the selector reducing the input to a trivially solved snippet (such as selecting single words strongly associated with a class, per the example). Rather, the selector is capable of dictating the final decision with power not intended for it: even an entirely random predictor can be manipulated to perform well by a crafty selector. But why is this considered problematic, if it remains within the confines of faithful highlights?
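The claim about a random predictor can be made concrete with a small sketch. The predictor below is a fixed but meaningless function (a hash of the highlighted text), and the selector simply searches small highlights until the predictor emits the label the selector wants. All names, the `<mask>` placeholder and the brute-force search are assumptions made for illustration; a trained selector would learn such steering implicitly rather than by enumeration.

```python
import hashlib
from itertools import combinations
from typing import List, Optional, Sequence

PAD = "<mask>"  # assumed placeholder for hidden tokens
LABELS = ["negative", "positive"]


def apply_highlight(tokens: Sequence[str], mask: Sequence[int]) -> List[str]:
    return [tok if keep else PAD for tok, keep in zip(tokens, mask)]


def arbitrary_predictor(highlighted: Sequence[str]) -> str:
    """A fixed, meaningless predictor: the label is a hash of the highlighted text."""
    digest = hashlib.sha256(" ".join(highlighted).encode("utf-8")).digest()
    return LABELS[digest[0] % len(LABELS)]


def dominant_selector(
    tokens: Sequence[str], desired_label: str, max_size: int = 3
) -> Optional[List[int]]:
    """Search small highlights until the predictor outputs the selector's own decision."""
    n = len(tokens)
    for size in range(1, max_size + 1):
        for kept in combinations(range(n), size):
            mask = [1 if i in kept else 0 for i in range(n)]
            if arbitrary_predictor(apply_highlight(tokens, mask)) == desired_label:
                return mask
    return None  # no steering highlight found within the size budget


review = "it is a nice one but a little too expensive".split()
steering_masks = {label: dominant_selector(review, label) for label in LABELS}
```

Whichever label the selector settles on internally, it can usually find a short highlight that makes this predictor agree, so the composed system is exactly as accurate as the selector while the predictor contributes nothing.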

4.3 Loss of Performance

The selector-predictor format implies a loss of performance in comparison to models that classify the full available text in “one” opaque step—refer to Table 4 for examples on the degree of decrease in performance on various classification datasets. In many cases, this decrease is severe. Although this phenomenon is treated as a reasonable and matter-of-fact necessity by literature on highlight interpretations and rationales, the intuition behind it is not trivial to us.

Naturally, humans are able to provide highlights of decisions without any loss of accuracy, even when the selector and predictor are separate people jain2020. In addition, while interpretability may be prioritized over state-of-the-art performance in certain use-cases, there are also cases that will disallow the implementation of artificial models unless they are strong and interpretable, simultaneously [4].

[4] For example, consider the case of a doctor or patient seeking life-saving advice: it is difficult to quantify a trade-off between performance and explanation ability.

Often, sufficiency is deemed to be a desideratum of highlights: the highlighted text should be sufficient to make an informed judgement yu2019comprehensiveness-sufficiency; eraser2019 towards the correct label. Under the selector-predictor setup, the highlighted text is at least guaranteed to be sufficient to predict the same result as the model.

In this context, the following question is critical to the design of highlight models: why do models that provide highlight interpretations surrender performance to do so? Is this phenomenon necessary, or desirable? And why?

This question can be re-interpreted in the following way: in the causal chain of events behind the decision process for a given task, which comes first: the prediction, or the highlight selection? Is the answer constant, or contextual?

Consider the case of agreement classification, where the task is to classify whether a given snippet contains grammatically disagreeing words. In this case, most would agree that it is impossible to provide a sufficient highlight without making a decision first. However, for example, when deciding the polarity of a review snippet in a sentiment classification task, it may be natural to first make a selection of relevant phrases before finally making a decision based on them. Can we say that it is appropriate to surrender performance in order to provide highlight interpretations in either case, in both, or in neither?

4.4 Conclusion

The described failure cases shed light on a missing step in the derivation of interpretations that are useful, despite adhering faithfully to a select-predict model. We conclude that faithfulness is insufficient to formalize the desired properties of a behavioral explanation behind the decision of an artificial intelligence model.

5 Plausibility is Not the Answer

Following the failure cases described in Section 4, it may be theorized that plausibility is a desirable, or even necessary, condition for useful highlights and interpretations in general. After all, Trojan explanations are by default implausible. However, we argue that this is far from the case.

Plausibility wiegreffe2019attentionisnotnot or persuasiveness DBLP:journals/corr/abs-1711-07414 is the property of how convincing the interpretation is towards the model prediction, regardless of whether the model was correct or not, and regardless of whether the interpretation is faithful or not. It is a property inspired by human-given explanations, which are post-hoc stories generated to plausibly justify our actions rudin2019stop. Plausibility is generally quantified by the degree that the model’s highlights resemble gold-annotated highlights given by humans bastings-etal-2019-interpretable or by querying for the feedback of humans directly jain2020.

Supposing that faithfulness has been “achieved” (unclear as that condition may be), plausibility is still without utility unless this plausibility is correlated with the likelihood of the model to be correct jacovi2020. Although it is possible to quantify this correlation via Human-Computer Interaction (HCI) assignments (e.g., 10.1145/3301275.3302265), it remains irrelevant to other use-cases of interpretability, since model correctness is intractable.

The failures discussed above stem not from how convincing the interpretation is, but from how well the user understands the reasoning process of the model. If the user is able to comprehend the steps that the model has taken to its decision, then the user will be the one to decide whether these steps are plausible or not, based on how closely these steps fit the user’s prior knowledge on whatever correct steps should be taken—whether the user knows the correct answer or not, and whether the model is correct or not.

For example, in Table 3, we are not interested in whether the highlighted text is plausible as an explanation to the decision (in which case, model (a) will likely be deemed superior), but rather in which highlight implies a more coherent decision process that the user can comprehend. It is the latter property, rather than the former, that will allow the model to be useful in any of the use-cases of highlight interpretations, such as dispute or debug.

This means that the important attribute of model (b)’s highlight is not that it may be a closer match to a gold-annotated human highlight; but that the roles of the selector and predictor resemble those of a human decision maker. The difference is that even if the model made a “wrong” decision, the latter attribute will stay valid, unlike the former.

6 On Faithfulness, Plausibility, and Explainability from the Science of Human Explanations

Clearly, the mathematical foundation of machine learning and natural language processing is insufficient to tackle the underlying issue behind the painful symptoms described in Section 4. In fact, formalizing the problem itself is difficult. What enables a faithful explanation to be understood as accurate to the model? And what causes an explanation to be perceived as a Trojan?

Although work by the community in this direction is well placed, it often neglects to draw upon the extremely vast library of work on the science of human explanation. As a result, some of the work re-invents the wheel, and we have yet to arrive at a satisfactory formalization. We will attempt to better understand the problem by looking to the social sciences, and particularly philosophical research on human explanations of (human) behavior, to assist us [5].

[5] Refer to miller2017social for a substantial survey in this area, which was especially motivating to us.

6.1 The Composition of Explanations

miller2017social describes explanations of behavior as a social interaction of knowledge transfer between the explainer and the explainee, and thus they are contextual, and can be perceived differently depending on this context. Two central pillars of the explanation are causal attribution, the attribution of a causal chain [6] of events to the behavior, and social attribution, the attribution of behavior to others heider58.

[6] See hilton05 for a breakdown of types of causal chains; we focus on unfolding chains in this work, but others may be relevant as well.

Causal attribution describes faithfulness.

We note a stark parallel between causal attribution and faithfulness, as it is understood by the NLP community: for example, the select-predict composition of modules defines an unfolding causal chain where the selector hides portions of the input, causing the predictor to make a decision based on the remaining portions.

Social attribution is missing.

heider44 describe an experiment where participants attribute human concepts of emotion, intentionality and behavior to animated shapes. Clearly, the same phenomenon persists when humans attempt to understand the predictions of artificial models. What is the behavior attributed to the select-predict causal chain? And can models be constrained to adhere to this attribution?

Albeit informally, prior work on highlights has considered such factors. lei16 describe desiderata for highlights as being “short and consecutive”, and jain2020 interpreted “short” as “around the same length as that of human-annotated highlights”. We assert that the true nature of these claims is an attempt to constrain highlights to the social behavior implicitly attributed to them by human observers in the select-predict paradigm. We discuss this further in the next section.

Plausibility as an incentive of the explainer, and not as a property of the explanation.

The utility of human explanations can be categorized across multiple axes miller2017social, among them (1) learning a better internal model for future decisions and calculations lombrozo2006structure; williams2013hazards; (2) examination to verify the explainer has a correct internal prediction model; (3) teaching to modify the internal model of the explainer towards a more correct one (can be seen as the opposite end of (1)); (4) assignment of blame to a component of the internal model; and finally, (5) justification and persuasion. (Footnote 7: Although (1) and (3) are considered one-and-the-same in the social sciences, we disentangle them, as that is only the case when the explainer and explainee are both human.)

Critically, the goal of justification and persuasion by the explainer may not necessarily be the goal of the explainee. Indeed, in the case of AI explainability, it is not a goal of the explainee to be persuaded that the decision is correct (even when it is), but to understand the decision process. If plausibility is a goal of the artificial model, this perspective outlines a game-theoretic mismatch of incentives between the two players. Specifically, in cases where the model is incorrect, the distinction is that between an innocent mistake and an intentional lie, and lying is of course considered the more unethical of the two. As a result, modeling and pursuing plausibility in AI explanations is an ethical issue.

7 Aligned Faithfulness

We have covered the separation of causal attribution to the model from attribution of social behavior to it. In human explanations, this separation is formulated as a multi-step process in the transfer of knowledge from one person to the other. Unique to artificial explainers is a problem where there is a misalignment between the causal chain behind the decision, and the social attribution to it, as the (artificial) decision process does not necessarily resemble human behavior.

We claim that this problem is the root cause behind the symptoms described in Section 4. In this section we define the general desiderata that interpretations must satisfy to avoid this issue, independently of the narrative of highlight interpretations.

7.1 Definition

By presenting to the user the causal pipeline of decisions in the model decision process as an interpretation of the decision process, the user naturally conjures social intent behind this pipeline. This intent is formalized as a set of constraints on the possible range of decisions at each step in the causal chain, which the model must adhere to in order to be considered as comprehensible to the user.

For an interpretation method to accurately describe the causal chain of decisions in a model decision, we say that it is faithful; and having accomplished that, for the method and model to adhere to the social attribution of behavior by humans, we say that the interpretation is aligned, short for human-aligned, or “aligned with human expectations”. Furthermore, we claim that this attribution of behavior is heavily dependent on the task and use-case of the model, and that it is incompatible with plausibility.

8 Social Attribution of Highlight Interpretations

As previously mentioned, unique to our setting in NLP and ML is the fact that the social attribution must lead the design of the causal chain, since we have control over one and not the other. In other words, we must first identify the behavior expected of the decision process, and constrain the decision process around it.

We arrive at two parallel attributions of human behavior to highlights, where each carries separate and distinct constraints on whether the highlight can describe behavior aligned with human intent.

Highlights as summaries.

The highlight serves as an extractive summary of the input text, filtering irrelevant information, making it easier to focus on the most important portions of the input. (Footnote 8: The definition of a summary and the utility of summarization in tasks yu2013-summaries is beyond the scope of this work.) To illustrate, recall a student who is marking sentences and paragraphs in a book to make studying for a test easier. The highlight is merely considered a compressed version of the input, with sufficient information to make informed decisions in the future. It is not selected with an answer in mind, but in anticipation that an answer will be derived in the future, for a question that may not have even been asked yet.

Highlights as evidence.

Another name for highlight explanations in the NLP community is extractive rationales, or rationalization, due to this human characterization of the artificial explainer: the highlight serves as evidence towards a prior decision. The decision process behind the prior decision must consider the highlight as supporting evidence of the decision, whether the highlight is sufficient, comprehensive, or neither.

8.1 Issues with Select-Predict

Unfortunately, the select-predict causal chain supports neither attribution:

  1. A summarizing selector is expected to filter out information which is irrelevant to making an informed decision, without making any decision in doing so. This is not guaranteed by black-box selectors which were explicitly trained on the end-task.

  2. An evidencing or rationalizing selector should select the part of the input that supports the final decision without influencing this decision. This is clearly not the case in select-predict models.

The issues discussed in Section 4 are direct results of the above conflation of interests. Trojan highlights and the dominant selector are a result of a selector that makes hidden and unintended decisions, so they serve as neither summary nor evidence towards the predictor’s decision. Loss of performance is due to the selector acting as an imperfect summarizer.
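To make the notion of a selector with hidden decisions concrete, the sketch below constructs a deliberately pathological select-predict pair. It is a hypothetical toy, not a model from this work: the selector decides the label itself and encodes it in the position of the highlight, and the predictor merely decodes that position.

```python
MASK = "[MASK]"  # hypothetical placeholder for hidden tokens


def apply_highlight(tokens, highlight):
    """Hide every token that the highlight does not select."""
    return [tok if keep else MASK for tok, keep in zip(tokens, highlight)]


def trojan_selector(tokens):
    """Makes the decision itself, then encodes it in WHICH position
    is highlighted (first token = positive, last token = negative)."""
    is_positive = "great" in tokens  # the hidden, unintended decision
    highlight = [False] * len(tokens)
    highlight[0 if is_positive else -1] = True
    return highlight


def decoding_predictor(visible):
    """Ignores token content and reads the label off the highlight position."""
    return "positive" if visible[0] != MASK else "negative"


def select_predict(tokens):
    highlight = trojan_selector(tokens)
    return highlight, decoding_predictor(apply_highlight(tokens, highlight))


pos_highlight, pos_label = select_predict("the plot was great".split())
neg_highlight, neg_label = select_predict("the plot was awful".split())
```

Both predictions are correct and the chain is faithful in the select-predict sense, yet the positive example highlights only the word "the", which serves as neither a summary of the input nor evidence for the decision.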

8.2 Predict-Select-Verify

We focus on highlights as evidence: we propose the predict-select-verify causal chain as a solution that can be constrained to provide highlights as evidence. The decision pipeline is as follows:

  1. The predictor makes a prediction on the full text.

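  2. The selector selects a highlight such that the predictor's prediction on the highlighted text equals its prediction on the full text.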

In this framework, the selector provides evidence which is verified to be useful to the predictor towards a particular decision. Importantly, the final decision has been made on the full text, and the selector is constrained to provide a highlight that adheres to this exact decision. The selector does not purport to provide a highlight which is comprehensive of all evidence considered by the predictor, but it provides a guarantee that the highlighted text is sufficient for this prediction.

It is trivial to see that the predict-select-verify chain addresses all points of Section 4: Trojan highlights and dominant selectors are impossible, as the selector is constrained to only provide “retroactive” selections towards a specific, previously decided prediction. In other words, the selector has no power to influence the decision, since the decision was already made without its intervention. Loss of performance is impossible as the predictor is not constrained to make predictions on possibly insufficient subsets of the input (as it would be under a summarizing selector).
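The predict-select-verify chain can be sketched as a short pipeline. This is a minimal illustration under stated assumptions: the function names, the `MASK` placeholder, the toy predictor and selector, and the choice to pass the prediction to the selector are all hypothetical and not taken from this work.

```python
MASK = "[MASK]"  # hypothetical placeholder for text outside the highlight

POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "awful"}


def apply_highlight(tokens, highlight):
    """Hide every token that the highlight does not select."""
    return [tok if keep else MASK for tok, keep in zip(tokens, highlight)]


def toy_predictor(tokens):
    """Toy classifier: counts cue words; masked tokens count for nothing."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative"


def toy_selector(tokens, prediction):
    """Toy selector: highlights the cue words that agree with the prediction."""
    cue_words = POSITIVE if prediction == "positive" else NEGATIVE
    return [t in cue_words for t in tokens]


def predict_select_verify(tokens, predictor, selector):
    # 1. Predict: the decision is made on the full text.
    prediction = predictor(list(tokens))
    # 2. Select: the highlight is chosen after the decision exists.
    highlight = selector(tokens, prediction)
    # 3. Verify: the highlighted text alone must yield the same decision.
    if predictor(apply_highlight(tokens, highlight)) != prediction:
        raise ValueError("highlight is not sufficient for the prediction")
    return prediction, highlight


tokens = "good acting but bad bad script".split()
prediction, highlight = predict_select_verify(tokens, toy_predictor, toy_selector)
```

Because the returned prediction is always the one made on the full text, a selector in this sketch can at worst fail verification; it cannot change the decision.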

Text and Highlight Label
Ohio Sues Best Buy, Alleging Used Sales (AP): AP - Ohio authorities sued Best Buy Co. Inc. on Thursday, alleging the electronics retailer engaged in unfair and deceptive business practices. Business Business Science/Tech
HK Disneyland Theme Park to Open in September: Hong Kong’s Disneyland theme park will open on Sept. 12, 2005 and become the driving force for growth in the city’s tourism industry, Hong Kong’s government and Walt Disney Co. Business Business World
Poor? Who’s poor? Poverty is down: The proportion of people living on less than $1 a day decreased from 40 to 21 percent of the global population between 1981 and 2001, says the World Bank’s latest annual report. World World Business
Poor? Who’s poor? Poverty is down: The proportion of people living on less than $1 a day decreased from 40 to 21 percent of the global population between 1981 and 2001, says the World Bank’s latest annual report. World World Business
Poor? Who’s poor? Poverty is down: The proportion of people living on less than $1 a day decreased from 40 to 21 percent of the global population between 1981 and 2001, says the World Bank’s latest annual report. World World Business
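Table 5: Examples of contrastive highlights (§ 9) of instances from the AG News corpus. The model used for the predictor is fine-tuned bert-base-cased. Yellow highlights refer to the user-provided foil highlight, and yellow-and-red highlights refer to the contrastive highlight that contains it.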

9 Constructing a Predict-Select-Verify Model with Contrastive Explanations

In order to design a model adhering to the aforementioned constraints, we require solutions for the predictor and for the selector.


The predictor is constrained to be able to accept both full-text inputs and highlighted inputs. For this reason, we use masked language modeling (MLM) models, such as BERT devlin2018bert, fine-tuned on the downstream task. The MLM pre-training is performed by recovering partially masked text, which conveniently suits our needs. We additionally provide randomly-highlighted inputs to the model during fine-tuning.
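One way to realize highlighted inputs and the random-highlight augmentation is sketched below. The details are assumptions made for illustration: we assume text outside the highlight is replaced by the mask token, that each training example gets one randomly highlighted copy, and that each token is kept with probability `keep_prob`; none of these specifics are stated in the text.

```python
import random

MASK_TOKEN = "[MASK]"  # mask token string of the standard BERT vocabulary


def mask_outside_highlight(tokens, highlight):
    """Replace every non-highlighted token with the mask token."""
    return [tok if keep else MASK_TOKEN for tok, keep in zip(tokens, highlight)]


def random_highlight(length, keep_prob, rng):
    """Sample an independent keep/hide decision per token (an assumption)."""
    return [rng.random() < keep_prob for _ in range(length)]


def augment_with_random_highlights(examples, keep_prob=0.5, seed=0):
    """Return each (tokens, label) example followed by one randomly
    highlighted copy of it. Labels are left unchanged."""
    rng = random.Random(seed)
    augmented = []
    for tokens, label in examples:
        augmented.append((list(tokens), label))
        highlight = random_highlight(len(tokens), keep_prob, rng)
        augmented.append((mask_outside_highlight(tokens, highlight), label))
    return augmented
```

The augmented list can then be tokenized and fed to any fine-tuning loop, so that the predictor sees both full and partially masked inputs for the same label.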


The selector is constrained to select highlights for which the predictor made the same decision as it did on the full text. However, there are likely many possible choices that the selector may make under this constraint, as there are many possible highlights that all result in the same decision by the predictor. We wish for the selector to select meaningful evidence for the predictor’s decision. What is meaningful evidence? To answer this question, we again refer to cognitive science on necessary attributes of explanations that are easy to comprehend by humans.

9.1 Fact and Foil

An especially relevant finding of social science literature is of contrastive explanations, following the notion that the question “why $P$?” may be followed by an addendum: “why $P$, rather than $Q$?” hilton1988logic. We refer to $P$ as the fact and $Q$ as the foil lipton1990contrastive. The prevailing assessment in the community is that in the vast majority of cases, the cognitive burden of a “complete” explanation, i.e. where $Q$ is $\neg P$, is too great, and thus $Q$ is selected as a subset of all possible foils hilton1986knowledge; Hesslow1988, and often not explicitly, but implicitly derived from context.

In classification tasks, the implication is that an interpretation of a prediction of a specific class is hard to understand, and should be contextualized by the preference of the class over another—and the selection of the foil (the non-predicted class) is non-trivial, and a subject of ongoing discussion even in human explanations literature.

Contrastive explanations have many implications for explanations in AI, and particularly for highlights, as a vehicle for explanations that are easy to understand. Although there is a modest body of work on contrastive explanations in machine learning Dhurandhar2018ExplanationsBO; Chakraborti2019ContrastiveAF; Chen2020TowardsTR, to our knowledge, there are none in NLP.

9.2 Contrastive Highlights

An explanation in a classification setting should not only address the fact, but do so against a foil, where the fact is the class predicted by the model, and the foil is another class. Given two such classes, a fact and a foil, we are interested in deriving a contrastive highlight explanation towards the question: “why did you choose the fact, rather than the foil?” (Footnote 9: Selecting the foil, or selecting what to explain, is a difficult and interesting problem even in philosophical literature Hesslow1988; mcgill93; chin2017contrastive. In the classification setting, it is relatively simple, as we may request the foil (class) from the user, or provide separate contrastive explanations for each foil.)

We assume a scenario where, having observed the input, the user is aware of some highlight which should serve, they believe, as evidence for the foil class. In other words, we assume the user believes that a pipeline where the selector selects this highlight and the predictor then predicts the foil is reasonable. (Footnote 10: In this work, we assume both the foil and its highlight are provided by the user. Additional strategies are possible, such as deriving candidates for the highlight (heuristically or otherwise) given the foil, and allowing the user to choose and modify the candidates.)

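If the predictor's decision on the user's highlight is not the foil, then the user is made aware that the predictor disagrees that this highlight serves as evidence for the foil.

Otherwise, the predictor's decision on the user's highlight is the foil. We define the contrastive highlight as the minimal highlight containing the user's highlight such that the predictor's decision on it is the fact. Intuitively, the claim by the model is as such: “Because of the text added to the user's highlight, I consider the contrastive highlight as evidence towards the fact despite the user's highlight.”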
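The procedure above amounts to a search for the smallest highlight that contains the user's highlight and restores the predictor's original decision. The sketch below uses exhaustive search over supersets by increasing size, which is only practical for short inputs; the search strategy, the names, and the toy keyword predictor are all assumptions for illustration and are not specified in the text.

```python
from itertools import combinations

MASK = "[MASK]"  # hypothetical placeholder for text outside the highlight

BUSINESS = {"retailer", "sued", "sales"}
TECH = {"electronics"}


def apply_highlight(tokens, highlight_idx):
    """Keep only the tokens whose positions are in the highlight."""
    return [tok if i in highlight_idx else MASK for i, tok in enumerate(tokens)]


def toy_predictor(tokens):
    """Toy topic classifier based on keyword counts."""
    business = sum(t in BUSINESS for t in tokens)
    tech = sum(t in TECH for t in tokens)
    return "Science/Tech" if 2 * tech > business else "Business"


def contrastive_highlight(tokens, predictor, foil, foil_highlight):
    """Return the sorted positions of a minimal highlight that contains
    `foil_highlight` and on which the predictor outputs the fact, or None
    if the predictor does not accept `foil_highlight` as evidence for the foil."""
    fact = predictor(list(tokens))
    if fact == foil:
        raise ValueError("the foil must differ from the predicted class")
    foil_highlight = frozenset(foil_highlight)
    if predictor(apply_highlight(tokens, foil_highlight)) != foil:
        return None  # the predictor disagrees with the user's evidence
    rest = [i for i in range(len(tokens)) if i not in foil_highlight]
    # Grow the highlight by as few positions as possible.
    for extra in range(1, len(rest) + 1):
        for added in combinations(rest, extra):
            candidate = foil_highlight | set(added)
            if predictor(apply_highlight(tokens, candidate)) == fact:
                return sorted(candidate)
    raise RuntimeError("unreachable: the full text always yields the fact")


tokens = ["retailer", "sued", "over", "electronics", "sales"]
result = contrastive_highlight(tokens, toy_predictor, "Science/Tech", {3})
```

In this toy example the user points at "electronics" as evidence for Science/Tech, and the search returns the smallest extension of that highlight on which the predictor goes back to Business. Ties between equally small highlights are broken by enumeration order, which is an arbitrary choice of this sketch.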

We show examples of an implementation of this process in Table 5 on examples from the AG News dataset. For illustration purposes, we selected incorrectly-classified examples, and selected the foil to be the true label of the example. The highlight for the foil was chosen by us.

We stress that while the examples presented in these figures appear reasonable, the true goal of this method is not to provide highlights that seem justified, but to provide a framework which allows models to be meaningfully incorporated in use-cases of dispute, debug, and advice, with robust and proven guarantees of behavior.

For example, in each of the example use-cases:

  1. Dispute: The user intends to verify whether the model “correctly” considered a specific portion of the input when making the decision. I.e., the model made the fact decision, where the user believes the foil decision should have been made, supported by the user's highlight as evidence. If the predictor's decision on the user's highlight is not the foil, it is possible to dispute the claim that the model interpreted the highlight with “correct” evidence intent. Otherwise the dispute claim cannot be made, as the model provably considered the user's highlight as evidence for the foil, yet insufficiently so when combined with the additional text of the contrastive highlight.

  2. Debug: Assuming the model's decision is incorrect, the user intends to perform error analysis by observing which portion of the input is sufficient to steer the predictor away from the correct decision (the foil). This is provided by the contrastive highlight.

  3. Advice: When the user is unaware of the answer, and is seeking perspective from a trustworthy model, the user is given explicit feedback on which part of the input the model “believes” is sufficient to overturn the signal in the user's highlight towards the foil. Otherwise, if the model is not considered trustworthy, the user may gain or reduce trust by observing whether the model's decision and the contrastive highlight align with user priors.

10 Discussion

Causal attribution of heat-maps.

Recent work on the faithfulness of attention heat-maps Baan2019DoTA; Pruthi2019LearningTD; jain2019attentionisnot; sofia-isattentioninterpretable; wiegreffe2019attentionisnotnot or saliency distributions alvarez2018robustness; kindermans2017unreliability-of-saliency casts doubt on their faithfulness as indicators of the significance of parts of the input (to a model decision). We argue that this is a natural conclusion from the fact that, as a community, we have not envisioned an appropriate causal chain that utilizes heat-maps in the decision process, reinforcing the claims in this work on the parallel between causal attribution and faithfulness.

Inter-disciplinary research.

Research on explanations in artificial intelligence will benefit from a deeper inter-disciplinary perspective on two axes: (1) literature on causality and causal attribution; and (2) literature on the social perception and attribution of human-like behavior to causal chains of decision or behavior.

10.1 Related Work

Relevant to the issue of how the explanation is comprehended by people is simulatability kim2017interpretability, or the degree to which humans can simulate model decisions. hase2020simulatability further refines simulatability evaluation by defining a testing regimen where it can be properly assessed without false signal. While quantifying simulatability may be related to aligned faithfulness in some way, they are decidedly different, since e.g., simulatability will not necessarily detect dominant selectors (Section 4), such as those which reduce the input to a trivial instance of a prior task prediction.

Predict-select-verify is reminiscent of iterative erasure as described by feng2018pathologies. By iteratively removing “significant” tokens in the input, the authors discovered that a surprisingly small portion of the input could be interpreted as evidence for the model to make the prediction, leading to conclusions on the pathological nature of neural models and sensitivity to badly-structured inputs. This experiment retroactively serves as a successful use-case of examination and debugging using our described formulation.
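For intuition, a greedy reduction loop in the spirit of iterative erasure can be sketched as follows. This is not the procedure of feng2018pathologies: their token-importance scoring is not reproduced here, and this sketch simply deletes any token whose removal leaves the prediction unchanged. The toy predictor is likewise hypothetical.

```python
def reduce_input(tokens, predictor):
    """Greedily delete tokens while the prediction stays the same.
    Returns the remaining tokens, in their original order."""
    original = predictor(list(tokens))
    kept = list(tokens)
    changed = True
    while changed and len(kept) > 1:
        changed = False
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]
            if predictor(candidate) == original:
                kept = candidate  # this token was not needed for the decision
                changed = True
                break
    return kept


def toy_predictor(tokens):
    """Toy sentiment classifier keyed on a single cue word."""
    return "positive" if "great" in tokens else "negative"


reduced = reduce_input("the plot was great".split(), toy_predictor)
```

In this toy case the input shrinks to the single cue word, which mirrors the observation that a surprisingly small portion of the input can suffice for the prediction.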

Kottur2017NaturalLD; sanjay2020 describe failure cases of Trojans where communicating modules encode information unintuitively, despite an interpretable interface between them. sanjay2020 additionally infer a particular social attribution on the causal chain of their proposed model for tasks of compositional reasoning with neural modular networks, with the motivation of overcoming the Trojan issue, though the work is not formalized as such.

The approach by DBLP:conf/nips/ChangZYJ19 for class-wise highlights is reminiscent of contrastive highlights, but nevertheless distinct, since such highlights still explain a fact against all possible foils.

11 Conclusion

Highlights are a popular format for explanations of decisions on textual inputs, relatively unique in that there are models available today with the ability to derive highlights faithfully. We analyze highlights as a guiding use-case in the pursuit of more rigorous formalization of what makes a quality explanation of an artificial intelligence model.

We redefine faithfulness as the degree to which the interpretation of causal events accurately represents the causal chain of decision making towards the decision. Next, we define aligned faithfulness as the degree to which the various decisions in the causal chain are constrained by the social attribution of behavior that humans expect from the interpretation. The two steps of causal attribution and social attribution together complete the process of explaining the decision process of the model to the human observer.

Using this formalization, we characterize various failures in faithful highlights that “seemed” strange, but could not be properly described previously, noting they are not properly constrained by their social attribution as summaries. We propose an alternative causal chain which can be constrained by the attribution of highlights as evidence. Finally, we illustrate a possible implementation of this model with practical utility of disputing, debugging or being advised by model decisions, by formalizing contrastive explanations in the highlight format.


Appendix A Experimental Setup

A.1 Datasets

We elaborate on the experiments used to illustrate the points raised in Section 4. For the most part, these tasks follow the precedent of relevant literature on highlights and rationales. The datasets use prior train/validation/test set splits when available. When only train/test were available, 20% of the train set was held out for validation. We elaborate below when no splits are available. Some of these datasets were not used by us explicitly, but we mention them here as we include performance reports on them (provided in previous papers).
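The 20% hold-out can be sketched as below. The text does not say how the held-out examples were chosen, so the random shuffle, the fixed seed, and the function name are assumptions.

```python
import random


def hold_out_validation(train, fraction=0.2, seed=0):
    """Split a training set into (new_train, validation), holding out
    `fraction` of the examples chosen uniformly at random."""
    rng = random.Random(seed)
    indices = list(range(len(train)))
    rng.shuffle(indices)
    n_val = int(round(len(train) * fraction))
    val_idx = set(indices[:n_val])
    new_train = [ex for i, ex in enumerate(train) if i not in val_idx]
    validation = [ex for i, ex in enumerate(train) if i in val_idx]
    return new_train, validation
```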

Stanford Sentiment Treebank (SST-2/3/5) DBLP:conf/emnlp/SocherPWCMNP13

Sentences or snippets of movie reviews with 2, 3 or 5-bucket polarity.

AG’s News Corpus (AG News) DBLP:conf/www/CorszoGR05

News articles categorized into four topics of Science/Tech, Sports, Business, World.

Evidential Inference (Ev.Inf.) lehman-etal-2019-inferring

Based on biomedical articles of controlled trials, and a given intervention, the task is to infer the outcome of the experiment as a result of the intervention (among significantly increased, significantly decreased, and no significant change), and give supporting portions of the text towards this conclusion. We use the article abstracts, and do not evaluate the extracted supporting snippet.

Movies DBLP:conf/emnlp/ZaidanE08

Movie reviews labeled for binary polarity. Human-annotated highlights were collected for this dataset by eraser2019.

MultiRC DBLP:conf/naacl/KhashabiCRUR18

Binary classification of questions, answers and supporting snippets, into true and false labels. The binary setup was formulated by eraser2019.

LGD linzen2016assessing

A collection of template text sentences, each with two options for an agreeing or disagreeing pair of words, for evaluation on the syntactic ability of models. We converted this dataset into a binary classification dataset by requiring prediction on whether a given sentence contains a disagreement or whether all words are in agreement. This dataset is unique in that the choice of highlight is strict and punishing if not “correct”, providing a strong challenge to the selector in select-predict models. We have employed a train/dev/test split of 50%/25%/25% and enforced no overlap in sentences between the splits.
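A split with no sentence overlap can be obtained by partitioning sentence identifiers rather than individual examples, so that both variants of a template sentence land in the same split. The sketch below assumes each example carries a hypothetical `sentence_id` as its first field; the proportions are applied to sentences, so the example-level proportions are only approximately 50%/25%/25%.

```python
import random
from collections import defaultdict


def split_by_sentence(examples, seed=0):
    """examples: list of (sentence_id, text, label) tuples.
    Returns a dict with 'train', 'dev' and 'test' lists such that
    no sentence_id appears in more than one split."""
    groups = defaultdict(list)
    for example in examples:
        groups[example[0]].append(example)

    sentence_ids = sorted(groups)
    random.Random(seed).shuffle(sentence_ids)

    n_train = len(sentence_ids) // 2
    n_dev = len(sentence_ids) // 4
    parts = {
        "train": sentence_ids[:n_train],
        "dev": sentence_ids[n_train:n_train + n_dev],
        "test": sentence_ids[n_train + n_dev:],
    }
    return {
        name: [ex for sid in ids for ex in groups[sid]]
        for name, ids in parts.items()
    }
```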

20 Newsgroups (20 News) DBLP:conf/icml/Lang95; DBLP:journals/ml/NigamMTM00

A collection of newsgroup documents partitioned evenly across twenty categories.

Amazon Reviews DBLP:conf/recsys/McAuleyL13 and Elec DBLP:conf/naacl/Johnson015

Amazon product reviews with binary polarity based on user star rating. Elec refers to electronic products only.

Beer DBLP:conf/icdm/McAuleyLJ12

A collection of BeerAdvocate reviews with polarity scores on four aspects of appearance, smell, palate and taste.