Can I Trust the Explainer? Verifying Post-hoc Explanatory Methods

by Oana-Maria Camburu, et al.

For AI systems to garner widespread public acceptance, we must develop methods capable of explaining the decisions of black-box models such as neural networks. In this work, we identify two issues of current explanatory methods. First, we show that two prevalent perspectives on explanations---feature-additivity and feature-selection---lead to fundamentally different instance-wise explanations. In the literature, explainers from different perspectives are currently being directly compared, despite their distinct explanation goals. The second issue is that current post-hoc explainers have only been thoroughly validated on simple models, such as linear regression, and, when applied to real-world neural networks, explainers are commonly evaluated under the assumption that the learned models behave reasonably. However, neural networks often rely on unreasonable correlations, even when producing correct decisions. We introduce a verification framework for explanatory methods under the feature-selection perspective. Our framework is based on a non-trivial neural network architecture trained on a real-world task, and for which we are able to provide guarantees on its inner workings. We validate the efficacy of our evaluation by showing the failure modes of current explainers. We aim for this framework to provide a publicly available, off-the-shelf evaluation when the feature-selection perspective on explanations is needed.




1 Introduction

A large number of explanatory methods have recently been developed with the goal of shedding light on highly accurate, yet black-box machine learning models 

(lime; shap; lrp; deeplift; lime2; lime3; maple; l2x). Among these methods, there are currently at least two widely used perspectives on explanations: feature-additivity (lime; shap; deeplift; lrp) and feature-selection (l2x; anchors; what-made-you-do-this), which we describe in detail in the sections below. While both shed light on the overall behavior of a model, we show that, when it comes to explaining the prediction on a single input in isolation, i.e., instance-wise explanations, the two perspectives lead to fundamentally different explanations. In practice, explanatory methods adhering to different perspectives are being directly compared. For example, l2x and invase compare L2X, a feature-selection explainer, with LIME (lime) and SHAP (shap), two feature-additivity explainers. We draw attention to the fact that these comparisons may not be coherent, given the fundamentally different explanation targets, and we discuss the strengths and limitations of the two perspectives.

Secondly, while current explanatory methods are successful in pointing out catastrophic biases, such as relying on headers to discriminate between pieces of text about Christianity and atheism (lime), it is an open question to what extent they are reliable when the model that they aim to explain (which we call the target model) has a less dramatic bias. This is a difficult task, precisely because the ground-truth decision-making process of neural networks is not known. Consequently, explainers are often evaluated under the assumption that the target model behaves reasonably, i.e., that the model did not rely on an irrelevant correlation. For example, in their morphosyntactic agreement paradigm, nina assume that a model that predicts whether a verb should be singular or plural given the tokens before the verb must be doing so by focusing on a noun that the model had identified as the subject. Such assumptions may be poor, since recent works show a series of surprising spurious correlations in human-annotated datasets, on which neural networks learn to heavily rely (artifacts; breaking; behaviour). Therefore, it is not reliable to penalize an explainer for pointing to tokens that merely do not appear significant to us.

We address the lack of thorough evaluation frameworks for explainers on real-world neural networks by proposing an evaluation framework under the feature-selection perspective that we instantiate on three (target model, dataset) pairs, on the task of multi-aspect sentiment analysis. Each pair corresponds to an aspect, and the three models (of the same architecture) have been trained independently. Given a pair, for each instance in the dataset, the specific architecture of our model allows us to identify a subset of tokens that have zero contribution to the model's prediction on the instance. We further identify a subset of tokens clearly relevant to the prediction. Hence, we test whether explainers erroneously rank zero-contribution tokens higher than clearly relevant tokens. We highlight that our test is not a sufficient test for establishing the power of explainers in full generality, since we do not know the whole ground-truth behavior of the target models. It is nonetheless a necessary evaluation test, since our metrics penalize explainers only when we are able to guarantee that they produced an error. To our knowledge, we are the first to introduce an automatic and non-trivial neural-network-based evaluation test that does not rely on speculations about the behavior of the target model.

Finally, we evaluate L2X (l2x), a feature-selection explainer, under our test. Although our test is specifically designed for feature-selection explanatory methods, the two types of explainers are being compared in practice, and LIME (lime) and SHAP (shap) are two very popular explainers, so we were also interested in how the latter perform on our test, even though they adhere to the feature-additivity perspective. Interestingly, we find that, most of the time, LIME and SHAP perform better than L2X. We detail in Section 5 the reasons why we believe this is the case. We provide the error rates of these explanatory methods to raise awareness of their possible modes of failure under the feature-selection perspective of explanations. For example, our findings show that, in certain cases, the explainers predict the most relevant token to be among the tokens with zero contribution. We will release our test, which can be used off-the-shelf, and encourage the community to use it for testing future work on explanatory methods under the feature-selection perspective. We also note that our methodology for creating this evaluation is generic and can be instantiated on other tasks or areas of research.

2 Related Work

The most common instance-wise explanatory methods are feature-based, i.e., they explain a prediction in terms of the input unit-features (e.g., tokens for text and super-pixels for images). Among the feature-based explainers, there are two major types of explanations: (i) feature-additive: provide signed weights for each input feature, proportional to the contributions of the features to the model’s prediction (lime; shap; deeplift; lrp), and (ii) feature-selective: provide a (potentially ranked) subset of features responsible for the prediction (l2x; anchors; what-made-you-do-this). We discuss these explanatory methods in more detail in Section 3. Other types of explainers are (iii) example-based (ex-based): identify the most relevant instances in the training set that influenced the model’s prediction on the current input, and (iv) natural language explainers (esnli; zeynep; cars): design models capable of generating full-sentence justifications for their own predictions. In this work, we focus on verifying feature-based explainers, since they represent the majority of current works.

While many explainers have been proposed, it is still an open question how to thoroughly validate their faithfulness to the target model. There are four types of evaluations commonly performed:

  1. Interpretable target models. Typically, explainers are tested on linear regression and decision trees (e.g., LIME (lime)) or support vector representations (e.g., MAPLE (maple)). While this evaluation accurately assesses the faithfulness of the explainer to the target model, these very simple models may not be representative of the large and intricate neural networks used in practice.

  2. Synthetic setups. Another popular evaluation setup is to create synthetic tasks where the set of important features is controlled. For example, L2X (l2x) was evaluated on four synthetic tasks: 2-dim XOR, orange skin, nonlinear additive model, and switch feature. While there is no limit on the complexity of the target models trained on these setups, their synthetic nature may still prompt the target models to learn simpler functions than the ones needed for real-world applications. This, in turn, may ease the job for the explainers.

  3. Assuming a reasonable behavior. In this setup, one identifies certain intuitive heuristics that a high-performing target model is assumed to follow. For example, in sentiment analysis, the model is supposed to rely on adjectives and adverbs in agreement with the predicted sentiment. Crowd-sourced evaluation is often performed to assess whether the features produced by the explainer are in agreement with the model's prediction (shap; l2x). However, neural networks may discover surprising artifacts (artifacts) to rely on, even when they obtain a high accuracy. Hence, this evaluation is not reliable for assessing the faithfulness of the explainer to the target model.

  4. Are explanations helping humans to predict the model's behaviour? In this evaluation, humans are presented with a series of predictions of a model and explanations from different explainers, and are asked to infer the predictions (outputs) that the model will make on a separate set of examples. One concludes that an explainer A is better than an explainer B if humans are consistently better at predicting the output of the model after seeing explanations from A than after seeing explanations from B (anchors). While this framework is a good proxy for evaluating the real-world usage of explanations, it is expensive and requires considerable human effort if it is to be applied to complex real-world neural network models.

In contrast to the above, our evaluation is fully automatic, and the target model is a non-trivial neural network trained on a real-world task for which we provide guarantees on its inner workings. Our framework is similar in scope to the sanity check introduced by sanity. However, their test filters for the basic requirement that any explainer should provide different explanations for the model trained on real data than when the data and/or model are randomized. Our test is therefore more challenging and requires a stronger fidelity of the explainer to the target model.

3 Instance-wise Explanations

As mentioned before, current explanatory methods adhere to two major perspectives of explanations:

Perspective 1 (Feature-additivity): For a model $f$ and an instance $\mathbf{x}$ with features $x_1, \dots, x_n$, the explanation of the prediction $f(\mathbf{x})$ consists of a contribution $w_i$ for each feature $x_i$ such that the sum of the contributions of the features in $\mathbf{x}$ approximates $f(\mathbf{x})$, i.e., $\sum_{i=1}^{n} w_i \approx f(\mathbf{x})$.

Many explanatory methods adhere to this perspective (lrp; deeplift; lime). For example, LIME (lime) learns the weights $w_i$ via a linear regression on the neighborhood (explained below) of the instance. shap unified this class of methods by showing that the only set of feature-additive contributions that verify three desired constraints (local accuracy, missingness, and consistency—we refer to their paper for details) are given by the Shapley values from game theory:

$$w_i = \sum_{S \subseteq \mathbf{x} \setminus \{x_i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left[ f(S \cup \{x_i\}) - f(S) \right],$$

where the sum enumerates over all subsets $S$ of features in $\mathbf{x}$ that do not include the feature $x_i$, and $|\cdot|$ denotes the number of features of its argument.

Thus, the contribution of each feature in the instance is an average of its contributions over a neighborhood of the instance. Usually, this neighborhood consists of all the perturbations given by masking out combinations of features in ; see, e.g., (lime; shap). However, neighbourhood show that the choice of the neighborhood is critical, and it is an open question what neighborhood is best to use in practice.

Perspective 2 (Feature-selection): For a model $f$ and an instance $\mathbf{x}$, the explanation of $f(\mathbf{x})$ consists of a sufficient (ideally small) subset $S$ of (potentially ranked) features of $\mathbf{x}$ that alone lead to (almost) the same prediction as the original one, i.e., $f(S) \approx f(\mathbf{x})$.

l2x; what-made-you-do-this, and anchors adhere to this perspective. For example, L2X (l2x) learns the subset $S$ by maximizing the mutual information between $S$ and the prediction. However, it assumes that the number of important features per instance, i.e., $|S|$, is known, which is usually not the case in practice. A downside of this perspective is that it may not always be true that the model relied only on a (small) subset of features, as opposed to using all the features. However, this can be the case for certain tasks, such as sentiment analysis.

To better understand the differences between the two perspectives, in Figure 1, we provide the instance-wise explanations that each perspective aims to provide for a hypothetical sentiment analysis regression model M, where 0 is the most negative and 1 the most positive score. We note that our hypothetical model is not far, in behaviour, from what real-world neural networks learn, especially given the notorious biases in the datasets. For example, artifacts show that natural language inference neural networks trained on SNLI (snli) may heavily rely on the presence of a few specific tokens in the input, which should not even be, in general, indicators for the correct target class, e.g., "outdoors" for the entailment class, "tall" for the neutral class, and "sleeping" for the contradiction class.

    M: if "very good" in input: return 0.9;
       elif "nice" in input: return 0.7;
       elif "good" in input: return 0.6;
       else: return 0.

    x1: "The movie was good, it was actually nice."
    x2: "The movie was nice, in fact, it was very good."

    x1 - Feature-additivity: nice: 0.4; good: 0.3; rest of tokens: 0. Feature-selection: {nice}.
    x2 - Feature-additivity: good: 0.417; nice: 0.367; very: 0.116; rest of tokens: 0. Feature-selection: {good, very}.

Figure 1: Examples on which the two perspectives give different instance-wise explanations.

In our examples in Figure 1, we clearly see the differences between the two perspectives. For the first instance, x1, the feature-additive explanation tells us that "nice" was the most relevant feature, with a weight of 0.4, but also that "good" had a significant contribution of 0.3. While for this instance alone, our model relied only on "nice" to provide a positive score of 0.7, it is also true that, if "nice" were not present, the model would have relied on "good" to provide a score of 0.6. Thus, we see that the feature-additive perspective aims to provide an average explanation of the model on a neighborhood of the instance, while the feature-selective perspective aims to tell us the pointwise features used by the model on the instance in isolation, such as "nice" for x1.

An even more pronounced difference between the two perspectives is visible on the second instance, x2, where the ranking of the features differs. The feature-additive explanation ranks "good" and "nice" as the two most important features, while on the instance in isolation, the model relied on the tokens "very" and "good", which the feature-selection perspective aims to provide.
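The feature-additive weights above can be checked by brute force. The following minimal sketch (our own code, not the authors'; it views the hypothetical model M of Figure 1 as a set function over the three sentiment-bearing tokens, since all other tokens contribute nothing) computes exact Shapley values via the formula of Section 3 and reproduces the weights reported for the second instance:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, players):
    """Exact Shapley values by enumerating all subsets (feasible for few players)."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = set(subset)
                # weight = |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (f(s | {p}) - f(s))
        values[p] = total
    return values

def M(tokens):
    """The hypothetical model M of Figure 1, as a set function over tokens."""
    if {"very", "good"} <= tokens:
        return 0.9
    if "nice" in tokens:
        return 0.7
    if "good" in tokens:
        return 0.6
    return 0.0

sv = shapley_values(M, ["very", "good", "nice"])
# good ≈ 0.417, nice ≈ 0.367, very ≈ 0.117 (the figure truncates the last to 0.116)
```

By local accuracy, the three weights sum to M(x2) = 0.9, matching the figure.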

Therefore, we see that, while both perspectives of explanations give insights into the model's behavior, one perspective might be preferred over the other in different real-world use-cases. For example, in tasks where one takes individual decisions (e.g., credit allowance or health diagnostics), one may prefer the feature-selection perspective in order to know only the features used for the instance itself and not on average over a neighborhood. On the other hand, for tasks where one may want to know what the model would do in the neighborhood of the instance (e.g., would M resort to the token "good" to still provide a positive sentiment if "nice" were not present in the instance?), one may prefer the feature-additivity perspective. In the rest of the paper, we propose a verification framework for the feature-selection perspective of instance-wise explanations.
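The feature-selection explanations of Figure 1 can likewise be recovered by exhaustive search for a smallest sufficient subset. A minimal sketch (our own code; the set-function view of M and the function names are our assumptions):

```python
from itertools import combinations

def M(tokens):
    """The hypothetical model M of Figure 1, as a set function over tokens."""
    if {"very", "good"} <= tokens:
        return 0.9
    if "nice" in tokens:
        return 0.7
    if "good" in tokens:
        return 0.6
    return 0.0

def smallest_sufficient_subset(f, features, tol=1e-9):
    """Return a smallest subset S of the features with f(S) ~= f(all features)."""
    target = f(set(features))
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            if abs(f(set(subset)) - target) <= tol:
                return set(subset)
    return set(features)

x1 = ["the", "movie", "was", "good", "it", "was", "actually", "nice"]
x2 = ["the", "movie", "was", "nice", "in", "fact", "it", "was", "very", "good"]
# {"nice"} for x1, and {"very", "good"} for x2, as in Figure 1
```

This exponential search is only illustrative; explainers such as L2X approximate it by learning a selection distribution instead.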

4 Our Verification Framework

Our proposed verification framework leverages the architecture of the RCNN model introduced by rcnn. We further prune the original dataset on which the RCNN had been trained to ensure that, for each datapoint, there exists a set of tokens with zero contribution (irrelevant features) and a set of tokens with a significant contribution (clearly relevant features) to the RCNN's prediction on that datapoint. We further introduce a set of metrics that measure how explainers fail to rank the irrelevant tokens lower than the clearly relevant ones. We describe each of these steps in detail below.


The RCNN (rcnn) consists of two modules: a generator followed by an encoder, both instantiated with recurrent convolutional neural networks (lei15). The generator is a bidirectional network that takes as input a piece of text $\mathbf{x}$ and, for each of its tokens, outputs the parameter of a Bernoulli distribution. According to this distribution, the RCNN selects a subset of tokens from $\mathbf{x}$, which we denote $\mathrm{sel}(\mathbf{x})$, and passes it to the encoder $\mathrm{enc}$, which makes the final prediction solely as a function of $\mathrm{sel}(\mathbf{x})$. Thus:

$$\mathrm{RCNN}(\mathbf{x}) = \mathrm{enc}(\mathrm{sel}(\mathbf{x})). \quad (2)$$
There is no direct supervision on the subset selection, and the generator and encoder were trained jointly, with supervision only on the final prediction. The authors also used two regularizers: one to encourage the generator to select a short sub-phrase, rather than disconnected tokens, and a second to encourage the selection of fewer tokens. At training time, to circumvent the non-differentiability introduced by the intermediate sampling, the gradients for the generator were estimated using a REINFORCE-style procedure.


    if "very good" in input: select "very" & return 1;
    elif "not good" in input: select "not" & return 0.1;
    elif "good" in input: select "good" & return 0.8;
    else: select nothing & return 0.5.

Figure 2: Example of handshake.

This intermediate hard selection facilitates the existence of tokens that have no contribution to the final prediction. While rcnn aimed for the selected tokens to be a sufficient rationale for each prediction, the model might have learned an internal (emergent) communication protocol (communication) that encodes information from the non-selected tokens via the selected tokens, which we call a handshake. For example, the RCNN could learn a handshake such as the one in Figure 2, where the feature "good" is important in all three cases, but not selected in the first two.

Eliminating handshakes.

Our goal is to gather a dataset such that, for every instance $\mathbf{x}$ in it, the set of non-selected tokens, which we denote $\mathrm{nonsel}(\mathbf{x})$, has zero contribution to the RCNN's prediction on $\mathbf{x}$. Equivalently, we want to eliminate instances that contain handshakes. We show that:

$$\mathrm{RCNN}(\mathrm{sel}(\mathbf{x})) = \mathrm{RCNN}(\mathbf{x}) \implies \text{no handshake in } \mathbf{x}. \quad (7)$$

The proof is in Appendix B. On our example in Figure 2, on the instance "The movie was very good.", the model selects "very" and predicts a score of 1. However, if we input the instance consisting of just "very", the model will not select anything (our experiments show that the RCNN is capable of not selecting anything and providing its "bias" score as prediction; this happened, for example, when we inputted sentences completely irrelevant to the task) and would return a score of 0.5. Thus, Equation 7 indeed captures the handshake in this example. From now on, we refer to non-selected tokens as irrelevant or zero-contribution interchangeably.
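Assuming selection and prediction are deterministic at inference time, the handshake check can be sketched on the toy model of Figure 2 (our own code; the criterion retains an instance only if re-running the model on its own selection reproduces that selection):

```python
def toy_rcnn(text):
    """Toy generator + encoder following the rules of Figure 2.
    Returns (selected tokens, score)."""
    if "very good" in text:
        return ["very"], 1.0
    if "not good" in text:
        return ["not"], 0.1
    if "good" in text:
        return ["good"], 0.8
    return [], 0.5  # nothing selected: the model falls back to its "bias" score

def has_potential_handshake(text):
    """Flag the instance unless re-running the model on its own selection
    reproduces that selection."""
    sel_x, _ = toy_rcnn(text)
    sel_sel_x, _ = toy_rcnn(" ".join(sel_x))
    return sel_sel_x != sel_x

# "the movie was very good" selects only "very" (score 1.0), but re-running on
# "very" alone selects nothing and returns 0.5, so the instance is flagged.
```

On "the movie was good", the selection {"good"} is stable under re-running, so that instance is kept.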

On the other hand, we note that $\mathrm{sel}(\mathrm{sel}(\mathbf{x})) \neq \mathrm{sel}(\mathbf{x})$ does not necessarily imply that there was a handshake. There might be tokens (e.g., "the" or "a" at the ends of the selection sequence(s)) that were selected in the original instance and that become non-selected in the instance formed by $\mathrm{sel}(\mathbf{x})$, without significantly changing the actual prediction. However, since it would be difficult to differentiate between such a case and an actual handshake, we simply prune the dataset by retaining only the instances for which $\mathrm{sel}(\mathrm{sel}(\mathbf{x})) = \mathrm{sel}(\mathbf{x})$.

At least one clearly relevant feature.

With our pruning above, we ensured that the non-selected tokens have no contribution to the prediction. However, we are not yet sure that all the selected tokens are relevant to the prediction. In fact, it is possible that some tokens (such as "the" or "a") are actually noise, but have been selected only to ensure that the selection is a contiguous sequence (as we mentioned, the RCNN was penalized during training for selecting disconnected tokens). Since we do not want to penalize explainers for not differentiating between noise and zero-contribution features, we further prune the dataset such that there exists at least one selected token which is, without any doubt, clearly relevant for the prediction. To ensure that a given selected token $t$ is clearly relevant, we check that, when removing $t$, the absolute change in prediction with respect to the original prediction is higher than a significant threshold $q$. Precisely, if for the selected token $t$ we have that $|\mathrm{RCNN}(\mathbf{x}) - \mathrm{RCNN}(\mathbf{x} \setminus \{t\})| \geq q$, then the selected token $t$ is clearly relevant for the prediction.

Figure 3: Partition of the features in our instances.

Thus, we have further partitioned the selected tokens into $\mathrm{CR}(\mathbf{x}) \cup \mathrm{SDK}(\mathbf{x})$, where $\mathrm{CR}(\mathbf{x})$ are the clearly relevant tokens, and $\mathrm{SDK}(\mathbf{x})$ are the rest of the selected tokens, for which we do not know if they are relevant or noise (SDK stands for "selected don't know"). We see a diagram of this partition in Figure 3. We highlight that simply because a selected token alone did not produce a change in prediction higher than the threshold does not mean that this token is not relevant, as it may be essential in combination with other tokens. Our procedure only ensures that the tokens that individually change the prediction by a given (high) threshold are indeed important and should therefore be ranked higher than any of the non-selected tokens, which have zero contribution. We thus further prune the dataset to retain only the datapoints for which $\mathrm{CR}(\mathbf{x}) \neq \emptyset$, i.e., there is at least one clearly relevant token per instance.
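The clearly-relevant test (remove one selected token at a time and compare the change in prediction against the threshold) can be sketched as follows, with a generic predict function standing in for the RCNN; the toy rules and names are our own assumptions, not the authors' code:

```python
def toy_predict(tokens):
    """Stand-in for the RCNN's prediction, using the rules of Figure 2."""
    text = " ".join(tokens)
    if "very good" in text:
        return 1.0
    if "not good" in text:
        return 0.1
    if "good" in text:
        return 0.8
    return 0.5

def clearly_relevant_tokens(predict, tokens, selected, q):
    """Selected tokens whose individual removal changes the prediction by at
    least q. (Removal is by value, which also drops duplicates; acceptable
    for a sketch.)"""
    base = predict(tokens)
    return [t for t in selected
            if abs(predict([tok for tok in tokens if tok != t]) - base) >= q]
```

For example, on "the movie was good" with selection {"good"} and q = 0.1, removing "good" moves the prediction from 0.8 to the bias score 0.5, so "good" is clearly relevant, while removing "the" changes nothing.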

Evaluation metrics.

First, we note that our procedure does not provide an explainer in itself, since we do not give an actual ranking, nor any contribution weights, and it is possible for some of the tokens in $\mathrm{SDK}(\mathbf{x})$ (selected, unknown relevance) to be even more important than tokens in $\mathrm{CR}(\mathbf{x})$ (clearly relevant). However, we guarantee the following two properties:

P1: All tokens in $\mathrm{nonsel}(\mathbf{x})$ have to be ranked lower than any token in $\mathrm{CR}(\mathbf{x})$.

P2: The most important token has to be among the selected tokens.

We evaluate explainers that provide a ranking over the features. We denote by $r_{\mathbf{x}}$ the ranking (in decreasing order of importance) given by an explainer on the features in the instance $\mathbf{x}$, by $r_{\mathbf{x}}(k)$ the token ranked $k$-th, and by $r_{\mathbf{x}}^{-1}(t)$ the rank of token $t$. Under our two properties above, we define the following error metrics over our dataset $\mathcal{D}$:

  1. Percentage of instances for which the most important token provided by the explainer is among the non-selected tokens:

$$\%\_\mathrm{first} = \frac{100}{|\mathcal{D}|} \sum_{\mathbf{x} \in \mathcal{D}} \mathbb{1}\big[\, r_{\mathbf{x}}(1) \in \mathrm{nonsel}(\mathbf{x}) \,\big],$$

     where $\mathbb{1}[\cdot]$ is the indicator function.

  2. Percentage of instances for which at least one non-selected token is ranked higher than a clearly relevant token:

$$\%\_\mathrm{misrnk} = \frac{100}{|\mathcal{D}|} \sum_{\mathbf{x} \in \mathcal{D}} \mathbb{1}\big[\, \exists\, t \in \mathrm{nonsel}(\mathbf{x}) : r_{\mathbf{x}}^{-1}(t) < m_{\mathbf{x}} \,\big].$$

  3. Average number of non-selected tokens ranked higher than any clearly relevant token:

$$\mathrm{avg}\_\mathrm{misrnk} = \frac{1}{|\mathcal{D}|} \sum_{\mathbf{x} \in \mathcal{D}} \big|\, \{ t \in \mathrm{nonsel}(\mathbf{x}) : r_{\mathbf{x}}^{-1}(t) < m_{\mathbf{x}} \} \,\big|,$$

     where $m_{\mathbf{x}} = \max_{t \in \mathrm{CR}(\mathbf{x})} r_{\mathbf{x}}^{-1}(t)$ is the lowest rank of the clearly relevant tokens.

Metric 1 shows the most dramatic failure: the percentage of times when the explainer tells us that the most relevant token is one of the zero contribution ones. Metric 2 shows the percentage of instances for which there is at least an error in the explanation. Finally, metric 3 quantifies the number of zero-contribution features that were ranked higher than any clearly relevant feature.
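The three metrics can be computed directly from an explainer's ranking together with the non-selected / clearly-relevant partition of each instance. A minimal sketch (the data layout and function name are our own choices):

```python
def explainer_error_metrics(instances):
    """Each instance is a dict with:
       'ranking': explainer's tokens, most important first
       'nonsel':  set of non-selected (zero-contribution) tokens
       'cr':      set of clearly relevant tokens
    Returns (%_first, %_misrnk, avg_misrnk)."""
    n = len(instances)
    first_errors = 0
    mis_counts = []
    for inst in instances:
        rank = {t: i for i, t in enumerate(inst["ranking"])}
        worst_cr = max(rank[t] for t in inst["cr"])  # lowest-ranked CR token
        mis_counts.append(sum(1 for t in inst["nonsel"] if rank[t] < worst_cr))
        first_errors += inst["ranking"][0] in inst["nonsel"]
    pct_first = 100.0 * first_errors / n
    pct_misrnk = 100.0 * sum(c > 0 for c in mis_counts) / n
    avg_misrnk = sum(mis_counts) / n
    return pct_first, pct_misrnk, avg_misrnk
```

For instance, over two instances where one ranking places a zero-contribution token first (above the only clearly relevant token) and the other ranking is flawless, all three metrics reflect exactly one erroneous instance.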

Model |        APPEARANCE              |          AROMA                 |          PALATE
      | %_first %_misrnk avg_misrnk    | %_first %_misrnk avg_misrnk    | %_first %_misrnk avg_misrnk
LIME  | 4.24    24.39    7.02 (24.12)  | 14.79   32.08    12.74 (33.54) | 2.92    13.93    3.48 (17.38)
SHAP  | 4.74    16.81    1.16 (7.75)   | 4.24    13.53    0.83 (7.10)   | 2.65    9.20     9.25 (9.70)
L2X   | 6.58    28.85    3.54 (12.66)  | 12.95   31.61    4.41 (16.25)  | 12.77   29.83    3.70 (13.05)

Table 1: Error rates of the explainers. The lower the values, the better the explainer. In parentheses are the standard deviations when averages are reported.

5 Results and Discussion

In this work, we instantiate our framework on the RCNN model trained on the BeerAdvocate corpus, on which the RCNN was initially evaluated (rcnn). BeerAdvocate consists of human-generated multi-aspect beer reviews, where the three considered aspects are appearance, aroma, and palate. The reviews are accompanied by fractional ratings, originally between 0 and 5 for each aspect independently. The RCNN is a regression model whose goal is to predict the rating, rescaled between 0 and 1 for simplicity. Three separate RCNNs are trained, one for each aspect independently, with the same default settings.

With the above procedure, we gathered three datasets, one for each aspect. For each dataset, we know that, for each instance, the set of non-selected tokens has zero contribution to the prediction of the model. For obtaining the clearly relevant tokens, we chose a strict threshold on the absolute change in prediction relative to the score range of [0, 1], so that a change in prediction above this threshold is clearly significant for this task.

We provide several statistics of our datasets in Appendix A. For example, we provide the average lengths of the reviews, of the selected tokens per review, of the clearly relevant tokens among the selected, and of the non-selected tokens. We note that we usually obtained only one or two clearly relevant tokens per datapoint, showing that our threshold is likely very strict. However, we prefer to be conservative in order to ensure high guarantees on our evaluation test. We also provide the percentages of datapoints eliminated in order to ensure the no-handshake condition (Equation 7).

Evaluating explainers.

We test three popular explainers: LIME (lime), SHAP (shap), and L2X (l2x). We used the code of the explainers as provided in the original repositories, with their default settings for text explanations, with the exception that, for L2X, we set the dimension of the word embeddings to the value used in the RCNN, and we allowed training for a larger maximum number of epochs than the default.

As mentioned in Section 3, LIME and SHAP adhere to the feature-additivity perspective, hence our evaluation does not directly target these explainers. However, we see in Table 1 that, in practice, LIME and SHAP outperformed L2X on the majority of the metrics, even though L2X is a feature-selection explainer. We hypothesize that a major limitation of L2X is the requirement to know the number of important features per instance. Indeed, L2X learns a distribution over the set of features by maximizing the mutual information between subsets of $k$ features and the response variable, where $k$ is assumed to be known. In practice, one usually does not know how many features per instance a model relied on. To test L2X under real-world circumstances, we set $k$, for each aspect, to the average number of tokens highlighted by human annotators on the subset of the corpus manually annotated by beer-annot.

In Table 1, we see that, on metric 1, all explainers are prone to stating that the most relevant feature is a token with zero contribution, as much as 14.79% of the time for LIME and 12.95% of the time for L2X in the aroma aspect. We consider this the most dramatic form of failure. Metric 2 shows that all explainers can rank at least one zero-contribution token higher than a clearly relevant feature, i.e., there is at least one mistake in the predicted ranking. Finally, metric 3 shows that, on average, SHAP places only about one zero-contribution token ahead of a clearly relevant token for the first two aspects, and around 9 tokens for the third aspect, while L2X places around 3-4 zero-contribution tokens ahead of a clearly relevant one for all three aspects.

Figure 4: Instance from the palate aspect in our evaluation.

Qualitative Analysis.

In Figure 4, we present an example from our dataset for the palate aspect; examples from the other aspects are in Appendix C. The heatmap corresponds to the ranking determined by each explainer, and the intensity of the color decreases linearly with the rank of the tokens. (While for LIME and SHAP we could have used the actual weights, for consistency, and since we evaluate the explainers only on their rankings, we keep the ranking-like heatmap. We also only show the top-ranked tokens in the heatmap, for visibility reasons.) Selected tokens are in bold, and the clearly relevant tokens are additionally underlined. Firstly, we notice that the explainers are prone to attributing importance to non-selected tokens, with LIME and SHAP even ranking the non-selected tokens "mouthfeel" and "lacing" as the first two (most important). Further, "gorgeous", the only clearly relevant word used by the model, did not even make it into the top 13 tokens for L2X. Instead, L2X gives "taste", "great", "mouthfeel", and "lacing" as the most important tokens. We note that if the explainer were evaluated by humans assuming that the RCNN behaves reasonably, then this choice would likely have been considered correct.

6 Conclusions and Future Work

In this work, we first shed light on an important distinction between two widely used perspectives of explanations. Secondly, we introduced an off-the-shelf evaluation test for post-hoc explanatory methods under the feature-selection perspective. To our knowledge, this is the first verification framework offering guarantees on the behavior of a non-trivial real-world neural network. We presented the error rates on different metrics for three popular explanatory methods to raise awareness of the types of failures that these explainers can produce, such as incorrectly predicting even the most relevant token. While instantiated on a natural language processing task, our methodology is generic and can be adapted to other tasks and other areas. For example, in computer vision, one could train a neural network that first makes a hard selection of super-pixels to retain, and subsequently makes a prediction based on the image where the non-selected super-pixels have been blurred. The same procedure of checking for zero contribution of non-selected super-pixels would then apply. We also point out that the majority of the current post-hoc explainers are not designed for testing models in a specific domain. For example, LIME, SHAP, and L2X were illustrated both on text and image classification tasks. Therefore, we expect our evaluation to provide a representative view of the fundamental limitations of the explainers. We hope that our work will inspire the community to design similar evaluations, thus increasing the faithfulness of the upcoming explainers to the target models that they aim to explain.


This work was supported by JP Morgan PhD Fellowship 2019-2020, EPSRC Studentship under grant EP/M508111/1, Oxford-DeepMind Graduate Scholarship, Alan Turing Institute under the EPSRC grant EP/N510129/1, EPSRC grant EP/R013667/1, and AXA Chair. We would also like to thank Jianbo Chen, Misha Denil, Çağlar Gülçehre, Tomáš Kočiský, Yishu Miao, Diederik Royers, Brendan Shillingford, Francesco Visin, and Ziyu Wang for valuable discussions and feedback.


Appendix A Statistics of our gathered datasets

We provide the statistics of our datasets in Table 2. N is the number of instances that we retain with our procedure, L is the average length of the reviews, and |sel(x)|, |CR(x)|, and |nonsel(x)| are the average numbers of selected tokens, clearly relevant tokens (selected tokens that give an absolute difference in prediction of at least the threshold when eliminated individually), and non-selected tokens, respectively. In parentheses are the standard deviations. The %handshake column provides the percentage of instances eliminated from the original BeerAdvocate dataset due to a potential handshake. Finally, %noCR shows the percentage of datapoints (out of the non-handshake ones) further eliminated due to the absence of a selected token with a sufficiently large individual effect on the prediction.

Aspect     | N     | L        | |sel(x)|     | |CR(x)|     | |nonsel(x)| | %handshake | %noCR
APPEARANCE | 20508 | 145 (79) | 16.9 (8.4)   | 1.33 (0.70) | 121 (56)    | 15.9       | 73.2
AROMA      | 7621  | 139 (74) | 11.15 (6.48) | 1.16 (0.49) | 123 (57)    | 72.0       | 58.0
PALATE     | 16494 | 153 (76) | 9.14 (5.38)  | 1.21 (0.55) | 137 (59)    | 39.2       | 66.5

Table 2: Statistics of our datasets for each aspect.

Appendix B Proof for no handshake condition

We show that:

$$\mathrm{RCNN}(\mathrm{sel}(\mathbf{x})) = \mathrm{RCNN}(\mathbf{x}) \implies \text{no handshake in } \mathbf{x}. \quad (7)$$

Proof: We note that if there is a handshake in the instance $\mathbf{x}$, i.e., at least one non-selected token is actually influencing the final prediction via an internal encoding of its information into the selected tokens, then the model should have a different prediction when the non-selected tokens are eliminated from the instance, i.e., $\mathrm{RCNN}(\mathbf{x} \setminus \mathrm{nonsel}(\mathbf{x})) \neq \mathrm{RCNN}(\mathbf{x})$. Equivalently, if $\mathrm{RCNN}(\mathbf{x} \setminus \mathrm{nonsel}(\mathbf{x})) = \mathrm{RCNN}(\mathbf{x})$, then no non-selected token could have been part of a handshake. Thus, if the RCNN gives the same prediction when eliminating all the non-selected tokens, it means that there was no handshake for the instance $\mathbf{x}$, and hence the tokens in $\mathrm{nonsel}(\mathbf{x})$ have indeed zero contribution. Hence, we have that:

$$\mathrm{RCNN}(\mathbf{x} \setminus \mathrm{nonsel}(\mathbf{x})) = \mathrm{RCNN}(\mathbf{x}) \implies \text{no handshake in } \mathbf{x}. \quad (8)$$

Since $\mathbf{x} \setminus \mathrm{nonsel}(\mathbf{x}) = \mathrm{sel}(\mathbf{x})$, Equation 8 rewrites as:

$$\mathrm{RCNN}(\mathrm{sel}(\mathbf{x})) = \mathrm{RCNN}(\mathbf{x}) \implies \text{no handshake in } \mathbf{x}, \quad (9)$$

which is exactly Equation 7. From Equation 2, we further rewrite the left-hand side of Equation 9 as:

$$\mathrm{enc}(\mathrm{sel}(\mathrm{sel}(\mathbf{x}))) = \mathrm{enc}(\mathrm{sel}(\mathbf{x})) \implies \text{no handshake in } \mathbf{x}. \quad (10)$$

Since, by definition, the encoder's prediction depends only on the tokens it receives, we have that:

$$\mathrm{sel}(\mathrm{sel}(\mathbf{x})) = \mathrm{sel}(\mathbf{x}) \implies \mathrm{enc}(\mathrm{sel}(\mathrm{sel}(\mathbf{x}))) = \mathrm{enc}(\mathrm{sel}(\mathbf{x})). \quad (11)$$

Hence, it is sufficient to have $\mathrm{sel}(\mathrm{sel}(\mathbf{x})) = \mathrm{sel}(\mathbf{x})$ in order to satisfy the left-hand-side condition of Equation 10, which finishes our proof.

Appendix C More examples from our evaluation

Figure 5: Instance from the appearance aspect.
Figure 6: Instance from the aroma aspect.