What made you do this? Understanding black-box decisions with sufficient input subsets

October 9, 2018 · Brandon Carter et al.

Local explanation frameworks aim to rationalize particular decisions made by a black-box prediction model. Existing techniques are often restricted to a specific type of predictor or based on input saliency, which may be undesirably sensitive to factors unrelated to the model's decision making process. We instead propose sufficient input subsets that identify minimal subsets of features whose observed values alone suffice for the same decision to be reached, even if all other input feature values are missing. General principles that globally govern a model's decision-making can also be revealed by searching for clusters of such input patterns across many data points. Our approach is conceptually straightforward, entirely model-agnostic, simply implemented using instance-wise backward selection, and able to produce more concise rationales than existing techniques. We demonstrate the utility of our interpretation method on various neural network models trained on text, image, and genomic data.

Code repository: SufficientInputSubsets

1 Introduction

The rise of neural networks and nonparametric methods in machine learning (ML) has driven significant improvements in prediction capabilities, while simultaneously earning the field a reputation of producing complex black-box models. Vital applications, which could benefit most from improved prediction, are often deemed too sensitive for opaque learning systems. Consider the widespread use of ML for screening people, including models that deny defendants' bail (Kleinberg et al., 2018) or reject loan applicants (Sirignano et al., 2018). It is imperative that such decisions can be interpretably rationalized. Interpretability is also crucial in scientific applications, where it is hoped that general principles may be extracted from accurate predictive models (Doshi-Velez and Kim, 2017; Lipton, 2016).

One simple explanation for why a particular black-box decision is reached may be obtained via a sparse subset of the input features whose values form the basis for the model’s decision – a rationale. For text (or image) data, a rationale might consist of a subset of positions in the document (or image) together with the words (or pixel-values) occurring at these positions (see Figures 2 and 8). To ensure interpretations remain fully faithful to an arbitrary model, our rationales do not attempt to summarize the (potentially complex) operations carried out within the model, and instead merely point to the relevant information it uses to arrive at a decision (Lei et al., 2016). For high-dimensional inputs, sparsity of the rationale is imperative for greater interpretability.

Here, we propose a local explanation framework to produce rationales for a learned model that has been trained to map inputs $x$ via some arbitrary learned function $f$. Unlike many other interpretability techniques, our approach is not restricted to vector-valued data and does not require gradients of $f$. Rather, each input example $x$ is solely presumed to have a set of indexable features $x = [x_1, \dots, x_p]$, where each $x_i \in \mathbb{R}^d$ for $i \in \{1, \dots, p\}$. We allow for features that are unordered (set-valued input) and whose number $p$ may vary from input to input. A rationale corresponds to a sparse subset of these indices $S \subseteq \{1, \dots, p\}$ together with the specific values of the features in this subset.

To understand why a certain decision was made for a given input example $x$, we propose a particular rationale called a sufficient input subset (SIS). Each SIS consists of a minimal input pattern present in $x$ that alone suffices for $f$ to produce the same decision, even if provided no other information about the rest of $x$. Presuming the decision is based on $f(x)$ exceeding some pre-specified threshold $\tau$, we specifically seek a minimal-cardinality subset $S$ of the input features such that $f(x_S) \ge \tau$. Throughout, we use $x_S$ to denote a modified input example in which all information about the values of features outside subset $S$ has been masked, with features in $S$ remaining at their original values. Thus, each SIS characterizes a particular standalone input pattern that drives the model toward this decision, providing sufficient justification for this choice from the model's perspective, even without any information on the values of the other features in $x$.

In classification settings, $f(x)$ might represent the predicted probability of class $C$, where we decide to assign the input to class $C$ if $f(x) \ge \tau$, with the threshold $\tau$ chosen based on precision/recall considerations. Each SIS in such an application corresponds to a small input pattern that on its own is highly indicative of class $C$, according to our model. Note that by suitably defining $f$ and $\tau$ with respect to the predictor outputs, any particular decision for input $x$ can be precisely identified with the occurrence of $f(x) \ge \tau$, where higher values of $f$ are associated with greater confidence in this decision.

For a given input $x$ where $f(x) \ge \tau$, this work presents a simple method to find a complete collection of sufficient input subsets $S_1, \dots, S_K$, each satisfying $f(x_{S_k}) \ge \tau$, such that there exists no additional SIS outside of this collection. Each SIS may be understood as a disjoint piece of evidence that would lead the model to the same decision, and why this decision was reached for $x$ can be unequivocally attributed to the SIS-collection. Furthermore, global insight on the general principles underlying the model's decision-making process may be gleaned by clustering the types of SIS extracted across different data points (see Figures 8 and 9). Such insights allow us to compare models based not only on their accuracy, but also on human-determined relevance of the concepts they target. Our method's simplicity facilitates its utilization by non-experts who may know very little about the models they wish to interrogate.

2 Related Work

Certain neural network variants such as attention mechanisms (Sha and Wang, 2017) and the generator-encoder of Lei et al. (2016) have been proposed as powerful yet human-interpretable learners. Other interpretability efforts have tailored decompositions to certain convolutional/recurrent networks (Murdoch et al., 2018; Olah et al., 2017, 2018), but these approaches are model-specific and only suited for ML experts. Many applications necessitate a model outside of these families, either to ensure supreme accuracy, or if training is done separately with access restricted to a black-box API (Caruana et al., 2015; Tramer et al., 2016).

An alternative model-agnostic approach to interpretability produces local explanations of $f$ for a particular input $x$ (e.g. an individual classification decision). Popular local explanation techniques produce attribution scores that quantify the importance of each feature in determining the output of $f$ at $x$. Examples include LIME, which locally approximates $f$ (Ribeiro et al., 2016), saliency maps based on gradients of $f$ (Baehrens et al., 2010; Simonyan et al., 2014), Layer-wise Relevance Propagation (Bach et al., 2015), as well as the discrete DeepLIFT approach (Shrikumar et al., 2017) and its continuous variant, Integrated Gradients (IG) (Sundararajan et al., 2017), developed to ensure attributions reflect the cumulative difference in $f$ at $x$ vs. a reference input. A separate class of input-signal-based explanation techniques such as DeConvNet (Zeiler and Fergus, 2014), Guided Backprop (Springenberg et al., 2015), and PatternNet (Kindermans et al., 2018) employ gradients of $f$ in order to identify input patterns that cause $f$ to output large values. However, many such gradient-based saliency methods have been found unreliable, depending not only on the learned function $f$, but also on its specific architectural implementation and how inputs are scaled (Kindermans et al., 2017, 2018). More similar to our approach are the recent techniques of Kim et al. (2018) and Chen et al. (2018), which also aim to identify input patterns that best explain certain decisions, but additionally require either a predefined set of such patterns or an auxiliary neural network trained to identify them.

In comparison with the aforementioned methods, our SIS approach presented here is conceptually simple, completely faithful to any type of model, requires no access to gradients of $f$, requires no additional training of the underlying model $f$, and does not require training any auxiliary explanation model. Also related to our subset-selection methodology are the ideas of Li et al. (2017) and Fong and Vedaldi (2017), which for a particular input example aim to identify a minimal subset of features whose deletion causes a substantial drop in $f$ such that a different decision would be reached. However, this objective can undesirably produce adversarial artifacts that are not easy to interpret (Fong and Vedaldi, 2017). In contrast, we focus on identifying disjoint minimal subsets of input features whose values suffice to ensure $f$ outputs significantly positive predictions, even in the absence of any other information about the rest of the input. While the techniques of Li et al. (2017) and Fong and Vedaldi (2017) produce rationales that remain strongly dependent on the rest of the input outside of the selected feature subset, each rationale revealed by our SIS approach is independently considered by $f$ as an entirely sufficient justification for a particular decision in the absence of other information.

3 Methods

Our approach to rationalizing why a particular black-box decision is reached only applies to input examples $x$ that meet the decision criterion $f(x) \ge \tau$. For such an input $x$, we aim to identify a SIS-collection of disjoint feature subsets $S_1, \dots, S_K$ that satisfy the following criteria:

  1. $f(x_{S_k}) \ge \tau$ for each $k = 1, \dots, K$

  2. There exists no feature subset $S' \subset S_k$ for some $k$ such that $f(x_{S'}) \ge \tau$

  3. $f(x_R) < \tau$ for $R = \{1, \dots, p\} \setminus (S_1 \cup \dots \cup S_K)$ (the remaining features outside of the SIS-collection)

Criterion (1) ensures that for any SIS $S_k$, the values of the features in this subset alone suffice to justify the decision in the absence of any information regarding the values of the other features. To ensure information that is not vital to reach the decision is not included within the SIS, criterion (2) encourages each SIS to contain a minimal number of features, which facilitates interpretability. Finally, we require that our SIS-collection satisfies a notion of completeness via criterion (3), which states that the same decision is no longer reached for the input after the entire SIS-collection has been masked. This implies the remaining feature values of the input no longer contain sufficient evidence for the same decision. Figures 2 and 8 show SIS-collections found in text/image inputs.

Recall that $x_S$ denotes a modified input in which the information about the values of features outside subset $S$ is considered to be missing. We construct $x_S$ as a new input whose values on features in $S$ are identical to those in the original $x$, and whose remaining features $x_i$ (for $i \notin S$) are each replaced by a special mask $z_i$ used to represent a missing observation. While certain models are specially adapted to handle inputs with missing observations (Smola et al., 2005), this is generally not the case. To ensure our approach is applicable to all models, we draw inspiration from data imputation techniques, which are a common way to represent missing data (Rubin, 1976).

Two popular strategies include hot-deck imputation, in which unobserved values are sampled from their marginal feature distribution, and mean imputation, in which each $z_i$ is simply fixed to the average value of feature $i$ in the data. Note that for a linear model, these two strategies are expected to produce an identical change in prediction. We find in practice that the change in predictions resulting from either masking strategy is roughly equivalent even for nonlinear models such as neural networks (Figure S12). In this work, we favor the mean-imputation approach over sampling-based imputation, which would be computationally expensive and nondeterministic (undesirable for facilitating interpretability). One may also view $z$ as the baseline input value used by feature attribution methods (Sundararajan et al., 2017; Shrikumar et al., 2017), a value which should not lead to particularly noteworthy decisions. Since our interests primarily lie in rationalizing atypical decisions, the average input arising from mean imputation serves as a suitable baseline. Zeros have also been used to mask image/categorical data (Li et al., 2017), but empirically, this mask appears undesirably more informative than the mean (predictions are more affected by zero-masking).
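For concreteness, the masking operation itself can be sketched in a few lines of Python. This is an illustrative implementation assuming dense vector-valued inputs and a hypothetical training matrix `x_train`; it is not tied to any released code.

```python
import numpy as np

def make_mask(x_train):
    # Mean-imputation mask: one "average" value per feature,
    # estimated from the training data (z_i = mean of feature i).
    return x_train.mean(axis=0)

def mask_outside(x, S, mask):
    # Build x_S: features in S keep their observed values,
    # all other features are replaced by the mask value.
    x_s = mask.copy()
    for i in S:
        x_s[i] = x[i]
    return x_s
```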

For an arbitrarily complex function $f$ over inputs with many features, the combinatorial search to identify sets which satisfy objectives (1)-(3) is computationally infeasible. To find a SIS-collection in practice, we employ a straightforward backward selection strategy, which is here applied separately on an example-by-example basis (unlike standard statistical tools, which perform backward selection globally to find a fixed set of features for all inputs). The SIScollection algorithm (Algorithm 1) details our straightforward procedure to identify disjoint SIS subsets that satisfy (1)-(3) approximately (as detailed in §3.1) for an input $x$ where $f(x) \ge \tau$.

Algorithm 1  SIScollection($f$, $x$, $\tau$)
  $S \leftarrow \{1, \dots, p\}$
  for $j = 1, 2, \dots$ do
    $R \leftarrow$ BackSelect($f$, $x$, $S$)
    $S_j \leftarrow$ FindSIS($f$, $x$, $\tau$, $R$)
    $S \leftarrow S \setminus S_j$
    if $f(x_S) < \tau$: return $S_1, \dots, S_j$

Algorithm 2  BackSelect($f$, $x$, $S$)
  $R \leftarrow$ empty stack
  while $|S| > 0$ do
    $i^* \leftarrow \arg\max_{i \in S} f(x_{S \setminus \{i\}})$
    Update $S \leftarrow S \setminus \{i^*\}$
    Push $i^*$ onto top of $R$
  return $R$

Algorithm 3  FindSIS($f$, $x$, $\tau$, $R$)
  $S \leftarrow \emptyset$
  while $f(x_S) < \tau$ and $R$ is nonempty do
    Pop $i^*$ from top of $R$; update $S \leftarrow S \cup \{i^*\}$
  if $f(x_S) \ge \tau$: return $S$; else: return None

Our overall strategy is to find a SIS subset (via BackSelect and FindSIS), mask it out, and then repeat these two steps, restricting each search for the next SIS solely to features disjoint from the currently found SIS-collection $S_1, \dots, S_j$, until the decision of interest is no longer supported by the remaining feature values. In the BackSelect procedure, $S$ denotes the set of remaining unmasked features that are to be considered during backward selection. For the current subset $S$, the argmax step in BackSelect identifies which remaining feature $i^*$ produces the minimal reduction in $f$ (meaning it least reduces the output of $f$ if additionally masked), a question trivially answered by running each of the remaining possibilities through the model. This strategy aims to gradually mask out the least important features in order to reveal the core input pattern that is perceived by the model as sufficient evidence for its decision. Finally, we build our SIS up from the last features omitted during the backward selection, selecting just enough of them to meet our sufficiency criterion (1). Because this approach always queries a prediction over the joint set of remaining features, it is better suited to account for interactions between these features and ensure their sufficiency (i.e. that $f(x_S) \ge \tau$) compared to a forward selection in the opposite direction, which would build the SIS upward one feature at a time by greedily maximizing marginal gains. Throughout its execution, BackSelect attempts to maintain the sufficiency of $x_S$ as the set $S$ shrinks.
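Under the same assumptions (a callable `f` returning the scalar decision value, and the `mask_outside` helper sketched earlier), the three procedures translate into the following illustrative Python. This is our sketch of the pseudocode above, not a reference implementation.

```python
def back_select(f, x, S, mask):
    # Repeatedly mask the feature whose removal least reduces the
    # prediction, recording the removal order on a stack R
    # (last removed = top of stack = end of list).
    S, R = set(S), []
    while S:
        i_star = max(S, key=lambda i: f(mask_outside(x, S - {i}, mask)))
        S.remove(i_star)
        R.append(i_star)
    return R

def find_sis(f, x, tau, R, mask):
    # Rebuild a candidate SIS from the most recently removed features
    # until the prediction again reaches the threshold tau.
    R, S = list(R), set()
    while R and f(mask_outside(x, S, mask)) < tau:
        S.add(R.pop())          # pop from the top of the stack
    return S if f(mask_outside(x, S, mask)) >= tau else None

def sis_collection(f, x, p, tau, mask):
    # Repeatedly extract one SIS and mask it out, restricting each new
    # backward selection to features disjoint from previous SIS.
    S, collection = set(range(p)), []
    while f(mask_outside(x, S, mask)) >= tau:
        R = back_select(f, x, S, mask)
        sis = find_sis(f, x, tau, R, mask)
        if not sis:             # e.g. the fully masked input already
            break               # exceeds tau, so no informative SIS
        collection.append(sis)
        S -= sis
    return collection
```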

3.1 Properties of the SIS-collection

Given $p$ input features, our algorithm requires $O(p^2)$ evaluations of $f$ to identify each SIS, but we can achieve $O(p)$ sequential model queries by parallelizing each argmax in BackSelect (e.g. batching on GPU). Throughout, let $S_1, \dots, S_K$ denote the output of SIScollection when applied to a given input $x$ for which $f(x) \ge \tau$. Disjointness of these sets is crucial to ensure computational tractability and that the number of SIS per example does not grow huge and hard to interpret. Proposition 1 below proves that each SIS produced by our procedure will satisfy an approximate notion of minimality. Because we desire minimality of the SIS as specified by (2), it is not appropriate to terminate the backward elimination in BackSelect as soon as the sufficiency condition is violated, due to the possible presence of local minima in $f$ along the path of subsets encountered during backward selection (as shown in Figure S5).

Proposition 2 additionally guarantees that masking out the entirety of the feature values in the SIS-collection will ensure the model makes a different decision. It is thus necessarily the case that the observed values responsible for this decision lie within the SIS-collection $S_1 \cup \dots \cup S_K$. We point out that for an easily reached decision, where $f(x_\emptyset) \ge \tau$ (i.e. this decision is reached even for the fully-masked, average input), our approach will not output any SIS. Because this same decision would likely anyway be reached for a vast number of inputs in the training data (as a sort of default decision), it is conceptually difficult to grasp what particular aspect of the given $x$ is responsible.

Proposition 1.

There exists no feature in any set $S_k$ that can be additionally masked while retaining sufficiency of the resulting subset (i.e. $f(x_{S_k \setminus \{i\}}) < \tau$ for any $i \in S_k$). Also, among all subsets considered during the backward selection phase used to produce $S_k$, this set has the smallest cardinality of those which satisfy $f(x_S) \ge \tau$.

Proposition 2.

For $x_R$, modified by masking all features in the entire SIS-collection (i.e. $R = \{1, \dots, p\} \setminus (S_1 \cup \dots \cup S_K)$), we must have $f(x_R) < \tau$ when $f(x_\emptyset) < \tau$.

Unfortunately, nice assumptions like convexity/submodularity are inappropriate for estimated functions in ML. We instead present various simple forms of practical decision functions for which our algorithms are guaranteed to produce desirable explanations. Example 1 considers interpreting functions of a generalized linear form, Examples 2 & 3 describe functions whose operations resemble generalized logical OR & AND gates, and Example 4 considers functions that seek out a particular input pattern. Note that features ignored by $f$ are always masked in our backward selection and thus never appear in the resulting SIS-collection.

Example 1.

Suppose the input data are vectors $x \in \mathbb{R}^p$ and $f(x) = g(\beta^\top x)$, where $g$ is monotonically increasing. We also presume $f(x) \ge \tau$ and that the data were centered such that each feature has mean zero, so each mask $z_i = 0$ (for ease of notation). In this case, $S_1$ must satisfy criteria (1)-(3). $S_1$ will consist of the features whose indices correspond to the largest entries of $\{\beta_i x_i\}_{i=1}^p$, for some suitable cardinality that depends on the value of $f(x)$. It is also guaranteed that $f(x_{S_1}) \ge f(x_S)$ for any subset $S$ of the same cardinality $|S| = |S_1|$. For each individual feature $i$ where $g(\beta_i x_i) \ge \tau$, there will exist a corresponding SIS consisting only of $x_i$. No SIS will include features whose coefficient $\beta_i = 0$, or those whose difference between the observed and average value ($x_i - z_i = x_i$ here) is of an opposite sign than the corresponding model coefficient (i.e. $\beta_i x_i < 0$).

Example 2.

Let $f(x) = \max\{f_1(x_{A_1}), \dots, f_K(x_{A_K})\}$ for some disjoint feature subsets $A_1, \dots, A_K$ and functions $f_1, \dots, f_K$, such that for the given $x$ and threshold $\tau$: $f_k(x_{A_k}) \ge \tau$ and $f_k(x_{S'}) < \tau$ for each proper subset $S' \subsetneq A_k$. Such $f_k$ might be functions that model strong interactions between the features in each $A_k$ or look for highly specific value patterns to occur in these subsets. In this case, SIScollection will return $K$ sets $S_1, \dots, S_K$ such that $S_k = A_k$ (up to reordering).

Example 3.

If $f(x) = \min\{f_1(x_{A_1}), \dots, f_K(x_{A_K})\}$ and the same conditions from Example 2 are met, then SIScollection will return a single set $S_1 = A_1 \cup \dots \cup A_K$.

Example 4.

Suppose $f(x) = g(\|x_A - v\|)$, where $g$ is monotonically decreasing and $v$ specifies a fixed pattern of input values for features in a certain subset $A$. For an input $x$ with $x_A = v$ and threshold choice $\tau \le f(x)$, SIScollection will return a single set $S_1 = A$.

Figure 1: Beer review with one sufficient input subset identified for the prediction of each aspect.

Figure 2: Beer review with three disjoint SIS identified for a positive aroma prediction. Underlined are sentences that human labelers manually annotated as capturing the aroma sentiment.
Figure 3: Prediction on rationales only vs. rationale length for various methods in reviews with positive aroma prediction.
Figure 4: QHS vs. similarity between SIS & annotation in the reviews with positive aroma sentiment (Pearson $r$, $p$-value).

4 Results

We apply our methods to analyze neural networks for text, DNA, and image data. SIScollection is compared with alternative subset-selection methods for producing rationales (see descriptions in Supplement §S1). Note that our BackSelect procedure determines an ordering of elements, $R$, subsequently used to construct the SIS. Depictions of each SIS are shaded based on the feature order in $R$ (darker = later), which can indicate relative feature importance within the SIS.

In the “Suff. IG,” “Suff. LIME,” and “Suff. Perturb.” (sufficiency constrained) methods, we instead compute the ordering of elements $R$ according to the feature attribution values output by integrated gradients (Sundararajan et al., 2017), LIME (Ribeiro et al., 2016), or a perturbative approach that measures the change in prediction when individually masking each feature (see §S1). The rationale subset produced under each method is subsequently assembled using FindSIS exactly as in our approach and thus is guaranteed to satisfy $f(x_S) \ge \tau$. In the “IG,” “LIME,” and “Perturb.” (length constrained) methods, we use the same previously described ordering $R$, but always select the same number of features in the rationale as in the SIS produced by our method (per example). We also compare against the additional “Top IG” method, in which top features from $R$ are added into the rationale until the sum of their integrated gradients attributions suggests that the rationale has met our sufficiency criterion (see §S1).
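Concretely, a sufficiency constrained baseline only swaps the source of the ordering; here is a hypothetical sketch reusing `find_sis` from the earlier code, with `attributions` standing in for whatever per-feature scores the baseline method produces:

```python
def suff_attribution_rationale(f, x, tau, attributions, mask):
    # Sort so the highest-attribution feature ends up on top of the
    # stack R, then assemble the rationale with FindSIS as before.
    R = sorted(range(len(attributions)), key=lambda i: attributions[i])
    return find_sis(f, x, tau, R, mask)
```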

Figure 5: Two DNA sequences that receive positive TF binding predictions for the MAFF factor (SIS is shaded).

Figure 6: (a) KL divergence between JASPAR motifs (known ground truth) and rationales found via various methods. Shown are results for 422 TF datasets (each one summarized by median divergence). (b) In the SIS found in data from one TF, DBSCAN identified two clusters (most frequently occurring SIS in each cluster shown below). (c) Known JASPAR motif (top) and alignment with cluster modes (bottom).

Cluster 1 SIS     Freq.
GCTGAGTCAT        197
ATGACTCAGC        185
GCTGAGTCA-C       83
GCTGAGTCAC        53
GCTGACTCAGCA      42

Cluster 2 SIS       Freq.
TGCTGA----GCA-TTT   12
GCTGAC---GCA-TTT    8
TGCTGAC---GCA-TT    6
TGCTGAC---GCA-AA    5
TGCTGAC---GCA-AT    4

4.1 Sentiment Analysis of Reviews

We first consider a dataset of beer reviews from McAuley et al. (2012). Taking the text of a review as input, different LSTM networks (Hochreiter and Schmidhuber, 1997) are trained to predict user-provided numerical ratings of aspects like aroma, appearance, and palate (details in §S4). Figure 1 shows a sample beer review where we highlight the SIS identified for the LSTM that predicts each aspect. Each SIS only captures sentiment toward the relevant aspect. Figure 2 depicts the SIS-collection identified from a review the LSTM decided to flag for positive aroma.

Figure 3 shows that when the alternative methods described above are length constrained, the rationales they produce often badly fail to meet our sufficiency criterion. Thus, even though the same number of feature values are preserved in the rationale, and these alternative methods select the features to which they have assigned the largest attribution values, their rationales lead to significantly reduced outputs compared to our SIS subsets. If the sufficiency constraint is instead enforced for these alternative methods, the rationales they identify become significantly larger than those produced by SIScollection, and also contain many more unimportant features (Table S2, Figure S14).

Benchmarking interpretability methods is difficult because a learned $f$ may behave counterintuitively, such that seemingly unreasonable model explanations are in fact faithful descriptions of a model's decision-making process. For some reviews, a human annotator has manually selected which sentences carry the relevant sentiment for the aspect of interest, so we treat these annotations as an alternative rationale for the LSTM prediction. For a review whose true and predicted aroma exceed our decision threshold, we define the quality of human-selected sentences for model explanation as $\mathrm{QHS}(x) = f(x_H)$, where $H$ is the human-selected subset of words in the review (see examples in Figure S18). High variability of $\mathrm{QHS}(x)$ in the annotated reviews (Figure 4) indicates the human rationales often do not contain sufficient information to preserve the LSTM's decision. Figure 4 shows the LSTM makes many decisions based on different subsets of the text than the parts that humans find appropriate for this task. Reassuringly, our SIS more often lie within the selected annotation for reviews with high $\mathrm{QHS}$ scores.

4.2 Transcription Factor Binding

We next analyze convolutional neural networks (CNNs) used to classify whether a given transcription factor (TF) will bind to a specific DNA sequence (Zeng et al., 2016). From 422 different datasets of DNA sequences bound-or-not by different TFs (and 422 different CNN models), we extract SIS-collections from sequences with high (top 10%) predicted binding affinity for the TF profiled in each dataset (details in §S2). Figure 5 depicts two input examples and the corresponding identified SIS. Again, rationales produced via our SIS approach are shorter and better at preserving large $f$-values than rationales from other methods (Figures S3 and S4).

Figure 7: Eight clusters of SIS identified from examples of digit 4. Each row contains fifteen random SIS from a single cluster.


Figure 8: (a) SIS for correctly classified 9 (1st column) and when adversarially perturbed toward class 4 (2nd column). (b) SIS for digits 5 that are misclassified as 6 (1st column) and as 0 (2nd column).

To predict binding so accurately, the CNN must faithfully reflect the biological mechanisms that relate the DNA sequence to the probability of TF occupancy. We evaluate the rationales found by our methods against known TF binding motifs from JASPAR (Mathelier et al., 2015), adopting the KL divergence between the known motif and each proposed rationale as a quality measure (see §S2.3). Figure 6 shows the divergence of rationales produced by SIScollection is significantly lower than that of rationales identified using other methods (Wilcoxon test in all cases). SIS is thus more effective at uncovering the underlying biological principles than the alternative methods we applied.

4.3 MNIST Digit Classification

Finally, we study a 10-way CNN classifier trained on the MNIST handwritten digits data (LeCun et al., 1998). Here, we only consider predicted probabilities for one class of interest at a time and always set $\tau = 0.7$ as the probability threshold for deciding that an image belongs to the class. We extract the SIS-collection from all corresponding test set examples (details in §S3). Example images and corresponding SIS-collections are shown in Figures 8 and S8. Figure 8a illustrates how the SIS-collection drastically changes for an example of a correctly-classified 9 that has been adversarially manipulated (Carlini and Wagner, 2017) to become confidently classified as the digit 4. Furthermore, these SIS-collections immediately enable us to understand why certain misclassifications occur (Figure 8b).

4.4 Clustering SIS for General Insights

Identifying the different input patterns that justify a decision can help us better grasp the general operating principles of a model. To this end, we cluster all of the SIS produced by SIScollection applied across a large number of data examples that received the same decision. Clustering is done via DBSCAN, a widely applicable algorithm that merely requires specifying pairwise distances between points (Ester et al., 1996).

We first apply this procedure to the SIS found across all test-set DNA sequences which our CNN model predicted would be bound by some TF. Here, the pairwise distance between two sufficient input subsets is taken to be the Levenshtein (edit) distance. Figure 6 shows the clusters for a particular TF where two SIS clusters were found. Despite no contiguity being enforced in our algorithm, each cluster consists of short sequences that clearly capture different aspects of the underlying DNA motif known to bind this TF.
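As a sketch, this clustering step can be reproduced with scikit-learn's DBSCAN on a precomputed edit-distance matrix; the `eps` and `min_samples` values below are illustrative placeholders, not the settings used in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a, b):
    # Classic one-row dynamic-programming edit distance.
    d = np.arange(len(b) + 1)
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (ca != cb))
    return d[len(b)]

def cluster_sis(sis_strings, eps=2, min_samples=5):
    # Pairwise edit distances between SIS sequences, fed to DBSCAN
    # via a precomputed distance matrix.
    n = len(sis_strings)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = levenshtein(sis_strings[i], sis_strings[j])
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="precomputed").fit_predict(D)
```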

Figure 9: Jointly clustering the MNIST digit 4 SIS from CNN and MLP. We list the percentage of SIS in each cluster stemming from the CNN (rest from MLP).
Figure 10: Predictions by one model on the SIS extracted from the other model in: (a) beer reviews with positive LSTM/CNN aroma predictions, and (b) MNIST digits confidently classified as 4 by CNN/MLP.

We also apply DBSCAN clustering to the SIS found across all MNIST test examples confidently identified by the CNN as a particular class. Pairwise distances are here defined as the energy distance (Rizzo and Székely, 2016) over pixel locations between two SIS subsets (see §S3.3). Figure 7 depicts the SIS clusters identified for digit 4 (others in Figure S9). These reveal distinct feature patterns learned by the CNN to distinguish 4 from other digits, which are clearly present in the vast majority of test set images confidently classified as a 4. For example, one cluster depicts parallel slanted lines, a pattern that never occurs in other digits.

Subsequently, we cluster the SIS found across held-out beer reviews (Test-Fold in Table S1) that received positive aroma predictions from our LSTM network. The distance between two SIS is taken as the Jaccard distance between their bag-of-words representations. Three clusters depicted in Table 1 (rest in Tables S3 and S4) reveal isolated phrases that the LSTM associates with positive aromas in the absence of other context.

Table 1: 3 clusters of SIS extracted from beer reviews with positive LSTM aroma predictions. Each row shows the 4 most frequent unique SIS in a cluster (each SIS shown as an ordered word list with text positions omitted). Each unique SIS can be present many times in one cluster.

Clu. 1: smell amazing wonderful nice wonderful nose wonderful amazing amazing amazing
Clu. 2: grapefruit mango pineapple pineapple grapefruit pineapple grapefruit hops grapefruit pineapple floyds mango pineapple incredible
Clu. 3: creme brulee brulee creme brulee decadent incredible creme brulee creme brulee exceptional

Table 2: Joint clustering of the SIS from beer reviews predicted to have positive aroma by LSTM or CNN. Dashes are used in clusters with under 4 unique SIS. Percentages quantify the share of SIS per cluster stemming from the LSTM.

LSTM 0%:  delicious - - -
LSTM 0%:  very nice - - -
LSTM 20%: rich chocolate very rich chocolate complex smells rich
LSTM 33%: oak chocolate chocolate raisins raisins oak bourbon chocolate oak raisins chocolate
LSTM 70%: complex aroma aroma complex peaches complex aroma complex interesting cherries aroma complex

The general insights revealed by our SIS-clustering can also be used to compare the operating behavior of different models. For the beer reviews, we also train a CNN to compare with our existing LSTM (see §S4.6). For MNIST, we train a multilayer perceptron (MLP) and compare to our existing CNN (see §S3.5). Both networks exhibit similar performance in each task, so it is not immediately clear which model would be preferable in practice. Figure 10 shows the SIS extracted under one model are typically insufficient to receive the same decision from the other model, indicating these models base their positive predictions on different evidence.

Figure 9 depicts results from a joint clustering of all SIS extracted from held-out MNIST images confidently classified as a 4 by either the MLP or CNN. Evidently, our MNIST-CNN bases its confidence primarily on spatially-contiguous strokes comprising only a small portion of each digit. MLP-decisions are in contrast based on pixels located throughout the digit, demonstrating this model relies more on the global shape of the handwriting. Thus, the CNN is more susceptible to mistaking other (non-digit) handwritten characters for 4s if they happen to share some of the same strokes. Table 2 contains results of jointly clustering the SIS extracted from beer reviews with positive aroma predictions under our LSTM or text-CNN. This CNN tends to learn localized (unigram/bigram) word patterns, while the LSTM identifies more complex multi-word interactions that truly seem more relevant to the target aroma value. Many CNN-SIS are simply phrases with universally-positive sentiment, indicating this model is less capable of distinguishing between positive sentiment toward aroma vs. other aspects such as taste/look.

5 Discussion

This work introduced the idea of interpreting black-box decisions on the basis of sufficient input subsets – minimal input patterns that alone provide sufficient evidence to justify a particular decision. Our methodology is easy to understand for non-experts, applicable to all ML models without any additional training steps, and remains fully faithful to the underlying model without making approximations. While we focus on local explanations of a single decision, clustering the SIS-patterns extracted from many data points reveals insights about a model’s general decision-making process. Given multiple models of comparable accuracy, SIS-clustering can uncover critical operating differences, such as which model is more susceptible to spurious training data correlations or will generalize worse to counterfactual inputs that lie outside the data distribution.

Acknowledgements

We thank Haoyang Zeng and Ge Liu for helping with the TF data/models. This work was supported by NIH Grants R01CA218094, R01HG008363, and R01HG008754.

References

  • Bach et al. (2015) Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One, 10(7):e0130140.
  • Baehrens et al. (2010) Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Müller, K.-R. (2010). How to explain individual classification decisions. Journal of Machine Learning Research, 11:1803–1831.
  • Carlini and Wagner (2017) Carlini, N. and Wagner, D. (2017). Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy.
  • Caruana et al. (2015) Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., and Elhadad, N. (2015). Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Chen et al. (2018) Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. (2018). Learning to explain: An information-theoretic perspective on model interpretation. In International Conference on Machine Learning.
  • Doshi-Velez and Kim (2017) Doshi-Velez, F. and Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv:1702.08608.
  • Ester et al. (1996) Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
  • Fong and Vedaldi (2017) Fong, R. C. and Vedaldi, A. (2017). Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision.
  • Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
  • Kim et al. (2018) Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., and Sayres, R. (2018). Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning.
  • Kindermans et al. (2017) Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. (2017). The (un)reliability of saliency methods. In NIPS Workshop: Interpreting, Explaining and Visualizing Deep Learning - Now what?
  • Kindermans et al. (2018) Kindermans, P.-J., Schütt, K. T., Alber, M., Müller, K.-R., Erhan, D., Kim, B., and Dähne, S. (2018). Learning how to explain neural networks: PatternNet and PatternAttribution. In International Conference on Learning Representations.
  • Kleinberg et al. (2018) Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., and Mullainathan, S. (2018). Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
  • Lei et al. (2016) Lei, T., Barzilay, R., and Jaakkola, T. (2016). Rationalizing neural predictions. In Empirical Methods in Natural Language Processing.
  • Li et al. (2017) Li, J., Monroe, W., and Jurafsky, D. (2017). Understanding neural networks through representation erasure. arXiv:1612.08220.
  • Lipton (2016) Lipton, Z. C. (2016). The mythos of model interpretability. In ICML Workshop on Human Interpretability of Machine Learning.
  • Mathelier et al. (2015) Mathelier, A., Fornes, O., Arenillas, D. J., Chen, C.-y., Denay, G., Lee, J., Shi, W., Shyr, C., Tan, G., Worsley-Hunt, R., et al. (2015). JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Research, 44(D1):D110–D115.
  • McAuley et al. (2012) McAuley, J., Leskovec, J., and Jurafsky, D. (2012). Learning attitudes and attributes from multi-aspect reviews. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 1020–1025. IEEE.
  • Murdoch et al. (2018) Murdoch, W. J., Liu, P. J., and Yu, B. (2018). Beyond word importance: Contextual decomposition to extract interactions from LSTMs. In International Conference on Learning Representations.
  • Olah et al. (2017) Olah, C., Mordvintsev, A., and Schubert, L. (2017). Feature visualization. Distill.
  • Olah et al. (2018) Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K., and Mordvintsev, A. (2018). The building blocks of interpretability. Distill.
  • Ribeiro et al. (2016) Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.
  • Rizzo and Székely (2016) Rizzo, M. L. and Székely, G. J. (2016). Energy distance. Wiley Interdisciplinary Reviews: Computational Statistics, 8(1):27–38.
  • Rubin (1976) Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.
  • Sha and Wang (2017) Sha, Y. and Wang, M. D. (2017). Interpretable predictions of clinical outcomes with an attention-based recurrent neural network. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics.
  • Shrikumar et al. (2017) Shrikumar, A., Greenside, P., and Kundaje, A. (2017). Learning important features through propagating activation differences. In International Conference on Machine Learning.
  • Simonyan et al. (2014) Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. In International Conference on Learning Representations.
  • Sirignano et al. (2018) Sirignano, J. A., Sadhwani, A., and Giesecke, K. (2018). Deep learning for mortgage risk. arXiv:1607.02470.
  • Smola et al. (2005) Smola, A. J., Vishwanathan, S., and Hofmann, T. (2005). Kernel methods for missing variables. In Artificial Intelligence and Statistics.
  • Springenberg et al. (2015) Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2015). Striving for simplicity: The all convolutional net. In International Conference on Learning Representations.
  • Sundararajan et al. (2017) Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks. In International Conference on Machine Learning.
  • Tramer et al. (2016) Tramer, F., Zhang, F., Juels, A., Reiter, M. K., and Ristenpart, T. (2016). Stealing machine learning models via prediction APIs. In USENIX Security Symposium.
  • Zeiler and Fergus (2014) Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision.
  • Zeng et al. (2016) Zeng, H., Edwards, M. D., Liu, G., and Gifford, D. K. (2016). Convolutional neural network architectures for predicting dna–protein binding. Bioinformatics, 32(12):i121.

Supplementary Information for: What made you do this? Understanding black-box decisions with sufficient input subsets


S1 Detailed Description of Alternative Methods

In Section 4, we describe a number of alternative methods for identifying rationales for comparison with our method. We use methods based on integrated gradients (Sundararajan et al., 2017), LIME (Ribeiro et al., 2016), and feature perturbation. Note that integrated gradients is an attribution method which assigns a numerical score to each input feature. LIME likewise assigns a weight to each feature using a local linear regression model for $f$ around $x$. In the perturbative approach, we compute the change in prediction when each feature is individually masked, as in Equation 1 (of Section S4.4). Each of these feature orderings is used to construct a rationale using the FindSIS procedure (Section 3) for the “Suff. IG,” “Suff. LIME,” and “Suff. Perturb.” (sufficiency constrained) methods.

Note that our text classification architecture (described in Section S4.2) encodes discrete words as 100-dimensional continuous word embeddings. The integrated gradients method returns attribution scores for each coordinate of each word embedding. For each word embedding $x_i \in \mathbb{R}^{100}$, we summarize the attributions $a_i \in \mathbb{R}^{100}$ along the corresponding embedding dimensions into a single score $s_i$ using the $L^2$ norm, $s_i = \|a_i\|_2$, and compute the ordering $R$ by sorting the $s_i$ values.
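As an illustrative snippet (the choice of the $L^2$ norm here mirrors our reading above and should be treated as an assumption; the extraction leaves the exact norm ambiguous):

```python
import numpy as np

def word_scores(ig_attributions):
    # ig_attributions: (num_words, 100) matrix of integrated gradients
    # attributions; collapse each row to one score per word via the
    # L2 norm (assumed; see text).
    return np.linalg.norm(ig_attributions, axis=1)

def word_ordering(ig_attributions):
    # Word positions sorted by increasing score, so the most important
    # word ends up last (top of the stack R consumed by FindSIS).
    return list(np.argsort(word_scores(ig_attributions)))
```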

We use an implementation of integrated gradients for Keras-based models from https://github.com/hiranumn/IntegratedGradients. In the case of the beer review dataset (Section 4.1), we use the mean embedding vector as a baseline for computing integrated gradients. In the case of TF binding (Section 4.2), we use the uniform mean vector as the baseline reference value. As suggested in Sundararajan et al. (2017), we verified that the prediction at the baseline plus the sum of the integrated gradients attributions approximately equals the prediction at the input.

For LIME and our beer reviews dataset, we use the approach described in Ribeiro et al. (2016) for textual data, where individual words are removed entirely from the input sequence. In our TF binding dataset, LIME replaces bases with the unknown N base (represented as the uniform distribution over A, C, G, T). We use the implementation of LIME at https://github.com/marcotcr/lime. The LimeTextExplainer module is used with default parameters, except we set the maximal number of features used in the regression to be the full input length so we can order all input features.

Additionally, we explore methods in which we use the same ordering $R$ produced by these alternative methods, but select the number of input features in the rationale to be the median SIS length in the SIS-collection computed by our method on each example: the “IG,” “LIME,” and “Perturb.” (length constrained) methods. In the TF binding models, we use a baseline of zero vectors such that the integrated gradients result along the encoded sequence is also one-hot. We compute the feature ordering based on the absolute value of the non-zero integrated gradients attributions.

In the TF binding data (Section 4.2), we add an additional method, “Top IG,” in which we compute integrated gradients using an all-zeros baseline and order features by attribution magnitude (as in the length constrained IG method). But here, we select elements for the rationale by finding the minimum number of elements necessary such that the sum of the integrated gradients attributions of those features reaches $\tau - f(x^0)$, where $x^0$ is the all-zeros baseline for integrated gradients. Note that for the length constrained and Top IG methods, there is no guarantee of sufficiency ($f(x_S) \ge \tau$) for any input subset $S$.

S2 Details of the Transcription Factor Binding Analysis

S2.1 Dataset and Model

We use the motif occupancy datasets from Zeng et al. (2016) (available at http://cnn.csail.mit.edu), where each dataset originates from a ChIP-seq experiment from the ENCODE project (ENCODE Project Consortium, 2012). Each of the 422 datasets studies a particular transcription factor and contains between 600 and 700,000 (median 50,000) 101 base-pair DNA sequences (inputs), each associated with a binary label based on whether the sequence is bound by the TF or not. Each dataset also contains a test set ranging between 150 and 170,000 sequences (median 12,000). Here, the positive and negative classes in each dataset are balanced, and we filter out all sequences containing the unknown base (N). The nucleotide occurring at each base position (A, C, G, T) is encoded as a one-hot representation which is fed into the CNN. Zeng et al. (2016) showed that convolutional neural network architectures outperform other models for this TF binding prediction task.

For each of the 422 prediction tasks, we employ the best-performing “1layer_128motif” architecture from Zeng et al. (2016), defined as follows (a Keras sketch of this architecture appears after the training details below):

  1. Input: (101 x 4) sequence encoding

  2. Convolutional Layer 1: Applies 128 kernels of window size 24, with ReLU activation

  3. Global Max Pooling Layer 1: Performs global max pooling

  4. Dense Layer 1: 32 neurons, with ReLU activation and dropout probability 0.5

  5. Dense Layer 2: 1 neuron (output probability), with sigmoid activation

We hold out 1/8 of each train set for validation and minimize binary cross-entropy using the Adadelta optimizer (Zeiler, 2012) with default parameter settings in Keras (Chollet et al., 2015). We train each model on each of the 422 datasets for 10 epochs (using batch size 128) with early stopping based on validation loss. Figure S1 shows the area under the receiver operating curve (AUC) over the 422 datasets, and we note that the performance of our models closely resembles that in Zeng et al. (2016).

S2.2 Rationale Length Comparison Between SIS and Other Methods

For each dataset, we define the sufficiency threshold $\tau$ as the 90th percentile of the predictive distribution on all test sequences. The distribution of thresholds is shown in Figure S2. We compute the complete set of sufficient input subsets for each corresponding test sequence. Since the A, C, G, T nucleotides all occur with similar frequency in this data, our SIS analysis simply masks each base using the uniform embedding $(0.25, 0.25, 0.25, 0.25)$. This is also the standard strategy to represent unknown “N” nucleotides in DNA sequences, which typically arise from issues in read quality. We generally find that there is only a single SIS per example for the sequences in these datasets.

Figure S1: Median area under the receiver operating curve (AUC) for all 422 transcription factor binding motif occupancy datasets. The validation set is held-out at training but used to choose model parameters; the test set is not seen until after training.
Figure S2: Thresholds used for identifying sufficient input subsets in TF binding datasets. In each dataset, the threshold is defined as the 90th percentile of the predictive test distribution.

On each dataset, we compute the median rationale length (as number of bases in the rationale). The distribution of median rationale length over all datasets by various methods is shown in Figure S3. Note that for the IG, LIME, and Perturb. methods, rationale length was constrained to the length of the rationales produced by our method. For the Top IG method, neither sufficiency nor length constraints are enforced. We see that when the sufficiency constraint is enforced in alternative methods (Suff. IG), the rationales are significantly longer than those identified by SIS. Moreover, as shown in Figure S4, when the sufficiency constraint is not enforced (or the rationale lengths are constrained to the length of SIS rationales) in alternative methods, the rationales have significantly less predictive power, often not satisfying $f(x_S) \ge \tau$.

Figure S3: Length (number of bases) of rationales identified by various methods. Note that the sufficiency constraint ($f(x_S) \ge \tau$) is only enforced for SIS and Suff. IG. The lengths of IG, LIME, and Perturb. rationales are constrained to the length of SIS rationales.
Figure S4: Prediction on rationale only (all other bases masked) vs. rationale length (number of bases) for various methods in the TF binding task.

S2.3 Evaluation of the Quality of TF Rationales

Each rationale is padded with “N” (unknown) bases to the length of a full input sequence (101 bases) and optimally aligned with the known motif according to the likelihood criterion. (A JASPAR motif is a right stochastic matrix: the columns represent the A, C, G, T DNA bases and the rows positions in a DNA sequence, with each entry giving the marginal probability that the corresponding base is present at that position. The unknown base “N” receives uniform probability over A, C, G, T.) The aligned motif is then also padded to the same length, and we compute the divergence between the rationale and the known motif as:

$$D(M, R) = \frac{1}{101} \sum_{j=1}^{101} D_{\mathrm{KL}}(M_j \,\|\, R_j)$$

where $D_{\mathrm{KL}}(M_j \| R_j)$ is the Kullback-Leibler divergence from $R_j$ to $M_j$, and $M_j$ and $R_j$ are distributions over bases (A, C, G, T) at position $j$ of the motif and rationale, respectively. Note that as $M_j$ and $R_j$ become more dissimilar, $D(M, R)$ increases. We ensure each $R_j$ has full support so $D(M, R)$ is always finite.

S3 Details of the MNIST Analysis

S3.1 Dataset and Model

The MNIST database of handwritten digits contains 60k training images and 10k test images (LeCun et al., 1998). All images are 28x28 grayscale, and we normalize them such that all pixel values are between 0 and 1. We use the convolutional architecture provided in the Keras MNIST CNN example (http://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py). The architecture is as follows (a Keras sketch appears after the training details below):

  1. Input: (28 x 28 x 1) image, all values in $[0, 1]$

  2. Convolutional Layer 1: Applies 32 3x3 filters with ReLU activation

  3. Convolutional Layer 2: Applies 64 3x3 filters, with ReLU activation

  4. Pooling Layer 1: Performs max pooling with a 2x2 filter and dropout probability 0.25

  5. Dense Layer 1: 128 neurons, with ReLU activation and dropout probability 0.5

  6. Dense Layer 2: 10 neurons (one per digit class), with softmax activation

The Adadelta optimizer (Zeiler, 2012) is used to minimize cross-entropy loss on the training set. The final model achieves 99.7% accuracy on the train set and 99.1% accuracy on the held-out test set.
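For reference, a minimal Keras sketch of this architecture (mirroring the linked example; training hyperparameters beyond the optimizer and loss are omitted):

```python
from tensorflow import keras
from tensorflow.keras import layers

# The CNN described above: two conv layers, max pooling + dropout,
# then a dense head with dropout and a 10-way softmax.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.Adadelta(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```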

s3.2 Local Minima in Backward Selection

Figure S5: (a) Prediction on the remaining image as pixels are masked during backward selection, when our CNN classifier is fed the MNIST digit in (b). The dashed line depicts the threshold $\tau = 0.7$. (b) Original image (class 9). (c) SIS if backward selection were to terminate the first time the prediction on the remaining image drops below 0.7, corresponding to point C in (a) (the CNN predicts class 9 with probability 0.700 on this SIS). (d) Actual SIS produced by our FindSIS algorithm, corresponding to point D in (a) (the CNN predicts class 9 with probability 0.704 on this SIS).

Figure S5 shows an example MNIST digit for which there exists a local minimum in the backward selection phase of our algorithm to identify the initial SIS. If we were to terminate the backward selection as soon as the prediction drops below the decision threshold, the resulting SIS would be overly large, violating our minimality criterion. It is also evident from Figure S5 that the smaller-cardinality SIS in (d), found after the initial local optimum in (c), presents a more interpretable input pattern that enables better understanding of the core motifs influencing our classifier's decisions. To avoid such suboptimal results, it is important to run a complete backward selection sweep until the entire input is masked before building the SIS upward, as done in our SIScollection procedure.
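To make the two phases concrete, here is a schematic sketch (our simplification, not the released implementation): `f` is assumed to return the model's predicted probability for the class of interest on a 1-D feature array, and `mask_value` is the mean/uniform embedding used for masking.

```python
import numpy as np

def masked(x, i, mask_value):
    """Return a copy of x with feature i replaced by the mask value."""
    x = x.copy()
    x[i] = mask_value
    return x

def back_select(f, x, mask_value):
    """Backward selection sweep: repeatedly mask the remaining feature
    whose removal hurts the prediction least, until the entire input is
    masked. Returns feature indices ordered from most critical (removed
    last) to least critical (removed first)."""
    x_work = x.copy()
    remaining = list(range(len(x)))
    removal_order = []
    while remaining:
        best = max(remaining, key=lambda i: f(masked(x_work, i, mask_value)))
        x_work = masked(x_work, best, mask_value)
        remaining.remove(best)
        removal_order.append(best)
    return removal_order[::-1]

def find_sis(f, x, order, mask_value, tau):
    """Build the SIS upward: starting from a fully masked input, restore
    features (most critical first) until the prediction reaches tau."""
    x_sub = np.full_like(x, mask_value)
    sis = []
    for i in order:
        x_sub[i] = x[i]
        sis.append(i)
        if f(x_sub) >= tau:
            return sis
    return None  # even the full input does not reach tau
```

Because `back_select` always completes the full sweep, `find_sis` can restore the most critical features first and stop at the smallest prefix reaching the threshold, sidestepping the local optimum in Figure S5(c).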

s3.3 Energy Distance Between Image SIS

To cluster SIS from the image data, we compute the pairwise distance between two SIS subsets $S_1$ and $S_2$ as the energy distance (Székely and Rizzo, 2013) between the two distributions $p_{S_1}$ and $p_{S_2}$ over the image pixel coordinates that comprise each SIS:

$$ D(S_1, S_2) \;=\; 2\,\mathbb{E}\,\|X - Y\| \;-\; \mathbb{E}\,\|X - X'\| \;-\; \mathbb{E}\,\|Y - Y'\| $$

Here, $X$ is uniformly distributed over the pixel coordinates selected as part of the SIS subset $S_1$ (with $X'$ an i.i.d. copy of $X$), $Y$ and $Y'$ are defined analogously for $S_2$, and $\|\cdot\|$ denotes the Euclidean norm. Unlike a Euclidean distance between images, our usage of the energy distance takes into account distances between the similar pixel coordinates that comprise each SIS. The energy distance offers a more efficiently computable integral probability metric than the optimal transport distance, which has been widely adopted as an appropriate measure of distance between images.
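Since each distribution is uniform over finitely many pixels, the three expectations have exact closed forms. A sketch of the computation (function name is ours) for two SIS given as arrays of (row, col) pixel coordinates:

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(coords_a, coords_b):
    """Energy distance between uniform distributions over two sets of
    2-D pixel coordinates, each given as an (n, 2) float array."""
    e_ab = cdist(coords_a, coords_b).mean()   # E ||X - Y||
    e_aa = cdist(coords_a, coords_a).mean()   # E ||X - X'||
    e_bb = cdist(coords_b, coords_b).mean()   # E ||Y - Y'||
    return 2.0 * e_ab - e_aa - e_bb
```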

s3.4 SIS Clustering and Adversarial Analysis

We set the threshold $\tau = 0.7$ for SIS to ensure that the model is confident in its class prediction (the probability of the predicted class is at least 0.7). Almost all test examples initially have $f(x) \geq 0.7$ for the top class (Figure S6). We identify all test examples satisfying this condition and use SIScollection to identify all sufficient input subsets. The number of sufficient input subsets per digit is shown in Figure S7.

Figure S6: Number of examples per digit in the test set for which $f(x) \geq 0.7$ for the top class. The complete set of sufficient input subsets is computed for all of these examples.
Figure S7: Distributions of number of sufficient input subsets identified per image, by digit.

We apply our SIScollection algorithm to identify sufficient input subsets on MNIST test digits (Section 4.3). Examples of the complete SIS-collection corresponding to randomly chosen digits are shown in Figure S8. We also cluster all the sufficient input subsets identified for each class (Section 4.4), depicting the results in Figure S9.

Figure S8: Visualization of SIS-collections identified from MNIST digits that are confidently classified by the CNN; panels (a)-(j) show digits 0-9. For each class, six examples were chosen randomly. For each example, we show the original image (left) and the complete set of sufficient input subsets identified for that example (remaining images in each row). Each individual SIS satisfies $f(x_S) \geq 0.7$ for that class.
Figure S9: Clustering of all the SIS found for each digit under the CNN model (see Section 4.4); panels (a)-(j) show digits 0-9. Each row contains images drawn from one cluster. The bottom row ("Misc") contains a sample of miscellaneous SIS not assigned to any cluster by DBSCAN.

In Figure 8, we show an MNIST image of the digit 9 that has been adversarially perturbed so the CNN classifies it as a 4, along with the sufficient input subsets corresponding to the adversarial prediction. While visual inspection of the perturbed image does not reveal how it has been manipulated, the manipulation becomes immediately clear from the SIS-collection for the adversarial image. These sets show that the perturbation modifies pixels in such a way that input patterns similar to the typical SIS-collection for a 4 (Figure 8) become embedded in the image. The adversarial manipulation was done using the Carlini-Wagner (CW2) attack (Carlini and Wagner, 2017), implemented in the cleverhans library (Papernot et al., 2017), with a confidence parameter of 10. The CW2 attack seeks the minimal change to the image, with respect to the $\ell_2$ norm, that causes the image to be misclassified, and has been demonstrated to be one of the strongest extant adversarial attacks (Carlini and Wagner, 2017).

s3.5 Understanding Differences Between MNIST Classifiers

We use SIS and our clustering procedure to understand and visualize differences in features learned by two different models trained on the same MNIST digit classification task. In addition to the previously-described CNN model (see Section S3.1), we also trained a simple multilayer perceptron (MLP) on the same task. The MLP architecture is as follows:

  1. Input: 784-dimensional (flattened) image, all values in $[0, 1]$

  2. Dense Layer 1: 250 neurons, ReLU activation, and dropout probability 0.2

  3. Dense Layer 2: 250 neurons, ReLU activation, and dropout probability 0.2

  4. Dense Layer 3: 10 neurons (one per digit class), with softmax activation

As with the CNN, the Adadelta optimizer (Zeiler, 2012) is used to minimize cross-entropy loss on the training set. The final MLP model achieves 99.7% accuracy on the train set and 98.3% accuracy on the test set, which is close to the performance of the CNN (see Section S3.1).
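A corresponding minimal Keras sketch of the MLP (our illustration, not the released code):

```python
from tensorflow import keras
from tensorflow.keras import layers

# The MLP described above: two hidden layers of 250 units with dropout.
mlp = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(250, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(250, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),
])
mlp.compile(optimizer=keras.optimizers.Adadelta(),
            loss="categorical_crossentropy",
            metrics=["accuracy"])
```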

We apply the same procedure as in Section 4.3 to extract the SIS-collection from all applicable test images using the MLP. To understand differences between the feature patterns that each model has learned to associate with predicting each digit, we combine all SIS (from both models for a particular class) and run our clustering procedure (see Section 4.4 and Figure 9). In the resulting clustering, we list what percentage of the SIS in each cluster stem from the CNN vs. the MLP. Most clusters contain examples purely from a single model, indicating that the two models have learned to associate different feature patterns with the target class, chosen here to be the digit 4 (Figure 9).

For further comparison, we include clustering results for the SIS extracted from the MLP for digits 4 and 7 (Figure S10). Additionally, Figure S11 shows all of the SIS extracted from example digits of these classes when applying our procedure to the MLP.

Figure S10: Clustering of all the SIS identified by our method on digits 4 and 7 under the MLP model (see Section 4.4). Each row contains images drawn from one cluster. The bottom row ("Misc") contains a sample of miscellaneous SIS not assigned to any cluster by DBSCAN. Compare to the SIS-clustering from our CNN model (Figure S9).
Figure S11: Visualization of SIS-collections identified for MNIST digits 4 and 7 under the MLP model. For each class, six examples were chosen randomly. For each example, we show the original image (left) and the complete set of sufficient input subsets identified for that example (remaining images in each row). Note that each individual SIS satisfies $f(x_S) \geq 0.7$ for that class. Compare to the SIS extracted from our CNN (Figure S8).

S4 Details of the Beer Reviews Sentiment Analysis

s4.1 Beer Reviews Data Description

Following Lei et al. (2016), we use a preprocessed version of the BeerAdvocate (https://www.beeradvocate.com/) dataset (http://snap.stanford.edu/data/web-BeerAdvocate.html), which contains decorrelated numerical ratings toward three aspects: aroma, appearance, and palate (each normalized to $[0, 1]$). Dataset statistics can be found in Table S1. Reviews were tokenized by converting to lowercase and filtering punctuation, and we used a vocabulary containing the top 10,000 most common words. McAuley et al. (2012) also provide a subset of human-annotated reviews, in which humans manually selected the full sentences in each review that describe the relevant aspects. This annotated set was never seen during training and is used solely as part of our evaluation.

s4.2 Model Architecture and Training

Long short-term memory (LSTM) networks are commonly employed for natural language tasks such as sentiment analysis (Wang et al., 2016; Radford et al., 2017). We use a recurrent neural network (RNN) architecture with two stacked LSTMs, as follows:

  1. Input/Embeddings Layer: Sequence with 500 timesteps, the word at each timestep is represented by a (learned) 100-dimensional embedding

  2. LSTM Layer 1: 200-unit recurrent layer with LSTM (forward direction only)

  3. LSTM Layer 2: 200-unit recurrent layer with LSTM (forward direction only)

  4. Dense: 1 neuron (sentiment output), sigmoid activation

With this architecture, we use the Adam optimizer (Kingma and Ba, 2015) to minimize mean squared error (MSE) on the training set. We use a held-out set of 3,000 examples for validation (sampled at random from the pre-defined test set of Lei et al. (2016)). Our test set consists of the remaining 7,000 test examples. Training results are shown in Table S1.
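A minimal Keras sketch of this architecture (our illustration; training details such as batch size are omitted):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two stacked (unidirectional) LSTMs over learned word embeddings,
# with a single sigmoid output for the sentiment score.
lstm_model = keras.Sequential([
    layers.Input(shape=(500,)),                        # 500 word-id timesteps
    layers.Embedding(input_dim=10000, output_dim=100), # learned embeddings
    layers.LSTM(200, return_sequences=True),           # LSTM layer 1
    layers.LSTM(200),                                  # LSTM layer 2
    layers.Dense(1, activation="sigmoid"),
])
lstm_model.compile(optimizer="adam", loss="mse")
```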

Aspect Fold Size MSE Pearson
Appearance Train 80,000 0.016 0.864
Validation 3,000 0.024 0.783
Test 7,000 0.023 0.801
Annotation 994 0.020 0.563
Aroma Train 70,000 0.014 0.873
Validation 3,000 0.024 0.767
Test 7,000 0.025 0.756
Annotation 994 0.021 0.598
Palate Train 70,000 0.016 0.835
Validation 3,000 0.029 0.680
Test 7,000 0.028 0.694
Annotation 994 0.016 0.592
Table S1: Summary and performance statistics (mean squared error (MSE) and Pearson correlation coefficient $r$) for the beer reviews data and LSTM models.

s4.3 Imputation Strategies: Mean vs. Hot-deck

In Section 3, we discuss the problem of masking input features. Here, we show that the mean-imputation approach (in which a masked input is replaced with the mean embedding, taken over the entire vocabulary) produces nearly identical changes in prediction to a nondeterministic hot-deck approach (in which a masked input is replaced by a feature-value randomly sampled from the data). Figure S12 shows the change in prediction under both imputation techniques after drawing a training example and a word within it (both uniformly at random) and replacing that word with either the mean embedding or a randomly selected word (drawn from the vocabulary, based on counts in the training corpus). This procedure is repeated 10,000 times. Both resulting distributions have mean near zero, and the distribution for mean imputation is slightly narrower. We conclude that mean-imputation is a suitable method for masking information about particular feature values in our SIS analysis.

We also explored other options for masking word information, e.g. replacement with a zero embedding, replacement with the learned <PAD> embedding, and simply removing the word entirely from the input sequence, but each of these alternatives led to undesirably larger changes in predicted values as a result of masking, indicating that they appear more informative to the model than replacement via the feature-mean.
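A sketch of this comparison procedure (function names, the probability vector `word_probs`, and the exact data structures are our assumptions; `f` scores an embedded sequence):

```python
import numpy as np

rng = np.random.default_rng(0)

def imputation_deltas(f, sequences, embeddings, word_probs, n_trials=10000):
    """Compare mean imputation vs. hot-deck imputation. f scores an
    (n_words, emb_dim) embedded sequence; embeddings is the (vocab, emb_dim)
    embedding matrix; word_probs are training-corpus word frequencies."""
    mean_emb = embeddings.mean(axis=0)          # mean over the vocabulary
    d_mean, d_hot = [], []
    for _ in range(n_trials):
        x = sequences[rng.integers(len(sequences))].copy()
        i = rng.integers(len(x))                # random position to mask
        base = f(x)
        x_m = x.copy(); x_m[i] = mean_emb       # mean imputation
        d_mean.append(f(x_m) - base)
        w = rng.choice(len(embeddings), p=word_probs)
        x_h = x.copy(); x_h[i] = embeddings[w]  # hot-deck imputation
        d_hot.append(f(x_h) - base)
    return np.array(d_mean), np.array(d_hot)
```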

Figure S12: Change in prediction after masking a randomly chosen word with mean imputation or hot-deck imputation. 10,000 replacements were sampled from the aroma beer reviews training set.

s4.4 Feature Importance Scores

For each feature $x_i$ in the input sequence $x$, we quantify its marginal importance by individually perturbing only this feature:

$$ \text{FeatureImportance}(x_i) \;=\; f(x) - f(x_{\setminus i}) \qquad (1) $$

where $x_{\setminus i}$ denotes $x$ with feature $i$ masked (replaced by the mean embedding).

Note that these marginal Feature Importance scores are identical to those of the Perturb. method described in Section S1. The marginal Feature Importance scores are summarized in Table S2 and Figure S14. Compared to the Suff. IG and Suff. LIME methods, our SIScollection technique produces rationales that are much shorter and contain fewer irrelevant (i.e. not marginally important) features (Table S2, Figures S13 and S14). Note that by construction, the rationales of the Suff. Perturb. method contain the features with the greatest Feature Importance, since this is precisely how the ranking in Suff. Perturb. is defined.
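In code, this marginal score is one masked forward pass per feature (a sketch; `f` returns the model's prediction on a feature array and `mask_value` is the mean embedding):

```python
import numpy as np

def feature_importance(f, x, mask_value):
    """Marginal importance of each feature (Eq. 1): the drop in the
    model's prediction when only that feature is masked."""
    base = f(x)
    scores = np.empty(len(x))
    for i in range(len(x)):
        x_m = x.copy()
        x_m[i] = mask_value    # perturb only feature i
        scores[i] = base - f(x_m)
    return scores
```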

Method          Rationale Length (% of text)       Marginal Perturbed Feature Importance
                Med.    Max     p (vs. SIS)        Med. (Rationale)   Med. (Other)   p (vs. SIS)
SIS             3.9%    17.3%   -                  0.0112             1.50e-05       -
Suff. IG        7.7%    89.7%   5e-26              0.0068             1.85e-05       3e-42
Suff. LIME      7.2%    84.0%   4e-23              0.0075             1.87e-05       1e-35
Suff. Perturb.  5.1%    18.3%   1e-06              0.0209             1.90e-05       1e-72
Table S2: Statistics for rationale length and feature importance in aroma prediction. For rationale length, median and max indicate the percentage of input text in the rationale. For marginal perturbed feature importance, we indicate the median importance of features in rationales and of features from the other (non-rationale) text. $p$-values are computed using a Wilcoxon rank-sum test.
Figure S13: Importance of individual features in the rationales for aroma prediction in beer reviews
Figure S14: Length of rationales for aroma prediction

s4.5 Additional Results for Aroma aspect

We apply our method to the set of reviews containing sentence-level annotations. Note that these reviews (and the human annotations) were not seen during training. We choose an upper threshold $\tau_+$ for strong positive sentiment and a lower threshold $\tau_-$ for strong negative sentiment, and extract the complete set of sufficient input subsets using our method. Note that in our formulation above, we apply our method to inputs where $f(x) \geq \tau$. For the sentiment analysis task, we analogously apply our method for both $f(x) \geq \tau_+$ and $f(x) \leq \tau_-$, where the model predicts either strong positive or strong negative sentiment, respectively. These thresholds were set empirically, based on the distribution of predictions, such that they are sufficiently far apart (Figure S15). For most reviews, SIScollection outputs just one or two SIS sets (Figure S16).

Figure S15: Predictive distribution on the annotation set (held-out) using the LSTM model for aroma. Vertical lines indicate the decision thresholds ($\tau_+$, $\tau_-$) selected for SIScollection.
Figure S16: Number of sufficient input subsets for aroma identified by SIScollection per example.

We analyzed the predictor output following the elimination of each feature in the BackSelect procedure (Section 3). Figure S17 shows the LSTM output on the remaining unmasked text at each iteration of BackSelect, for all examples. This figure reveals that only a small number of features are needed by the model in order to make a strong prediction (most features can be removed without changing the prediction). We see that as those final, critical features are removed, there is a rapid, monotonic decrease in output values. Finally, we see that the first features to be removed by BackSelect are those which generally provide negative evidence against the decision.

Figure S17: Prediction history on remaining (unmasked) text at each step of the BackSelect procedure, for examples where aroma sentiment is predicted.
Figure S18: Beer reviews (aroma) in which human-selected sentences (underlined) align well (top) and poorly (bottom) with the predictive model. The fraction of SIS falling within the human-selected sentences corresponds accordingly. In the bottom example (poor alignment between the human selection and the predictive model), our procedure has surfaced a case where the LSTM has learned features that diverge from what a human would expect (which may suggest overfitting).
Cluster SIS #1 Freq. SIS #2 Freq. SIS #3 Freq. SIS #4 Freq.
smell amazing wonderful 2 nice wonderful nose 2 wonderful amazing 2 amazing amazing 2
grapefruit mango pineapple 2 pineapple grapefruit pineapple grapefruit 1 hops grapefruit pineapple floyds 1 mango pineapple incredible 1
nice smell citrus nice grapefruit taste 1 smell great complex ripe taste 1 nice smell nice hop smell pine taste 1 love nice nice smell bliss taste 1
fresh great fantastic taste 1 rich great fantastic hoped 1 fantastic cherries fantastic 1 everyone great snifters fantastic 1
awesome bounds 1 awesome grapefruit awesome 1 awesome awesome pleasing 1 awesome nailed nailed 1
creme brulee brulee 3 creme brulee decadent 1 incredible creme brulee 1 creme brulee exceptional 1
oak vanilla chocolate cinnamon vanilla oak love 1 dose oak chocolate vanilla acidic 1 vanilla figs oak thinner great 1 chocolate aroma oak vanilla dessert 1
Table S3: All clusters of sufficient input subsets extracted from reviews from the test set predicted to have positive aroma by the LSTM. Frequency indicates the number of occurrences of the SIS in the cluster.
Cluster SIS #1 Freq. SIS #2 Freq. SIS #3 Freq. SIS #4 Freq.
awful 15 skunky skunky 9 skunky t 7 skunky taste 6
garbage 3 taste garbage 1 garbage avoid 1 garbage rice 1
vomit 16 - - - - - -
gross rotten 1 rotten forte 1 awkward rotten 1 rotten offputting 1
rancid horrid 1 rancid t 1 rancid 1 rancid avoid 1
rice t rice 2 rice rice 1 rice tasteless 1 budweiser rice 1
Table S4: All clusters of sufficient input subsets extracted from reviews from the test set predicted to have negative aroma by the LSTM. Frequency indicates the number of occurrences of the SIS in the cluster. Dashes are used in clusters with under 4 unique SIS.

s4.6 Understanding Differences Between Sentiment Predictors

We demonstrate how our SIS-clustering procedure can be used to understand differences in the types of concepts considered important by different neural network architectures. In addition to the LSTM (see Section S4.2), we trained a convolutional neural network (CNN) on the same sentiment analysis task (on the aroma aspect). The CNN architecture is as follows:

  1. Input/Embeddings Layer: Sequence with 500 timesteps, the word at each timestep is represented by a (learned) 100-dimensional embedding

  2. Convolutional Layer 1: Applies 128 filters of window size 3 over the sequence, with ReLU activation

  3. Max Pooling Layer 1: Max-over-time pooling, followed by flattening, to produce a 128-dimensional representation

  4. Dense: 1 neuron (sentiment output), sigmoid activation

Note that a new set of embeddings was learned with the CNN. As with the LSTM model, we use the Adam optimizer (Kingma and Ba, 2015) to minimize MSE on the training set. For the aroma aspect, this CNN achieves MSE (and Pearson $r$) of 0.016 (0.850), 0.025 (0.748), 0.026 (0.741), and 0.014 (0.662) on the Train, Validation, Test, and Annotation sets, respectively. This performance is very similar to that of the LSTM (see Table S1).

We apply our procedure to extract the SIS-collection from all applicable test examples using the CNN, as in Section 4.1. Figure 10 shows the predictions from one model (LSTM or CNN) when fed input examples that are SIS extracted with respect to the other model (for reviews predicted to have positive sentiment toward the aroma aspect). For example, in Figure 10, "CNN SIS Preds by LSTM" refers to predictions made by the LSTM on the set of sufficient input subsets produced by applying our SIScollection procedure to all examples for which the CNN predicts strong positive sentiment ($f(x) \geq \tau_+$). (For experiments involving clustering and/or comparing different models, we use examples drawn from the Test fold rather than the Annotation fold (see Table S1) in order to consider a larger number of examples.) Since the word embeddings are model-specific, we embed each SIS using the embeddings of the model making the prediction (while the embeddings differ, the vocabulary is the same across the models).

In Table 2, we show five example clusters (and their composition) resulting from clustering the combined set of all sufficient input subsets extracted by the LSTM and CNN on test-set reviews for which a model predicts positive sentiment toward the aroma aspect. The complete clustering is shown in Table S5 for reviews receiving positive sentiment predictions and in Table S6 for reviews receiving negative sentiment predictions.

Cluster SIS #1 Freq. SIS #2 Freq. SIS #3 Freq. SIS #4 Freq.
(LSTM: 20%) rich chocolate 13 very rich 9 chocolate complex 5 smells rich 4
(LSTM: 21%) great 248 amazing 119 wonderful 112 fantastic 75
(LSTM: 47%) best smelling 23 pineapple mango 6 mango pineapple 6 pineapple grapefruit 5
(LSTM: 5%) excellent 42 excellent flemish flemish 1 excellent excellent phenomenal 1 - -
(LSTM: 33%) oak chocolate 2 chocolate raisins raisins oak bourbon 1 chocolate oak 1 raisins chocolate 1
(LSTM: 5%) goodness 19 watering goodness 1 - - - -
(LSTM: 24%) pumpkin pie 25 huge pumpkin aroma pumpkin pie 1 aroma perfect pumpkin pie taste 1 smell pumpkin nutmeg cinnamon pie 1
(LSTM: 5%) jd 13 tremendous 8 tremendous jd 1 - -
(LSTM: 40%) brulee 14 creme brulee brulee 3 creme creme 1 creme brulee amazing 1
(LSTM: 0%) s wow 20 - - - - - -
(LSTM: 0%) delicious 56 - - - - - -
(LSTM: 0%) very nice 23 - - - - - -
(LSTM: 70%) complex aroma 5 aroma complex peaches complex 1 aroma complex interesting cherries 1 aroma complex 1
Table S5: Joint clustering of the SIS extracted from beer reviews predicted to have positive aroma by LSTM or CNN model. Frequency indicates the number of occurrences of the SIS in the cluster. Percentages quantify SIS per cluster from the LSTM. Dashes are used in clusters with under 4 unique SIS.
Cluster SIS #1 Freq. SIS #2 Freq. SIS #3 Freq. SIS #4 Freq.
(LSTM: 29%) not 247 no 105 bad 104 macro 94
(LSTM: 100%) gross rotten 1 - - - - - -
(LSTM: 100%) rotten garbage 1 - - - - - -
(LSTM: 62%) vomit 26 - - - - - -
(LSTM: 21%) budweiser 22 sewage budweiser 1 metal budweiser 1 budweiser budweiser budweiser 1
(LSTM: 100%) garbage rice 1 - - - - - -
(LSTM: 3%) n’t 19 adjuncts 14 n’t adjuncts 1 - -
(LSTM: 0%) faint 82 - - - - - -
(LSTM: 0%) adjunct 42 - - - - - -
Table S6: Joint clustering of the SIS extracted from beer reviews predicted to have negative aroma by LSTM or CNN model. Frequency indicates the number of occurrences of the SIS in the cluster. Percentages quantify SIS per cluster from the LSTM. Dashes are used in clusters with under 4 unique SIS.

s4.7 Results for Appearance and Palate aspects

For completeness, we include results from repeating the analyses above for the two other aspects measured in the beer reviews data: appearance and palate.

Figure S19: Change in appearance prediction after masking a randomly chosen word with mean imputation or hot-deck imputation. 10,000 replacements were sampled from the appearance beer reviews training set.
Figure S20: Predictive distribution on the annotation set (held-out) using the LSTM model for appearance. Vertical lines indicate the decision thresholds ($\tau_+$, $\tau_-$) selected for SIScollection.
Figure S21: Number of sufficient input subsets for appearance identified by SIScollection per example.
Figure S22: Length of rationales for appearance prediction
Figure S23: Importance of individual features for appearance prediction in beer review
Method          Rationale Length (% of text)       Marginal Perturbed Feature Importance
                Med.    Max     p (vs. SIS)        Med. (Rationale)   Med. (Other)   p (vs. SIS)
SIS             2.6%    10.6%   -                  0.0183             1.72e-05       -
Suff. IG        3.7%    89.3%   2e-09              0.0184             2.41e-05       1e-02
Suff. LIME      3.7%    98.2%   8e-09              0.0167             2.38e-05       6e-09
Suff. Perturb.  3.0%    14.9%   9e-03              0.0339             2.51e-05       5e-44
Table S7: Statistics for rationale length and feature importance in appearance prediction. For rationale length, median and max indicate the percentage of input text in the rationale. For marginal perturbed feature importance, we indicate the median importance of features in rationales and of features from the other (non-rationale) text. $p$-values are computed using a Wilcoxon rank-sum test.
Figure S24: QHS vs. fraction of SIS in human rationale for appearance prediction
Figure S25: Prediction on rationales only vs. rationale length for various methods on positive-sentiment examples for appearance. The threshold for sufficiency was $\tau_+$.
Figure S26: Prediction history on remaining (unmasked) text at each step of the BackSelect procedure, for examples where appearance sentiment is predicted.
Cluster SIS #1 Freq. SIS #2 Freq. SIS #3 Freq. SIS #4 Freq.
beautiful 376 nitro 51 looks great 38 great looking 32
gorgeous 83 - - - - - -
beautifully 7 absolutely beautifully 2 beautifully pillowy 1 beautifully bands 1
brilliant 5 brilliant slowly 1 wonderfully brilliant 1 appearance brilliant 1
lovely looking 3 black lovely 3 impressive lovely 1 lovely crystal 1
Table S8: All clusters of sufficient input subsets extracted from reviews from the test set predicted to have positive appearance by the LSTM. Frequency indicates the number of occurrences of the SIS in the cluster. Dashes are used in clusters with under 4 unique SIS.
Cluster SIS #1 Freq. SIS #2 Freq. SIS #3 Freq. SIS #4 Freq.
piss 46 zero 38 water water 37 water 27
unappealing 12 floaties 12 floaties unappealing 1 - -
ugly 12 - - - - - -
Table S9: All clusters of sufficient input subsets extracted from reviews from the test set predicted to have negative appearance by the LSTM. Frequency indicates the number of occurrences of the SIS in the cluster. Dashes are used in clusters with under 4 unique SIS.
Figure S27: Change in palate prediction after masking a randomly chosen word with mean imputation or hot-deck imputation. 10,000 replacements were sampled from the palate beer reviews training set.
Figure S28: Predictive distribution on the annotation set (held-out) using the LSTM model for palate. Vertical lines indicate the decision thresholds ($\tau_+$, $\tau_-$) selected for SIScollection.
Figure S29: Number of sufficient input subsets for palate identified by SIScollection per example.
Figure S30: Length of rationales for palate prediction
Figure S31: Importance of individual features in beer review palate rationales
Method          Rationale Length (% of text)       Marginal Perturbed Feature Importance
                Med.    Max     p (vs. SIS)        Med. (Rationale)   Med. (Other)   p (vs. SIS)
SIS             2.4%    13.7%   -                  0.0210             -8.94e-07      -
Suff. IG        3.2%    56.1%   2e-06              0.0163             -9.54e-07      6e-10
Suff. LIME      3.0%    57.0%   7e-06              0.0173             -1.19e-06      2e-07
Suff. Perturb.  2.8%    11.8%   3e-03              0.0319             -1.25e-06      5e-26
Table S10: Statistics for rationale length and feature importance in palate prediction. For rationale length, median and max indicate the percentage of input text in the rationale. For marginal perturbed feature importance, we indicate the median importance of features in rationales and of features from the other (non-rationale) text. $p$-values are computed using a Wilcoxon rank-sum test.
Figure S32: QHS vs. fraction of SIS in human rationale for palate prediction
Figure S33: Prediction on rationales only vs. rationale length for various methods on positive-sentiment examples for palate. The threshold for sufficiency was $\tau_+$.
Figure S34: Prediction history on remaining (unmasked) text at each step of the BackSelect procedure, for examples where palate sentiment is predicted.
Cluster SIS #1 Freq. SIS #2 Freq. SIS #3 Freq. SIS #4 Freq.
smooth creamy 27 silky smooth 20 mouthfeel perfect 16 creamy perfect 12
mouthfeel exceptional 6 exceptional mouthfeel 4 - - - -
perfect 50 perfect perfect 6 - - - -
smooth velvety 6 velvety smooth 6 - - - -
silk 11 - - - - - -
smooth perfect 8 mouth smooth perfect 1 perfect smooth 1 - -
perfect great 5 great perfect 2 feels perfect 2 perfect feels great 1
Table S11: All clusters of sufficient input subsets extracted from reviews from the test set predicted to have positive palate by the LSTM. Frequency indicates the number of occurrences of the SIS in the cluster. Dashes are used in clusters with under 4 unique SIS.
Cluster SIS #1 Freq. SIS #2 Freq. SIS #3 Freq. SIS #4 Freq.
overcarbonated 12 mouthfeel overcarbonated 3 way overcarbonated 1 overcarbonated mouthfeel 1
watery 302 thin 238 flat 118 mouthfeel thin 33
too carbonation masks 1 too carbonation d 1 mouthfeel odd too too 1 too carbonated admire 1
lack carbonation 4 carbonation lack 4 carbonation hurts 2 issue lack hurts 1
Table S12: All clusters of sufficient input subsets extracted from reviews from the test set predicted to have negative palate by the LSTM. Frequency indicates the number of occurrences of the SIS in the cluster.
