SufficientInputSubsets
Code for Sufficient Input Subsets Paper
Local explanation frameworks aim to rationalize particular decisions made by a black-box prediction model. Existing techniques are often restricted to a specific type of predictor or based on input saliency, which may be undesirably sensitive to factors unrelated to the model's decision-making process. We instead propose sufficient input subsets that identify minimal subsets of features whose observed values alone suffice for the same decision to be reached, even if all other input feature values are missing. General principles that globally govern a model's decision-making can also be revealed by searching for clusters of such input patterns across many data points. Our approach is conceptually straightforward, entirely model-agnostic, simply implemented using instance-wise backward selection, and able to produce more concise rationales than existing techniques. We demonstrate the utility of our interpretation method on various neural network models trained on text, image, and genomic data.
The rise of neural networks and nonparametric methods in machine learning (ML) has driven significant improvements in prediction capabilities, while simultaneously earning the field a reputation of producing complex black-box models. Vital applications, which could benefit most from improved prediction, are often deemed too sensitive for opaque learning systems. Consider the widespread use of ML for screening people, including models that deny defendants’ bail (Kleinberg et al., 2018) or reject loan applicants (Sirignano et al., 2018). It is imperative that such decisions can be interpretably rationalized. Interpretability is also crucial in scientific applications, where it is hoped that general principles may be extracted from accurate predictive models (Doshi-Velez and Kim, 2017; Lipton, 2016).
One simple explanation for why a particular black-box decision is reached may be obtained via a sparse subset of the input features whose values form the basis for the model’s decision: a rationale. For text (or image) data, a rationale might consist of a subset of positions in the document (or image) together with the words (or pixel values) occurring at these positions (see Figures 2 and 8). To ensure interpretations remain fully faithful to an arbitrary model, our rationales do not attempt to summarize the (potentially complex) operations carried out within the model, and instead merely point to the relevant information it uses to arrive at a decision (Lei et al., 2016). For high-dimensional inputs, sparsity of the rationale is imperative for greater interpretability.
Here, we propose a local explanation framework to produce rationales for a learned model that has been trained to map inputs $x \in \mathcal{X}$ via some arbitrary learned function $f: \mathcal{X} \to \mathbb{R}$. Unlike many other interpretability techniques, our approach is not restricted to vector-valued data and does not require gradients of $f$. Rather, each input example $x$ is solely presumed to have a set of indexable features $x = [x_1, \dots, x_p]$, where each $x_i \in \mathbb{R}^d$ for $i \in [p] = \{1, \dots, p\}$. We allow for features that are unordered (set-valued input) and whose number $p$ may vary from input to input. A rationale corresponds to a sparse subset of these indices $S \subseteq [p]$ together with the specific values of the features in this subset.
To understand why a certain decision was made for a given input example $x$, we propose a particular rationale called a sufficient input subset (SIS). Each SIS consists of a minimal input pattern present in $x$ that alone suffices for $f$ to produce the same decision, even if provided no other information about the rest of $x$. Presuming the decision is based on $f(x)$ exceeding some pre-specified threshold $\tau \in \mathbb{R}$, we specifically seek a minimal-cardinality subset $S$ of the input features such that $f(x_S) \ge \tau$. Throughout, we use $x_S \in \mathcal{X}$ to denote a modified input example in which all information about the values of features outside subset $S$ has been masked, with features in $S$ remaining at their original values. Thus, each SIS characterizes a particular standalone input pattern that drives the model toward this decision, providing sufficient justification for this choice from the model’s perspective, even without any information on the values of the other features in $x$.
In classification settings, $f$ might represent the predicted probability of class $C$, where we decide to assign the input to class $C$ if $f(x) \ge \tau$, with $\tau$ chosen based on precision/recall considerations. Each SIS in such an application corresponds to a small input pattern that on its own is highly indicative of class $C$, according to our model. Note that by suitably defining $f$ and $\tau$ with respect to the predictor outputs, any particular decision for input $x$ can be precisely identified with the occurrence of $f(x) \ge \tau$, where higher values of $f$ are associated with greater confidence in this decision.
For a given input $x$ where $f(x) \ge \tau$, this work presents a simple method to find a complete collection of sufficient input subsets, each satisfying $f(x_S) \ge \tau$, such that there exists no additional SIS outside of this collection. Each SIS may be understood as a disjoint piece of evidence that would lead the model to the same decision, and why this decision was reached for $x$ can be unequivocally attributed to the SIS-collection. Furthermore, global insight on the general principles underlying the model’s decision-making process may be gleaned by clustering the types of SIS extracted across different data points (see Figures 8 and 9). Such insights allow us to compare models based not only on their accuracy, but also on human-determined relevance of the concepts they target. Our method’s simplicity facilitates its utilization by non-experts who may know very little about the models they wish to interrogate.
Certain neural network variants such as attention mechanisms (Sha and Wang, 2017) and the generator-encoder of Lei et al. (2016) have been proposed as powerful yet human-interpretable learners. Other interpretability efforts have tailored decompositions to certain convolutional/recurrent networks (Murdoch et al., 2018; Olah et al., 2017, 2018), but these approaches are model-specific and only suited for ML experts. Many applications necessitate a model outside of these families, either to ensure supreme accuracy, or if training is done separately with access restricted to a black-box API (Caruana et al., 2015; Tramer et al., 2016).
An alternative model-agnostic approach to interpretability produces local explanations of $f$ for a particular input $x$ (e.g. an individual classification decision). Popular local explanation techniques produce attribution scores that quantify the importance of each feature in determining the output of $f$ at $x$. Examples include LIME, which locally approximates $f$ (Ribeiro et al., 2016), saliency maps based on gradients (Baehrens et al., 2010; Simonyan et al., 2014), Layer-wise Relevance Propagation (Bach et al., 2015), as well as the discrete DeepLIFT approach (Shrikumar et al., 2017) and its continuous variant, Integrated Gradients (IG) (Sundararajan et al., 2017), developed to ensure attributions reflect the cumulative difference in $f$ at $x$ vs. a reference input. A separate class of input-signal-based explanation techniques such as DeConvNet (Zeiler and Fergus, 2014), Guided Backprop (Springenberg et al., 2015), and PatternNet (Kindermans et al., 2018) employ gradients of $f$ in order to identify input patterns that cause $f$ to output large values. However, many such gradient-based saliency methods have been found unreliable, depending not only on the learned function $f$, but also on its specific architectural implementation and how inputs are scaled (Kindermans et al., 2017, 2018). More similar to our approach are the recent techniques of Kim et al. (2018) and Chen et al. (2018), which also aim to identify input patterns that best explain certain decisions, but additionally require either a predefined set of such patterns or an auxiliary neural network trained to identify them.
In comparison with the aforementioned methods, our SIS approach presented here is conceptually simple, completely faithful to any type of model, requires no access to gradients of $f$, requires no additional training of the underlying model $f$, and does not require training any auxiliary explanation model. Also related to our subset-selection methodology are the ideas of Li et al. (2017) and Fong and Vedaldi (2017), which for a particular input example aim to identify a minimal subset of features whose deletion causes a substantial drop in $f$ such that a different decision would be reached. However, this objective can undesirably produce adversarial artifacts that are not easy to interpret (Fong and Vedaldi, 2017). In contrast, we focus on identifying disjoint minimal subsets of input features whose values suffice to ensure $f$ outputs significantly positive predictions, even in the absence of any other information about the rest of the input. While the techniques of Li et al. (2017) and Fong and Vedaldi (2017) produce rationales that remain strongly dependent on the rest of the input outside of the selected feature subset, each rationale revealed by our SIS approach is independently considered by $f$ as an entirely sufficient justification for a particular decision in the absence of other information.
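Our approach to rationalizing why a particular black-box decision is reached only applies to input examples $x$ that meet the decision criterion $f(x) \ge \tau$. For such an input $x$, we aim to identify a SIS-collection of disjoint feature subsets $S_1, \dots, S_K \subseteq [p] = \{1, \dots, p\}$ that satisfy the following criteria:
(1) $f(x_{S_k}) \ge \tau$ for each $k = 1, \dots, K$
(2) There exists no feature subset $S' \subset S_k$ for some $k = 1, \dots, K$ such that $f(x_{S'}) \ge \tau$
(3) $f(x_R) < \tau$ for $R = [p] \setminus \bigcup_{k=1}^{K} S_k$ (the remaining features outside of the SIS-collection)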
Criterion (1) ensures that for any SIS $S_k$, the values of the features in this subset alone suffice to justify the decision in the absence of any information regarding the values of the other features. To ensure information that is not vital to reach the decision is not included within the SIS, criterion (2) encourages each SIS to contain a minimal number of features, which facilitates interpretability. Finally, we require that our SIS-collection satisfies a notion of completeness via criterion (3), which states that the same decision is no longer reached for the input after the entire SIS-collection has been masked. This implies the remaining feature values of the input no longer contain sufficient evidence for the same decision. Figures 2 and 8 show SIS-collections found in text/image inputs.
Recall that $x_S$ denotes a modified input in which the information about the values of features outside subset $S$ is considered to be missing. We construct $x_S$ as a new input whose values on features in $S$ are identical to those in the original $x$, and whose remaining features $x_i$, $i \in [p] \setminus S$, are each replaced by a special mask $z_i$ used to represent a missing observation. While certain models are specially adapted to handle inputs with missing observations (Smola et al., 2005), this is generally not the case. To ensure our approach is applicable to all models, we draw inspiration from data imputation techniques, which are a common way to represent missing data (Rubin, 1976).
Two popular strategies include hot-deck imputation, in which unobserved values are sampled from their marginal feature distribution, and mean imputation, in which each $z_i$ is simply fixed to the average value of feature $i$ in the data. Note that for a linear model, these two strategies are expected to produce an identical change in prediction $f(x) - f(x_S)$. We find in practice that the change in predictions resulting from either masking strategy is roughly equivalent even for nonlinear models such as neural networks (Figure S12). In this work, we favor the mean-imputation approach over sampling-based imputation, which would be computationally expensive and nondeterministic (undesirable for facilitating interpretability). One may also view $z$ as the baseline input value used by feature attribution methods (Sundararajan et al., 2017; Shrikumar et al., 2017), a value which should not lead to particularly noteworthy decisions. Since our interests primarily lie in rationalizing atypical decisions, the average input arising from mean imputation serves as a suitable baseline. Zeros have also been used to mask image/categorical data (Li et al., 2017), but empirically, this mask appears undesirably more informative than the mean (predictions are more affected by zero-masking).
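The mean-imputation masking described above can be illustrated with a minimal Python sketch. It assumes scalar features stored in plain lists; the helper names `feature_means` and `mask_input` are hypothetical and are not taken from the paper's code.

```python
from statistics import mean


def feature_means(dataset):
    """Per-feature average over a dataset of equal-length numeric rows."""
    return [mean(column) for column in zip(*dataset)]


def mask_input(x, keep, mask):
    """Build the masked input: features in `keep` retain their observed
    values, every other feature is replaced by its mask value."""
    keep = set(keep)
    return [x[i] if i in keep else mask[i] for i in range(len(x))]


# Toy data (assumption): two training rows with three scalar features each.
data = [[1.0, 4.0, 0.0], [3.0, 0.0, 2.0]]
z = feature_means(data)
x_masked = mask_input([5.0, 7.0, 9.0], keep={1}, mask=z)
print(z, x_masked)
```

Here only feature 1 keeps its observed value; the other two positions fall back to their training averages, so the model sees no information about them beyond the baseline.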
For an arbitrarily complex function $f$ over inputs with many features $p$, the combinatorial search to identify sets which satisfy objectives (1)-(3) is computationally infeasible. To find a SIS-collection in practice, we employ a straightforward backward selection strategy, which is here applied separately on an example-by-example basis (unlike standard statistical tools which perform backward selection globally to find a fixed set of features for all inputs). The SIScollection algorithm details our straightforward procedure to identify disjoint SIS subsets that satisfy (1)-(3) approximately (as detailed in §3.1) for an input $x$ where $f(x) \ge \tau$.
Our overall strategy is to find a SIS subset $S_k$ (via BackSelect and FindSIS), mask it out, and then repeat these two steps, restricting each search for the next SIS solely to features disjoint from the currently found SIS-collection $S_1, \dots, S_k$, until the decision of interest is no longer supported by the remaining feature values. In the BackSelect procedure, $S \subseteq [p]$ denotes the set of remaining unmasked features that are to be considered during backward selection. For the current subset $S$, step 3 in BackSelect identifies which remaining feature $i \in S$ produces the minimal reduction $f(x_S) - f(x_{S \setminus \{i\}})$ (meaning it least reduces the output of $f$ if additionally masked), a question trivially answered by running each of the remaining possibilities through the model. This strategy aims to gradually mask out the least important features in order to reveal the core input pattern that is perceived by the model as sufficient evidence for its decision. Finally, we build our SIS up from the last $\ell$ features omitted during the backward selection, selecting a value of $\ell$ just large enough to meet our sufficiency criterion (1). Because this approach always queries a prediction over the joint set of remaining features $S$, it is better suited to account for interactions between these features and ensure their sufficiency (i.e. that $f(x_S) \ge \tau$) compared to a forward selection in the opposite direction, which builds the SIS upwards one feature at a time by greedily maximizing marginal gains. Throughout its execution, BackSelect attempts to maintain the sufficiency of $x_S$ as the set $S$ shrinks.
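The three procedures named above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: features are scalars in a list, `mask` holds the per-feature mask values, `f` is any callable on a list, and the function names `back_select`, `find_sis`, and `sis_collection` are hypothetical renderings of BackSelect, FindSIS, and SIScollection.

```python
def mask_input(x, keep, mask):
    """Keep original values on `keep`, use the mask value elsewhere."""
    return [x[i] if i in keep else mask[i] for i in range(len(x))]


def back_select(f, x, mask, features):
    """Repeatedly mask the feature whose removal leaves f highest.
    Returns the features in the order they were masked."""
    remaining = set(features)
    order = []
    while remaining:
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(remaining),
                   key=lambda i: f(mask_input(x, remaining - {i}, mask)))
        remaining.remove(best)
        order.append(best)
    return order


def find_sis(f, x, mask, tau, order):
    """Add features back, last-masked first, until f reaches tau."""
    stack = list(order)
    sis = set()
    while stack and f(mask_input(x, sis, mask)) < tau:
        sis.add(stack.pop())
    return sis


def sis_collection(f, x, mask, tau):
    """Extract disjoint subsets until the unmasked remainder no longer
    supports the decision f >= tau."""
    remaining = set(range(len(x)))
    collection = []
    while f(mask_input(x, remaining, mask)) >= tau:
        order = back_select(f, x, mask, remaining)
        sis = find_sis(f, x, mask, tau, order)
        if not sis:  # decision is reached even by the fully masked input
            break
        collection.append(sis)
        remaining -= sis
    return collection


# Toy linear model (assumption): weights chosen purely for illustration.
weights = [3.0, 1.0, 0.0, 2.0]
f = lambda v: sum(w * vi for w, vi in zip(weights, v))
collection = sis_collection(f, [1.0] * 4, mask=[0.0] * 4, tau=3.0)
print(collection)  # [{0}, {1, 3}]
```

In this toy case the first subset is the single heavily weighted feature, the second combines the two moderately weighted ones, and the zero-weight feature never appears, since masking it never changes the output.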
Given $p$ input features, our algorithm requires $O(p^2 k)$ evaluations of $f$ to identify $k$ SIS, but we can achieve $O(pk)$ by parallelizing each argmax in BackSelect (e.g. batching on GPU). Throughout, let $S_1, \dots, S_K$ denote the output of SIScollection when applied to a given input $x$ for which $f(x) \ge \tau$. Disjointness of these sets is crucial to ensure computational tractability and that the number of SIS per example does not grow huge and hard to interpret. Proposition 1 below proves that each SIS produced by our procedure will satisfy an approximate notion of minimality. Because we desire minimality of the SIS as specified by (2), it is not appropriate to terminate the backward elimination in BackSelect as soon as the sufficiency condition $f(x_S) \ge \tau$ is violated, due to the possible presence of local minima in $f$ along the path of subsets encountered during backward selection (as shown in Figure S5).
Proposition 2 additionally guarantees that masking out the entirety of the feature values in the SIS-collection will ensure the model makes a different decision. Given $f(x) \ge \tau$, it is thus necessarily the case that the observed values responsible for this decision lie within the SIS-collection $S_1, \dots, S_K$. We point out that for an easily reached decision, where $f(x_\emptyset) \ge \tau$ (i.e. this decision is reached even for the average input), our approach will not output any SIS. Because this same decision would likely be reached anyway for a vast number of inputs in the training data (as a sort of default decision), it is conceptually difficult to grasp what particular aspect of the given $x$ is responsible.
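Proposition 1. There exists no feature $i$ in any set $S_1, \dots, S_K$ that can be additionally masked while retaining sufficiency of the resulting subset (i.e. $f(x_{S_k \setminus \{i\}}) < \tau$ for any $k = 1, \dots, K$, $i \in S_k$). Also, among all subsets $S$ considered during the backward selection phase used to produce $S_k$, this set has the smallest cardinality of those which satisfy $f(x_S) \ge \tau$.
Proposition 2. For $x_{[p] \setminus S^*}$, modified by masking all features in the entire SIS-collection $S^* = \bigcup_{k=1}^{K} S_k$, we must have: $f(x_{[p] \setminus S^*}) < \tau$ when $S^* \neq [p]$.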
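The two guarantees above can be checked numerically for any candidate collection. The sketch below is illustrative only: it assumes scalar features in a list, reads the first guarantee as "each subset is sufficient and loses sufficiency when any single feature is masked", reads the second as "the features outside the collection are insufficient unless the collection covers every feature", and uses the hypothetical name `check_guarantees`.

```python
def mask_input(x, keep, mask):
    """Keep original values on `keep`, use the mask value elsewhere."""
    return [x[i] if i in keep else mask[i] for i in range(len(x))]


def check_guarantees(f, x, mask, tau, collection):
    """Return (single_feature_minimal, remainder_insufficient)."""
    # Each subset must be sufficient, and masking any one of its features
    # must drop the output below tau.
    minimal = all(
        f(mask_input(x, sis, mask)) >= tau
        and all(f(mask_input(x, sis - {i}, mask)) < tau for i in sis)
        for sis in collection
    )
    # Masking the whole collection must change the decision, unless the
    # collection already covers every feature.
    rest = set(range(len(x))) - set().union(*collection)
    complete = (not rest) or f(mask_input(x, rest, mask)) < tau
    return minimal, complete


# Toy linear model (assumption), with a zero mask and threshold 3.
weights = [3.0, 1.0, 0.0, 2.0]
f = lambda v: sum(w * vi for w, vi in zip(weights, v))
x, z = [1.0] * 4, [0.0] * 4
good = check_guarantees(f, x, z, 3.0, [{0}, {1, 3}])
bad = check_guarantees(f, x, z, 3.0, [{0, 2}])
print(good, bad)  # (True, True) (False, False)
```

The second collection fails both checks: feature 2 can be masked without losing sufficiency, and features 1 and 3 left outside the collection still reach the threshold on their own.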
Unfortunately, nice assumptions like convexity/submodularity are inappropriate for estimated functions in ML. We present various simple forms of practical decision functions for which our algorithms are guaranteed to produce desirable explanations. Example 1 considers interpreting functions of a generalized linear form, Examples 2 & 3 describe functions whose operations resemble generalized logical OR & AND gates, and Example 4 considers functions that seek out a particular input pattern. Note that features ignored by $f$ are always masked in our backward selection and thus never appear in the resulting SIS-collection.
Example 1. Suppose the input data are vectors $x \in \mathbb{R}^p$ and $f(x) = g(\beta^T x + \beta_0)$, where $g$ is monotonically increasing. We also presume $\tau > g(\beta_0)$ and that the data were centered such that each feature has mean zero (for ease of notation). In this case, $S_1, \dots, S_K$ must satisfy criteria (1)-(3). $S_1$ will consist of the features whose indices correspond to the largest $\ell$ entries of $\{\beta_1 x_1, \dots, \beta_p x_p\}$ for some suitable $\ell$ that depends on the value of $\tau$. It is also guaranteed that $f(x_{S_1}) \ge f(x_S)$ for any subset $S \subseteq [p]$ of the same cardinality $|S| = \ell$. For each individual feature $i$ where $g(\beta_i x_i + \beta_0) \ge \tau$, there will exist a corresponding SIS $S_k$ consisting only of $\{i\}$. No SIS will include features whose coefficient $\beta_i = 0$, or those whose difference between the observed and average value ($z_i = 0$ here) is of an opposite sign than the corresponding model coefficient (i.e. $\beta_i x_i < 0$).
Example 2. Let for some disjoint and functions , such that for the given and threshold : and for each . Such might be functions that model strong interactions between the features in each or look for highly specific value patterns to occur in these subsets. In this case, SIScollection will return sets such that .
Example 3. If and the same conditions from Example 2 are met, then SIScollection will return a single set .
Example 4. Suppose with where is monotonically decreasing and specifies a fixed pattern of input values for features in a certain subset . For input and threshold choice , SIScollection will return a single set .
We apply our methods to analyze neural networks for text, DNA, and image data. SIScollection is compared with alternative subset-selection methods for producing rationales (see descriptions in Supplement §S1). Note that our BackSelect procedure determines an ordering of elements, $R$, which is subsequently used to construct the SIS. Depictions of each SIS are shaded based on the feature order in $R$ (darker = later), which can indicate relative feature importance within the SIS.
In the “Suff. IG,” “Suff. LIME,” and “Suff. Perturb.” (sufficiency constrained) methods, we instead compute the ordering of elements $R$ according to the feature attribution values output by integrated gradients (Sundararajan et al., 2017), LIME (Ribeiro et al., 2016), or a perturbative approach that measures the change in prediction when individually masking each feature (see §S1). The rationale subset $S$ produced under each method is subsequently assembled using FindSIS exactly as in our approach and thus is guaranteed to satisfy $f(x_S) \ge \tau$. In the “IG,” “LIME,” and “Perturb.” (length constrained) methods, we use the same previously described ordering $R$, but always select the same number of features in the rationale as in the SIS produced by our method (per example). We also compare against the additional “Top IG” method, in which top features from $R$ are added into the rationale until the sum of integrated gradients attributions suggests that the rationale has met our sufficiency criterion (see §S1).

We first consider a dataset of beer reviews from McAuley et al. (2012). Taking the text of a review as input, different LSTM networks (Hochreiter and Schmidhuber, 1997) are trained to predict user-provided numerical ratings of aspects like aroma, appearance, and palate (details in §S4). Figure 2 shows a sample beer review where we highlight the SIS identified for the LSTM that predicts each aspect. Each SIS only captures sentiment toward the relevant aspect. Figure 2 depicts the SIS-collection identified from a review the LSTM decided to flag for positive aroma.
Figure 4 shows that when the alternative methods described in §4 are length constrained, the rationales they produce often badly fail to meet our sufficiency criterion. Thus, even though the same number of feature values are preserved in the rationale and these alternative methods select the features to which they have assigned the largest attribution values, their rationales lead to significantly reduced outputs compared to our SIS subsets. If the sufficiency constraint is instead enforced for these alternative methods, the rationales they identify become significantly larger than those produced by SIScollection, and also contain many more unimportant features (Table S2, Figure S14).
Benchmarking interpretability methods is difficult because a learned $f$ may behave counterintuitively such that seemingly unreasonable model explanations are in fact faithful descriptions of a model’s decision-making process. For some reviews, a human annotator has manually selected which sentences carry the relevant sentiment for the aspect of interest, so we treat these annotations as an alternative rationale for the LSTM prediction. For a review $x$ whose true and predicted aroma exceed our decision threshold, we define the quality of human-selected sentences for model explanation $QHS = f(x_S) - f(x_\emptyset)$, where $S$ is the human-selected subset of words in the review (see examples in Figure S18). High variability of $QHS$ in the annotated reviews (Figure 4) indicates the human rationales often do not contain sufficient information to preserve the LSTM’s decision. Figure 4 shows the LSTM makes many decisions based on different subsets of the text than the parts that humans find appropriate for this task. Reassuringly, our SIS more often lie within the selected annotation for reviews with high $QHS$ scores.
We next analyze convolutional neural networks (CNN) used to classify whether a given transcription factor (TF) will bind to a specific DNA sequence (Zeng et al., 2016). From 422 different datasets of DNA sequences bound-or-not by different TFs (and 422 different CNN models), we extract SIS-collections from sequences with high (top 10%) predicted binding affinity for the TF profiled in each dataset (details in §S2). Figure 6 depicts two input examples and the corresponding identified SIS. Again, rationales produced via our SIS approach are shorter and better at preserving large values of $f$ than rationales from other methods (Figures S4 and S4).
To predict binding so accurately, the CNN must faithfully reflect the biological mechanisms that relate the DNA sequence to the probability of TF occupancy. We evaluate the rationales found by our methods against known TF binding motifs from JASPAR (Mathelier et al., 2015), adopting KL divergence between the known motif and each proposed rationale as a quality measure (see §S2.3). Figure 6 shows the divergence of rationales produced by SIScollection is significantly lower than that of rationales identified using other methods (Wilcoxon test in all cases). SIS is thus more effective at uncovering the underlying biological principles than the alternative methods we applied.
Finally, we study a 10-way CNN classifier trained on the MNIST handwritten digits data (LeCun et al., 1998). Here, we only consider predicted probabilities for one class of interest at a time and always set as the probability threshold for deciding that an image belongs to the class. We extract the SIS-collection from all corresponding test set examples (details in §S3). Example images and corresponding SIS-collections are shown in Figures 8 and S8. Figure 8a illustrates how the SIS-collection drastically changes for an example of a correctly classified 9 that has been adversarially manipulated (Carlini and Wagner, 2017) to become confidently classified as the digit 4. Furthermore, these SIS-collections immediately enable us to understand why certain misclassifications occur (Figure 8b).
Identifying the different input patterns that justify a decision can help us better grasp the general operating principles of a model. To this end, we cluster all of the SIS produced by SIScollection applied across a large number of data examples that received the same decision. Clustering is done via DBSCAN, a widely applicable algorithm that merely requires specifying pairwise distances between points (Ester et al., 1996).
We first apply this procedure to the SIS found across all test-set DNA sequences which our CNN model predicted would be bound by some TF. Here, the pairwise distance between two sufficient input subsets is taken to be the Levenshtein (edit) distance. Figure 6 shows the clusters for a particular TF where two SIS clusters were found. Despite no contiguity being enforced in our algorithm, each cluster consists of short sequences that clearly capture different aspects of the underlying DNA motif known to bind this TF.
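The pairwise-distance step for DNA can be sketched with a standard dynamic-programming edit distance. This is an illustrative sketch: representing each SIS as the string of bases at its selected positions is an assumption, the example sequences are made up, and the names `levenshtein` and `pairwise_distances` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insertions, deletions,
    and substitutions each cost 1), using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def pairwise_distances(items, dist):
    """Full symmetric distance matrix as a list of lists."""
    return [[dist(a, b) for b in items] for a in items]


# Made-up SIS strings, for illustration only.
sis_strings = ["GATAA", "GATA", "TTGCA"]
matrix = pairwise_distances(sis_strings, levenshtein)
print(matrix[0][1])  # 1
```

A matrix of this form is the kind of input a density-based clustering routine accepts when configured for precomputed distances, which is why only the distance function needs to change between data types.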
We also apply DBSCAN clustering to the SIS found across all MNIST test examples confidently identified by the CNN as a particular class. Pairwise distances are here defined as the energy distance (Rizzo and Székely, 2016) over pixel locations between two SIS subsets (see §S3.3). Figure 8 depicts the SIS clusters identified for digit 4 (others in Figure S9). These reveal distinct feature patterns learned by the CNN to distinguish 4 from other digits, which are clearly present in the vast majority of test set images confidently classified as a 4. For example, one cluster depicts parallel slanted lines, a pattern that never occurs in other digits.
Subsequently, we cluster the SIS found across held-out beer reviews (Test-Fold in Table S1) that received positive aroma predictions from our LSTM network. The distance between two SIS is taken as the Jaccard distance between their bag-of-words representations. Three clusters depicted in Table 2 (rest in Tables S3, S4) reveal isolated phrases that the LSTM associates with positive aromas in the absence of other context.
Clu. | SIS #1 | SIS #2 | SIS #3 | SIS #4
 | smell amazing wonderful | nice wonderful nose | wonderful amazing | amazing amazing
 | grapefruit mango pineapple | pineapple grapefruit pineapple grapefruit | hops grapefruit pineapple floyds | mango pineapple incredible
 | creme brulee brulee | creme brulee decadent | incredible creme brulee | creme brulee exceptional
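The Jaccard distance used for the beer-review SIS can be sketched as below. This is a minimal sketch that treats a bag of words as the set of distinct words (an assumption; a count-based variant is also conceivable), with the hypothetical function name `jaccard_distance`. The example phrases are taken from the table above.

```python
def jaccard_distance(words_a, words_b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of distinct words."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0  # two empty bags are treated as identical
    return 1.0 - len(a & b) / len(a | b)


near = jaccard_distance("smell amazing wonderful".split(),
                        "wonderful amazing".split())
far = jaccard_distance("smell amazing wonderful".split(),
                       "grapefruit mango pineapple".split())
print(near, far)
```

Phrases sharing most of their vocabulary end up close together (two of three distinct words shared in the first pair), while phrases with no words in common are at the maximum distance of 1.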
Clu. | LSTM | SIS #1 | SIS #2 | SIS #3 | SIS #4
 | 0% | delicious | | |
 | 0% | very nice | | |
 | 20% | rich chocolate | very rich | chocolate complex | smells rich
 | 33% | oak chocolate | chocolate raisins raisins oak bourbon | chocolate oak | raisins chocolate
 | 70% | complex aroma | aroma complex peaches complex | aroma complex interesting cherries | aroma complex
The general insights revealed by our SIS-clustering can also be used to compare the operating behavior of different models. For the beer reviews, we also train a CNN to compare with our existing LSTM (see §S4.6). For MNIST, we train a multilayer perceptron (MLP) and compare to our existing CNN (see §S3.5). Both networks exhibit similar performance in each task, so it is not immediately clear which model would be preferable in practice. Figure 10 shows the SIS extracted under one model are typically insufficient to receive the same decision from the other model, indicating these models base their positive predictions on different evidence.
Figure 9 depicts results from a joint clustering of all SIS extracted from held-out MNIST images confidently classified as a 4 by either the MLP or CNN. Evidently, our MNIST CNN bases its confidence primarily on spatially contiguous strokes comprising only a small portion of each digit. MLP decisions are in contrast based on pixels located throughout the digit, demonstrating this model relies more on the global shape of the handwriting. Thus, the CNN is more susceptible to mistaking other (non-digit) handwritten characters for 4s if they happen to share some of the same strokes. Table 2 contains results of jointly clustering the SIS extracted from beer reviews with positive aroma predictions under our LSTM or text CNN. This CNN tends to learn localized (unigram/bigram) word patterns, while the LSTM identifies more complex multi-word interactions that truly seem more relevant to the target aroma value. Many CNN SIS are simply phrases with universally positive sentiment, indicating this model is less capable at distinguishing between positive sentiment toward aroma vs. other aspects such as taste/look.
This work introduced the idea of interpreting black-box decisions on the basis of sufficient input subsets: minimal input patterns that alone provide sufficient evidence to justify a particular decision. Our methodology is easy to understand for non-experts, applicable to all ML models without any additional training steps, and remains fully faithful to the underlying model without making approximations. While we focus on local explanations of a single decision, clustering the SIS patterns extracted from many data points reveals insights about a model’s general decision-making process. Given multiple models of comparable accuracy, SIS-clustering can uncover critical operating differences, such as which model is more susceptible to spurious training data correlations or will generalize worse to counterfactual inputs that lie outside the data distribution.
We thank Haoyang Zeng and Ge Liu for helping with the TF data/models. This work was supported by NIH Grants R01CA218094, R01HG008363, and R01HG008754.
Supplementary Information for: What made you do this? Understanding black-box decisions with sufficient input subsets
In Section 3, we describe a number of alternative methods for identifying rationales for comparison with our method. We use methods based on integrated gradients (Sundararajan et al., 2017), LIME (Ribeiro et al., 2016), and feature perturbation. Note that integrated gradients is an attribution method which assigns a numerical score to each input feature. LIME likewise assigns a weight to each feature using a local linear regression model for $f$ around $x$. In the perturbative approach, we compute the change in prediction when each feature is individually masked, as in Equation 1 (of Section S4.4). Each of these feature orderings is used to construct a rationale using the FindSIS procedure (Section 3) for the “Suff. IG,” “Suff. LIME,” and “Suff. Perturb.” (sufficiency constrained) methods.
Note that our text classification architecture (described in Section S4.2) encodes discrete words as 100-dimensional continuous word embeddings. The integrated gradients method returns attribution scores for each coordinate of each word embedding. For each word embedding $x_i$ (where each $x_i \in \mathbb{R}^{100}$), we summarize the attributions along the corresponding embedding into a single score using a norm of the attribution vector, and compute the ordering by sorting these scores.
We use an implementation of integrated gradients for Keras-based models from https://github.com/hiranumn/IntegratedGradients. In the case of the beer review dataset (Section 4.1), we use the mean embedding vector as a baseline for computing integrated gradients. In the case of TF binding (Section 4.2), we use the uniform mean vector as the baseline reference value. As suggested in Sundararajan et al. (2017), we verified that the prediction at the baseline and the integrated gradients sum to approximately the prediction of the input.
For LIME and our beer reviews dataset, we use the approach described in Ribeiro et al. (2016) for textual data, where individual words are removed entirely from the input sequence.
In our TF binding dataset, LIME replaces bases with the unknown N
base (represented as the uniformdistribution
). We use the implementation of LIME at: https://github.com/marcotcr/lime. TheLimeTextExplainer
module is used with default parameters, except we set the maximal number of features used in the regression to be the full input length so we can order all input features.
Additionally, we explore methods that reuse the feature orderings produced by these alternative methods but fix the number of input features in the rationale to the median SIS length in the SIScollection computed by our method on each example: the “IG,” “LIME,” and “Perturb.” (length constrained) methods. In the TF binding models, we use a baseline of zero vectors such that the integrated gradients result along the encoded sequence is also one-hot. We compute the feature ordering based on the absolute value of the non-zero integrated gradient attributions.
In TF binding data (Section 4.2), we add an additional method, “Top IG,” in which we compute integrated gradients using an all-zeros baseline and order features by attribution magnitude (as in the length constrained IG method). However, we select elements for the rationale by finding the minimum number of elements necessary such that the sum of integrated gradients of those features reaches a target value defined with respect to the all-zeros baseline for integrated gradients. Note that for the length constrained and Top IG methods, there is no guarantee of sufficiency for any input subset.
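A rough sketch of the Top IG selection rule follows. The target value is passed in as a plain argument because its exact definition is not reproduced here. The function name is hypothetical, and stopping once the running sum reaches (rather than exactly equals) the target is an assumption.

```python
def top_attribution_subset(attributions, target):
    """Pick the fewest features, in order of attribution magnitude,
    whose attributions sum to at least `target` (assumed stopping rule)."""
    order = sorted(range(len(attributions)),
                   key=lambda i: abs(attributions[i]), reverse=True)
    chosen, total = [], 0.0
    for i in order:
        if total >= target:
            break
        chosen.append(i)
        total += attributions[i]
    return chosen


# Toy attributions for four features and a hypothetical target of 0.7.
toy_subset = top_attribution_subset([0.1, 0.5, -0.05, 0.3], target=0.7)
```

Because the rule looks only at attribution sums and never re-queries the model, the selected subset carries no sufficiency guarantee, which matches the note above.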
We use the motif occupancy datasets (available at http://cnn.csail.mit.edu) from zeng2016convolutionalsi, where each dataset originates from a ChIP-seq experiment from the ENCODE project encode2012integratedsi.
Each of the 422 datasets studies a particular transcription factor, containing between 600 and 700,000 (median 50,000) 101 base-pair DNA sequences (inputs), each associated with a binary label based on whether the sequence is bound by the TF or not.
Each dataset also contains a test set ranging between 150 and 170,000 sequences (median 12,000).
Here, the positive and negative classes in each dataset are balanced, and we filter out all sequences containing the unknown base (N).
The nucleotide occurring at each base position (A, C, G, T) is encoded as a one-hot representation which is fed into the CNN. zeng2016convolutionalsi showed that convolutional neural network architectures outperform other models for this TF binding prediction task.
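As a small illustration of this encoding, the sketch below maps a DNA string to one row per base. The A, C, G, T column order and the function name are assumptions. The unknown base N is mapped to a uniform row, as is done when masking.

```python
BASES = "ACGT"  # assumed column order


def encode_dna(sequence):
    """One-hot encode a DNA string; 'N' becomes a uniform row of 0.25."""
    rows = []
    for base in sequence.upper():
        if base == "N":
            rows.append([0.25] * len(BASES))
        elif base in BASES:
            rows.append([1.0 if base == b else 0.0 for b in BASES])
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return rows


toy_encoding = encode_dna("ACN")
```

A full input in this setting would be a 101-row matrix with 4 columns.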
For each of the 422 prediction tasks, we employ the best-performing “1layer_128motif” architecture from zeng2016convolutionalsi, defined as follows:
Input: (101 x 4) sequence encoding
Convolutional Layer 1: Applies 128 kernels of window size 24, with ReLU activation
Global Max Pooling Layer 1: Performs global max pooling
Dense Layer 1: 32 neurons, with ReLU activation and dropout probability 0.5
Dense Layer 2: 1 neuron (output probability), with sigmoid activation
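To give a sense of the size of this architecture, the arithmetic below counts its trainable parameters. It assumes an unpadded ("valid") convolution and a bias term in every layer, neither of which is stated above. The helper names are illustrative.

```python
def conv1d_params(kernels, window, channels):
    """Weights plus one bias per kernel for a 1D convolution."""
    return kernels * (window * channels + 1)


def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return units * (inputs + 1)


# Output length of the convolution over 101 positions (assumes no padding).
conv_output_length = 101 - 24 + 1

conv_count = conv1d_params(kernels=128, window=24, channels=4)
dense1_count = dense_params(inputs=128, units=32)  # global max pooling yields 128 values
dense2_count = dense_params(inputs=32, units=1)
total_params = conv_count + dense1_count + dense2_count
```

Under these assumptions the model has roughly 16.6k parameters, most of them in the convolutional kernels.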
We hold out 1/8 of each train set for validation and minimize binary cross-entropy using the Adadelta optimizer adadeltasi with default parameter settings in Keras kerassi. We train a model on each of the 422 datasets for 10 epochs (using batch size 128) with early stopping based on validation loss. Figure S2 shows the area under the receiver operating characteristic curve (AUC) over the 422 datasets, and we note that the performance of our models closely resembles that in zeng2016convolutionalsi.
For each dataset, we define the sufficiency threshold as the 90th percentile of the predictive distribution on all test sequences. The distribution of thresholds is shown in Figure S2. We compute the complete set of sufficient input subsets for each corresponding test sequence. Since A, C, G, T nucleotides all occur with similar frequency in this data, our SIS analysis simply masks each base using a uniform embedding. This is also the standard strategy to represent unknown “N” nucleotides in DNA sequences, which typically arise from issues in read quality. We generally find that there is only a single SIS per example for the sequences in these datasets.
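The percentile-based threshold can be sketched as below. The nearest-rank percentile definition is an assumption (other interpolation schemes give slightly different values), the predictions are made up for illustration, and the rule that an example qualifies when its prediction is at or above the threshold is also assumed.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical model outputs on ten test sequences.
toy_predictions = [0.05, 0.2, 0.35, 0.4, 0.55, 0.6, 0.75, 0.8, 0.9, 0.97]
toy_threshold = percentile_nearest_rank(toy_predictions, 90)

# Indices of examples whose prediction meets the threshold.
toy_selected = [i for i, p in enumerate(toy_predictions) if p >= toy_threshold]
```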
On each dataset, we compute the median rationale length (as number of bases in the rationale). The distribution of median rationale length over all datasets by various methods is shown in Figure S4. Note that for the IG, LIME, and Perturb. methods, rationale length was constrained to the length of the rationales produced by our method. For the Top IG method, neither sufficiency nor length constraints are enforced. We see that when the sufficiency constraint is enforced in alternative methods (Suff. IG), the rationales are significantly longer than those identified by SIS. Moreover, as shown in Figure S4, when the sufficiency constraint is not enforced (or the rationale lengths are constrained to the length of SIS rationales) in alternative methods, the rationales have significantly less predictive power, often not satisfying the sufficiency threshold.
Each rationale is padded with “N” (unknown) bases to the length of a full input sequence (101 bases) and optimally aligned with the known motif according to the likelihood criterion. (Footnote 2: A JASPAR motif is a right stochastic matrix whose columns represent the ACGT DNA bases and whose rows represent positions in a DNA sequence. Each entry gives the marginal probability of a base being present at a position. The unknown base “N” receives uniform probability for each of ACGT.) The aligned motif is then also padded to the same length, and we compute the divergence between the rationale and the known motif from the Kullback-Leibler divergence between their distributions over bases (A, C, G, T) at each position.
Note that the divergence increases as the two distributions become more dissimilar.
We ensure that the divergence is always finite.
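A minimal sketch of such a position-wise divergence is given below. Several details are assumptions rather than the authors' definition: the direction (rationale relative to motif), averaging over positions, and the small pseudocount added to the motif to keep every term finite.

```python
import math


def kl_divergence(p, q):
    """KL divergence between two discrete distributions (q must be positive wherever p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def smooth(distribution, eps=1e-6):
    """Add a pseudocount and renormalize so no probability is zero (assumed fix)."""
    shifted = [value + eps for value in distribution]
    total = sum(shifted)
    return [value / total for value in shifted]


def rationale_motif_divergence(rationale, motif):
    """Mean per-position KL divergence between aligned, equal-length matrices."""
    terms = [kl_divergence(p, smooth(q)) for p, q in zip(rationale, motif)]
    return sum(terms) / len(terms)


uniform = [0.25, 0.25, 0.25, 0.25]
toy_rationale = [[1.0, 0.0, 0.0, 0.0], uniform]  # base A followed by padding N
toy_motif = [uniform, uniform]
toy_divergence = rationale_motif_divergence(toy_rationale, toy_motif)
```

Padded positions on both sides are uniform and therefore contribute zero, so only positions where the rationale or motif is informative affect the result.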
The MNIST database of handwritten digits contains 60k training images and 10k test images mnistsi. All images are 28x28 grayscale, and we normalize them such that all pixel values are between 0 and 1. We use the convolutional architecture provided in the Keras MNIST CNN example (http://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py). The architecture is as follows:
Input: (28 x 28 x 1) image, all values in [0, 1]
Convolutional Layer 1: Applies 32 3x3 filters with ReLU activation
Convolutional Layer 2: Applies 64 3x3 filters, with ReLU activation
Pooling Layer 1: Performs max pooling with a 2x2 filter and dropout probability 0.25
Dense Layer 1: 128 neurons, with ReLU activation and dropout probability 0.5
Dense Layer 2: 10 neurons (one per digit class), with softmax activation
The Adadelta optimizer adadeltasi is used to minimize cross-entropy loss on the training set. The final model achieves 99.7% accuracy on the train set and 99.1% accuracy on the held-out test set.
Figure S5 demonstrates an example MNIST digit for which there exists a local minimum in the backward selection phase of our algorithm to identify the initial SIS. Note that if we were to terminate the backward selection as soon as predictions drop below the decision threshold, the resulting SIS would be overly large, violating our minimality criterion. It is also evident from Figure S5 that the smaller-cardinality SIS in (d), found after the initial local optimum in (c), presents a more interpretable input pattern that enables better understanding of the core motifs influencing our classifier’s decisions. To avoid suboptimal results, it is important to run a complete backward selection sweep until the entire input is masked before building the SIS upward, as done in our SIScollection procedure.
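The full sweep followed by upward construction can be sketched as below. This is a simplified reading of the procedure, not the authors' implementation. It assumes that backward selection greedily masks whichever feature leaves the prediction highest, and that the SIS is then built by restoring features in reverse removal order until the threshold is met. The toy linear model, mask value, and function names are illustrative only.

```python
def back_select(f, x, mask_value):
    """Mask features one at a time until all are masked; return the removal order."""
    current = list(x)
    remaining = list(range(len(x)))
    removed = []
    while remaining:
        def prediction_without(i):
            trial = list(current)
            trial[i] = mask_value
            return f(trial)
        # Remove the feature whose masking hurts the prediction least.
        best = max(remaining, key=prediction_without)
        current[best] = mask_value
        remaining.remove(best)
        removed.append(best)
    return removed


def find_sis(f, x, mask_value, tau):
    """Restore features in reverse removal order until f reaches tau."""
    masked = [mask_value] * len(x)
    subset = []
    for i in reversed(back_select(f, x, mask_value)):
        masked[i] = x[i]
        subset.append(i)
        if f(masked) >= tau:
            return sorted(subset)
    return None  # the full input never reaches the threshold


def toy_model(v):
    return 0.6 * v[0] + 0.3 * v[1] + 0.06 * v[2] + 0.04 * v[3]


toy_removal_order = back_select(toy_model, [1, 1, 1, 1], mask_value=0)
toy_sis = find_sis(toy_model, [1, 1, 1, 1], mask_value=0, tau=0.85)
```

Because the sweep always runs to completion, the subset is assembled from the features removed last, rather than from whatever remained when the prediction first dipped below the threshold.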
To cluster SIS from the image data, we compute the pairwise distance between two SIS subsets $S_1$ and $S_2$ as the energy distance energydistancesi between two distributions over the image pixel coordinates that comprise the SIS, $X_1$ and $X_2$:
$$d(S_1, S_2) = 2\,\mathbb{E}\,\|X_1 - X_2\| - \mathbb{E}\,\|X_1 - X_1'\| - \mathbb{E}\,\|X_2 - X_2'\|$$
Here, $X_i$ is uniformly distributed over the pixels that are selected as part of the SIS subset $S_i$, $X_i'$ is an i.i.d. copy of $X_i$, and $\|\cdot\|$ represents the Euclidean norm. Unlike a Euclidean distance between images, our usage of the energy distance takes into account distances between the similar pixel coordinates that comprise each SIS. The energy distance offers a more efficiently computable integral probability metric than the optimal transport distance, which has been widely adopted as an appropriate measure of distance between images.
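For concreteness, the energy distance between two sets of pixel coordinates can be computed as in the sketch below, using its standard definition with uniform weights over the selected pixels. The function names and the toy coordinates are illustrative.

```python
import math
from itertools import product


def mean_pairwise_distance(points_a, points_b):
    """Average Euclidean distance over all pairs drawn from the two sets."""
    total = sum(math.dist(p, q) for p, q in product(points_a, points_b))
    return total / (len(points_a) * len(points_b))


def energy_distance(sis_a, sis_b):
    """Energy distance between uniform distributions over two pixel sets."""
    return (2 * mean_pairwise_distance(sis_a, sis_b)
            - mean_pairwise_distance(sis_a, sis_a)
            - mean_pairwise_distance(sis_b, sis_b))


# Two toy SIS given as (row, column) pixel coordinates.
toy_sis_a = [(0, 0), (0, 1)]
toy_sis_b = [(5, 5), (5, 6)]
toy_distance = energy_distance(toy_sis_a, toy_sis_b)
```

The resulting pairwise distances could then be passed to any clustering routine that accepts a precomputed distance matrix.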
We set the threshold for SIS to 0.7 to ensure that the model is confident in its class prediction (probability of the predicted class is at least 0.7). Almost all test examples initially meet this threshold for the top class (Figure S7). We identify all test examples that satisfy this condition and use SIS to identify all sufficient input subsets. The number of sufficient input subsets per digit is shown in Figure S7.
We apply our SIScollection algorithm to identify sufficient input subsets on MNIST test digits (Section 4.3). Examples of the complete SIScollection corresponding to randomly chosen digits are shown in Figure S8. We also cluster all the sufficient input subsets identified for each class (Section 4.4), depicting the results in Figure S9.
In Figure 8, we show an MNIST image of the digit 9, adversarially perturbed to 4, and the sufficient subsets corresponding to the adversarial prediction. Although a visual inspection of the perturbed image does not reveal exactly how it has been manipulated, the manipulation becomes immediately clear from the SIScollection for the adversarial image. These sets show that the perturbation modifies pixels in such a way that input patterns similar to the typical SIScollection for a 4 (Figure 8) become embedded in the image. The adversarial manipulation was done using the Carlini-Wagner (CW2) attack carlini2017towardssi (implemented in the cleverhans library of papernot2017cleverhans) with a confidence parameter of 10. The CW2 attack tries to find the minimal change to the image, with respect to the L2 norm, that will lead the image to be misclassified. carlini2017adversarial demonstrate it to be one of the strongest extant adversarial attacks.
We use SIS and our clustering procedure to understand and visualize differences in features learned by two different models trained on the same MNIST digit classification task. In addition to the previously described CNN model (see Section S3.1), we also trained a simple multilayer perceptron (MLP) on the same task. The MLP architecture is as follows:
Input: 784-dimensional (flattened) image, all values in [0, 1]
Dense Layer 1: 250 neurons, ReLU activation, and dropout probability 0.2
Dense Layer 2: 250 neurons, ReLU activation, and dropout probability 0.2
Dense Layer 3: 10 neurons (one per digit class), with softmax activation
As with the CNN, Adadelta adadeltasi is used to minimize cross-entropy loss on the training set. The final MLP model achieves 99.7% accuracy on the train set and 98.3% accuracy on the test set, which is close to the performance of the CNN (see Section S3.1).
We apply the same procedure as in Section 4.3 to extract the SIScollection from all applicable test images using the MLP. To understand differences between the feature patterns that each model has learned to associate with predicting each digit, we combine all SIS (from both models for a particular class) and run our clustering procedure (see Section 4.4 and Figure 9). In the resulting clustering, we list what percentage of the SIS in each cluster stem from the CNN vs. the MLP. Most clusters contain examples purely from a single model, indicating the two models have learned to associate different feature patterns with the target class (Figure 9), which was chosen to be the digit 4 in this case.
For further comparison, we include clustering results for the SIS extracted from the MLP as evidence for digits 4 and 7 (Figure S10). Additionally, Figure S11 shows all of the SIS extracted from example digits from these classes applying our procedure on the MLP.
Following Lei16si, we use a preprocessed version of the BeerAdvocate (https://www.beeradvocate.com/) dataset (http://snap.stanford.edu/data/web-BeerAdvocate.html), which contains decorrelated numerical ratings toward three aspects: aroma, appearance, and palate (each normalized to a fixed range). Dataset statistics can be found in Table S1. Reviews were tokenized by converting to lowercase and filtering punctuation, and we used a vocabulary containing the top 10,000 most common words. mcauley2012learningsi also provide a subset of human-annotated reviews, in which humans manually selected full sentences in each review that describe the relevant aspects. This annotated set was never seen during training and was used solely as part of our evaluation.
Long short-term memory (LSTM) networks are commonly employed for natural language tasks such as sentiment analysis wang2016attentionsi, radford2017learning. We use a recurrent neural network (RNN) architecture with two stacked LSTMs as follows:
Input/Embeddings Layer: Sequence with 500 timesteps, the word at each timestep is represented by a (learned) 100-dimensional embedding
LSTM Layer 1: 200-unit recurrent layer with LSTM (forward direction only)
LSTM Layer 2: 200-unit recurrent layer with LSTM (forward direction only)
Dense: 1 neuron (sentiment output), sigmoid activation
With this architecture, we use the Adam optimizer adamsi to minimize mean squared error (MSE) on the training set. We use a held-out set of 3,000 examples for validation (sampled at random from the predefined test set from Lei16si). Our test set consists of the remaining 7,000 test examples. Training results are shown in Table S1.
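| Aspect | Fold | Size | MSE | Pearson |
|---|---|---|---|---|
| Appearance | Train | 80,000 | 0.016 | 0.864 |
| Appearance | Validation | 3,000 | 0.024 | 0.783 |
| Appearance | Test | 7,000 | 0.023 | 0.801 |
| Appearance | Annotation | 994 | 0.020 | 0.563 |
| Aroma | Train | 70,000 | 0.014 | 0.873 |
| Aroma | Validation | 3,000 | 0.024 | 0.767 |
| Aroma | Test | 7,000 | 0.025 | 0.756 |
| Aroma | Annotation | 994 | 0.021 | 0.598 |
| Palate | Train | 70,000 | 0.016 | 0.835 |
| Palate | Validation | 3,000 | 0.029 | 0.680 |
| Palate | Test | 7,000 | 0.028 | 0.694 |
| Palate | Annotation | 994 | 0.016 | 0.592 |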
In Section 3, we discuss the problem of masking input features. Here, we show that the mean-imputation approach (in which missing inputs are masked with a mean embedding, taken over the entire vocabulary) produces a nearly identical change in prediction to a non-deterministic hot-deck approach (in which missing inputs are replaced by randomly sampling feature values from the data). Figure S12 shows the change in prediction by both imputation techniques after drawing a training example and word (both uniformly at random) and replacing the word with either the mean embedding or a randomly selected word (drawn from the vocabulary, based on counts in the training corpus). This procedure is repeated 10,000 times. Both resulting distributions have mean near zero, and the distribution for the mean embedding is slightly narrower. We conclude that mean-imputation is a suitable method for masking information about particular feature values in our SIS analysis.
We also explored other options for masking word information, e.g. replacement with a zero embedding, replacement with the learned <PAD> embedding, and simply removing the word entirely from the input sequence, but each of these alternative options led to undesirably larger changes in predicted values as a result of masking, indicating they appear more informative to the model than replacement via the feature mean.
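The two imputation strategies compared above can be sketched as follows. This is an illustrative toy, assuming embeddings stored as plain lists with one vector per vocabulary word and corpus counts supplied separately. All names are hypothetical.

```python
import random


def mean_embedding(embeddings):
    """Coordinate-wise mean over all vocabulary embeddings."""
    count, dim = len(embeddings), len(embeddings[0])
    return [sum(vec[d] for vec in embeddings) / count for d in range(dim)]


def mask_with_mean(sequence, position, mean_vec):
    """Deterministic masking: replace one word vector with the mean embedding."""
    masked = list(sequence)
    masked[position] = list(mean_vec)
    return masked


def mask_with_hot_deck(sequence, position, embeddings, counts, rng):
    """Hot-deck masking: replace one word vector with a count-weighted random draw."""
    masked = list(sequence)
    masked[position] = rng.choices(embeddings, weights=counts, k=1)[0]
    return masked


toy_vocab = [[0.0, 2.0], [2.0, 4.0]]
toy_mean = mean_embedding(toy_vocab)
toy_sequence = [toy_vocab[0], toy_vocab[1], toy_vocab[0]]
toy_mean_masked = mask_with_mean(toy_sequence, 1, toy_mean)
toy_hot_deck_masked = mask_with_hot_deck(toy_sequence, 1, toy_vocab, [3, 1], random.Random(0))
```

Repeating both maskings over many random examples and comparing the resulting changes in model output would reproduce the kind of comparison described above.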
For each feature in the input sequence, we quantify its marginal importance by individually perturbing only this feature and measuring the resulting change in prediction (Equation 1).
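A minimal sketch of this marginal score is shown below. The sign convention (original prediction minus masked prediction) is an assumption, and the toy linear model and mask value are illustrative only.

```python
def feature_importance(f, x, mask_value):
    """Change in prediction when each feature alone is masked (assumed sign: drop in f)."""
    baseline = f(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = mask_value
        scores.append(baseline - f(perturbed))
    return scores


def toy_linear_model(v):
    return 0.5 * v[0] + 0.25 * v[1]


toy_importance = feature_importance(toy_linear_model, [1, 1], mask_value=0)
```

Sorting features by these scores gives the ordering used by the perturbation-based rationale methods.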
Note that these marginal Feature Importance scores are identical to those of the Perturb. method described in Section S1. The marginal Feature Importance scores are summarized in Table S2 and Figure S14. Compared to the Suff. IG and Suff. LIME methods, our SIScollection technique produces rationales that are much shorter and contain fewer irrelevant (i.e. not marginally important) features (Table S2, Figures S14 and S14). Note that by construction, the rationales of the Suff. Perturb. method contain features with the greatest Feature Importance, since this is precisely how the ranking in Suff. Perturb. is defined.
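| Method | Rationale Length (% of text): Med. | Rationale Length (% of text): Max | Rationale Length: p-value (vs. SIS) | Marginal Perturbed Feature Importance: Med. (Rationale) | Marginal Perturbed Feature Importance: Med. (Other) | Marginal Perturbed Feature Importance: p-value (vs. SIS) |
|---|---|---|---|---|---|---|
| SIS | 3.9% | 17.3% | – | 0.0112 | 1.50e-05 | – |
| Suff. IG | 7.7% | 89.7% | 5e-26 | 0.0068 | 1.85e-05 | 3e-42 |
| Suff. LIME | 7.2% | 84.0% | 4e-23 | 0.0075 | 1.87e-05 | 1e-35 |
| Suff. Perturb. | 5.1% | 18.3% | 1e-06 | 0.0209 | 1.90e-05 | 1e-72 |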
We apply our method to the set of reviews containing sentence-level annotations. Note that these reviews (and the human annotations) were not seen during training. We choose two thresholds, one for strong positive and one for strong negative sentiment, and extract the complete set of sufficient input subsets using our method. Note that in our formulation above, we apply our method to inputs whose prediction meets the threshold. For the sentiment analysis task, we analogously apply our method to inputs on either side of these thresholds, where the model predicts either strong positive or strong negative sentiment, respectively. These thresholds were set empirically such that they were sufficiently far apart, based on the distribution of predictions (Figure S16). For most reviews, SIScollection outputs just one or two SIS sets (Figure S16).
We analyzed the predictor output following the elimination of each feature in the BackSelect procedure (Section 3). Figure S17 shows the LSTM output on the remaining unmasked text at each iteration of BackSelect, for all examples. This figure reveals that only a small number of features are needed by the model in order to make a strong prediction (most features can be removed without changing the prediction). We see that as those final, critical features are removed, there is a rapid, monotonic decrease in output values. Finally, we see that the first features to be removed by BackSelect are those which generally provide negative evidence against the decision.
| Cluster | SIS #1 | Freq. | SIS #2 | Freq. | SIS #3 | Freq. | SIS #4 | Freq. |
|---|---|---|---|---|---|---|---|---|
| | smell amazing wonderful | 2 | nice wonderful nose | 2 | wonderful amazing | 2 | amazing amazing | 2 |
| | grapefruit mango pineapple | 2 | pineapple grapefruit pineapple grapefruit | 1 | hops grapefruit pineapple floyds | 1 | mango pineapple incredible | 1 |
| | nice smell citrus nice grapefruit taste | 1 | smell great complex ripe taste | 1 | nice smell nice hop smell pine taste | 1 | love nice nice smell bliss taste | 1 |
| | fresh great fantastic taste | 1 | rich great fantastic hoped | 1 | fantastic cherries fantastic | 1 | everyone great snifters fantastic | 1 |
| | awesome bounds | 1 | awesome grapefruit awesome | 1 | awesome awesome pleasing | 1 | awesome nailed nailed | 1 |
| | creme brulee brulee | 3 | creme brulee decadent | 1 | incredible creme brulee | 1 | creme brulee exceptional | 1 |
| | oak vanilla chocolate cinnamon vanilla oak love | 1 | dose oak chocolate vanilla acidic | 1 | vanilla figs oak thinner great | 1 | chocolate aroma oak vanilla dessert | 1 |
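| Cluster | SIS #1 | Freq. | SIS #2 | Freq. | SIS #3 | Freq. | SIS #4 | Freq. |
|---|---|---|---|---|---|---|---|---|
| | awful | 15 | skunky skunky | 9 | skunky t | 7 | skunky taste | 6 |
| | garbage | 3 | taste garbage | 1 | garbage avoid | 1 | garbage rice | 1 |
| | vomit | 16 | | | | | | |
| | gross rotten | 1 | rotten forte | 1 | awkward rotten | 1 | rotten offputting | 1 |
| | rancid horrid | 1 | rancid t | 1 | rancid | 1 | rancid avoid | 1 |
| | rice t rice | 2 | rice rice | 1 | rice tasteless | 1 | budweiser rice | 1 |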
We demonstrate how our SIS-clustering procedure can be used to understand differences in the types of concepts considered important by different neural network architectures. In addition to the LSTM (see Section S4.2), we trained a convolutional neural network (CNN) on the same sentiment analysis task (on the aroma aspect). The CNN architecture is as follows:
Input/Embeddings Layer: Sequence with 500 timesteps, the word at each timestep is represented by a (learned) 100-dimensional embedding
Convolutional Layer 1: Applies 128 filters of window size 3 over the sequence, with ReLU activation
Max Pooling Layer 1: Max-over-time pooling, followed by flattening, to produce a fixed-length representation
Dense: 1 neuron (sentiment output), sigmoid activation
Note that a new set of embeddings was learned with the CNN. As with the LSTM model, we use Adam adamsi to minimize MSE on the training set. For the aroma aspect, this CNN achieves 0.016 (0.850), 0.025 (0.748), 0.026 (0.741), 0.014 (0.662) MSE (and Pearson correlation) on the Train, Validation, Test, and Annotation sets, respectively. We note that this performance is very similar to that from the LSTM (see Table S1).
We apply our procedure to extract the SIScollection from all applicable test examples using the CNN, as in Section 4.1. Figure 10 shows the predictions from one model (LSTM or CNN) when fed input examples that are SIS extracted with respect to the other model (for reviews predicted to have positive sentiment toward the aroma aspect). For example, in Figure 10, “CNN SIS Preds by LSTM” refers to predictions made by the LSTM on the set of sufficient input subsets produced by applying our SIScollection procedure on all examples that the CNN predicts to have strong positive sentiment. (Footnote 7: For experiments involving clustering and/or comparing different models, we use examples drawn from the Test fold, instead of the Annotation fold (see Table S1), to consider a larger number of examples.) Since the word embeddings are model-specific, we embed each SIS using the embeddings of the model making the prediction (note that while the embeddings are different, the vocabulary is the same across the models).
In Table 2, we show five example clusters (and cluster composition) resulting from clustering the combined set of all sufficient input subsets extracted by the LSTM and CNN on reviews in the test set for which a model predicts positive sentiment toward the aroma aspect. The complete clustering on reviews receiving positive sentiment predictions is shown in Table S5 and in Table S6 for reviews receiving negative sentiment predictions.
| Cluster | SIS #1 | Freq. | SIS #2 | Freq. | SIS #3 | Freq. | SIS #4 | Freq. |
|---|---|---|---|---|---|---|---|---|
| (LSTM: 20%) | rich chocolate | 13 | very rich | 9 | chocolate complex | 5 | smells rich | 4 |
| (LSTM: 21%) | great | 248 | amazing | 119 | wonderful | 112 | fantastic | 75 |
| (LSTM: 47%) | best smelling | 23 | pineapple mango | 6 | mango pineapple | 6 | pineapple grapefruit | 5 |
| (LSTM: 5%) | excellent | 42 | excellent flemish flemish | 1 | excellent excellent phenomenal | 1 | | |
| (LSTM: 33%) | oak chocolate | 2 | chocolate raisins raisins oak bourbon | 1 | chocolate oak | 1 | raisins chocolate | 1 |
| (LSTM: 5%) | goodness | 19 | watering goodness | 1 | | | | |
| (LSTM: 24%) | pumpkin pie | 25 | huge pumpkin aroma pumpkin pie | 1 | aroma perfect pumpkin pie taste | 1 | smell pumpkin nutmeg cinnamon pie | 1 |
| (LSTM: 5%) | jd | 13 | tremendous | 8 | tremendous jd | 1 | | |
| (LSTM: 40%) | brulee | 14 | creme brulee brulee | 3 | creme creme | 1 | creme brulee amazing | 1 |
| (LSTM: 0%) | s wow | 20 | | | | | | |
| (LSTM: 0%) | delicious | 56 | | | | | | |
| (LSTM: 0%) | very nice | 23 | | | | | | |
| (LSTM: 70%) | complex aroma | 5 | aroma complex peaches complex | 1 | aroma complex interesting cherries | 1 | aroma complex | 1 |
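| Cluster | SIS #1 | Freq. | SIS #2 | Freq. | SIS #3 | Freq. | SIS #4 | Freq. |
|---|---|---|---|---|---|---|---|---|
| (LSTM: 29%) | not | 247 | no | 105 | bad | 104 | macro | 94 |
| (LSTM: 100%) | gross rotten | 1 | | | | | | |
| (LSTM: 100%) | rotten garbage | 1 | | | | | | |
| (LSTM: 62%) | vomit | 26 | | | | | | |
| (LSTM: 21%) | budweiser | 22 | sewage budweiser | 1 | metal budweiser | 1 | budweiser budweiser budweiser | 1 |
| (LSTM: 100%) | garbage rice | 1 | | | | | | |
| (LSTM: 3%) | n’t | 19 | adjuncts | 14 | n’t adjuncts | 1 | | |
| (LSTM: 0%) | faint | 82 | | | | | | |
| (LSTM: 0%) | adjunct | 42 | | | | | | |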
For completeness, we include results here from repeating the analysis in our paper for the two other non-aroma aspects measured in the beer reviews data: appearance and palate.
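| Method | Rationale Length (% of text): Med. | Rationale Length (% of text): Max | Rationale Length: p-value (vs. SIS) | Marginal Perturbed Feature Importance: Med. (Rationale) | Marginal Perturbed Feature Importance: Med. (Other) | Marginal Perturbed Feature Importance: p-value (vs. SIS) |
|---|---|---|---|---|---|---|
| SIS | 2.6% | 10.6% | – | 0.0183 | 1.72e-05 | – |
| Suff. IG | 3.7% | 89.3% | 2e-09 | 0.0184 | 2.41e-05 | 1e-02 |
| Suff. LIME | 3.7% | 98.2% | 8e-09 | 0.0167 | 2.38e-05 | 6e-09 |
| Suff. Perturb. | 3.0% | 14.9% | 9e-03 | 0.0339 | 2.51e-05 | 5e-44 |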
| Cluster | SIS #1 | Freq. | SIS #2 | Freq. | SIS #3 | Freq. | SIS #4 | Freq. |
|---|---|---|---|---|---|---|---|---|
| | beautiful | 376 | nitro | 51 | looks great | 38 | great looking | 32 |
| | gorgeous | 83 | | | | | | |
| | beautifully | 7 | absolutely beautifully | 2 | beautifully pillowy | 1 | beautifully bands | 1 |
| | brilliant | 5 | brilliant slowly | 1 | wonderfully brilliant | 1 | appearance brilliant | 1 |
| | lovely looking | 3 | black lovely | 3 | impressive lovely | 1 | lovely crystal | 1 |
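| Cluster | SIS #1 | Freq. | SIS #2 | Freq. | SIS #3 | Freq. | SIS #4 | Freq. |
|---|---|---|---|---|---|---|---|---|
| | piss | 46 | zero | 38 | water water | 37 | water | 27 |
| | unappealing | 12 | floaties | 12 | floaties unappealing | 1 | | |
| | ugly | 12 | | | | | | |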
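| Method | Rationale Length (% of text): Med. | Rationale Length (% of text): Max | Rationale Length: p-value (vs. SIS) | Marginal Perturbed Feature Importance: Med. (Rationale) | Marginal Perturbed Feature Importance: Med. (Other) | Marginal Perturbed Feature Importance: p-value (vs. SIS) |
|---|---|---|---|---|---|---|
| SIS | 2.4% | 13.7% | – | 0.0210 | 8.94e-07 | – |
| Suff. IG | 3.2% | 56.1% | 2e-06 | 0.0163 | 9.54e-07 | 6e-10 |
| Suff. LIME | 3.0% | 57.0% | 7e-06 | 0.0173 | 1.19e-06 | 2e-07 |
| Suff. Perturb. | 2.8% | 11.8% | 3e-03 | 0.0319 | 1.25e-06 | 5e-26 |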
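| Cluster | SIS #1 | Freq. | SIS #2 | Freq. | SIS #3 | Freq. | SIS #4 | Freq. |
|---|---|---|---|---|---|---|---|---|
| | smooth creamy | 27 | silky smooth | 20 | mouthfeel perfect | 16 | creamy perfect | 12 |
| | mouthfeel exceptional | 6 | exceptional mouthfeel | 4 | | | | |
| | perfect | 50 | perfect perfect | 6 | | | | |
| | smooth velvety | 6 | velvety smooth | 6 | | | | |
| | silk | 11 | | | | | | |
| | smooth perfect | 8 | mouth smooth perfect | 1 | perfect smooth | 1 | | |
| | perfect great | 5 | great perfect | 2 | feels perfect | 2 | perfect feels great | 1 |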
| Cluster | SIS #1 | Freq. | SIS #2 | Freq. | SIS #3 | Freq. | SIS #4 | Freq. |
|---|---|---|---|---|---|---|---|---|
| | overcarbonated | 12 | mouthfeel overcarbonated | 3 | way overcarbonated | 1 | overcarbonated mouthfeel | 1 |
| | watery | 302 | thin | 238 | flat | 118 | mouthfeel thin | 33 |
| | too carbonation masks | 1 | too carbonation d | 1 | mouthfeel odd too too | 1 | too carbonated admire | 1 |
| | lack carbonation | 4 | carbonation lack | 4 | carbonation hurts | 2 | issue lack hurts | 1 |