Code for the 2018 EMNLP Interpretability Workshop Paper "Interpreting Neural Networks with Nearest Neighbors"
Local model interpretation methods explain individual predictions by assigning an importance value to each input feature. This value is often determined by measuring the change in confidence when a feature is removed. However, the confidence of neural networks is not a robust measure of model uncertainty. This issue makes reliably judging the importance of the input features difficult. We address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors. Without harming text classification accuracy, this algorithm provides a more robust uncertainty metric which we use to generate feature importance values. The resulting interpretations better align with human perception than baseline methods. Finally, we use our interpretation method to analyze model predictions on dataset annotation artifacts.
The growing use of neural networks in sensitive domains such as medicine, finance, and security raises concerns about human trust in these machine learning systems. A central question is test-time interpretability: how can humans understand the reasoning behind model predictions?
A common way to interpret neural network predictions is to identify the most important input features, for instance, with a saliency map that highlights important pixels in an image Sundararajan et al. (2017) or words in a sentence Li et al. (2016). Given a test prediction, the importance of each input feature is the change in model confidence when that feature is removed.
However, neural network confidence is not a proper measure of model uncertainty Guo et al. (2017). This issue is emphasized when models make highly confident predictions on inputs that are completely void of information, for example, images of pure noise Goodfellow et al. (2015) or meaningless text snippets Feng et al. (2018). Consequently, a model’s confidence may not properly reflect whether discriminative input features are present. This issue makes it difficult to reliably judge the importance of each input feature using common confidence-based interpretation methods Feng et al. (2018).
To address this, we apply Deep k-Nearest Neighbors (DkNN) Papernot and McDaniel (2018)
to neural models for text classification. Concretely, predictions are no longer made with a softmax classifier, but using the labels of the training examples whose representations are most similar to the test example (Section 3). This provides an alternative metric for model uncertainty, conformity, which measures how much support a test prediction has by comparing its hidden representations to the training data. This representation-based uncertainty measurement can be used in combination with existing interpretation methods, such as leave-one-out Li et al. (2016), to better identify important input features.
We combine DkNN with CNN and LSTM models on six nlp
text classification tasks, including sentiment analysis and textual entailment, with no loss in classification accuracy (Section 4). We compare interpretations generated using DkNN conformity to baseline interpretation methods, finding DkNN interpretations rarely assign importance to extraneous words that do not align with human perception (Section 5). Finally, we generate interpretations using DkNN conformity for a dataset with known artifacts (snli), helping to indicate whether a model has learned superficial patterns. We open source the code for DkNN and our results at https://github.com/Eric-Wallace/deep-knn.
Feature attribution methods explain a test prediction by assigning an importance value to each input feature (typically pixels or words).
In the case of text classification, we have an input sequence of n words, x = ⟨w_1, w_2, …, w_n⟩, represented as one-hot vectors. The word sequence is then converted to a sequence of word embeddings. A classifier f predicts the class y, with its probability f_y(x) serving as the model confidence. To create an interpretation, each input word w_i is assigned an importance value g(w_i | x), which indicates the word's contribution to the prediction. A saliency map (or heat map) visually highlights these importance values over the words in a sentence.
A simple way to define the importance is via leave-one-out Li et al. (2016): individually remove a word from the input and see how the confidence changes. The importance of word w_i is the decrease in confidence (equivalently the change in class score or cross-entropy loss) when word w_i is removed:

g(w_i | x) = f_y(x) − f_y(x_{−i}),   (1)

where x_{−i} is the input sequence with the i-th word removed and f_y is the model confidence for class y. This can be repeated for all words in the input. Under this definition, the sign of the importance value is opposite the sign of the confidence change: if a word's removal causes a decrease in the confidence, it gets a positive importance value. We refer to this interpretation method as Confidence leave-one-out in our experiments.
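This procedure can be sketched in a few lines; the following is an illustrative implementation, not the authors' released code, and the `confidence` callable is a stand-in for any model that maps a word list and class to a probability:

```python
def leave_one_out_importance(words, cls, confidence):
    """Importance of each word: the drop in model confidence for class `cls`
    when that word is removed from the input."""
    base = confidence(words, cls)
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]        # input with the i-th word removed
        scores.append(base - confidence(reduced, cls))
    return scores
```

A word whose removal lowers the confidence receives a positive importance value, matching the sign convention above.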
In the case of neural networks, the model as a function of word w_i is a highly non-linear, differentiable function. Rather than leaving one word out at a time, we can simulate a word's removal by approximating f with a function that is linear in w_i through a first-order Taylor expansion. The importance of w_i is computed as the derivative of f_y with respect to the word's one-hot vector, which reduces to

g(w_i | x) = ∇_{e_i} f_y(x) · e_i,

where e_i is the embedding of word w_i.
Thus, a word’s importance is the dot product between the gradient of the class prediction with respect to the embedding and the word embedding itself. This gradient approximation simulates the change in confidence when an input word is removed and has been used in various interpretation methods for nlp Arras et al. (2016); Ebrahimi et al. (2017). We refer to this interpretation approach as Gradient in our experiments.
Ghorbani et al. (2017) show how a lack of model robustness and stability can cause egregious interpretation failures in computer vision settings. Feng et al. (2018) extend this to nlp and draw connections between interpretation failures and adversarial examples Szegedy et al. (2014). To counteract this, new interpretation methods alone are not enough: models themselves must be improved. For instance, Feng et al. (2018) argue that interpretation methods should not rely on prediction confidence, as it does not reflect a model's uncertainty.
Following this, we improve interpretations by replacing the softmax confidence with a more robust uncertainty estimate using DkNN Papernot and McDaniel (2018). This algorithm maintains the accuracy of standard image classification models while providing a better uncertainty metric capable of defending against adversarial examples.
This section describes Deep k-Nearest Neighbors, its application to sequential inputs, and how we use it to determine word importance values.
Papernot and McDaniel (2018) propose Deep k-Nearest Neighbors (DkNN), a modification to the test-time behavior of neural networks.
After training completes, the DkNN algorithm passes every training example through the model and saves each of the layer’s representations. This creates a new dataset, whose features are the representations and whose labels are the model predictions. Test-time predictions are made by passing an example through the model and performing k-nearest neighbors classification on the resulting representations. This modification does not degrade the accuracy of image classifiers on several standard datasets Papernot and McDaniel (2018).
For our purposes, the benefit of DkNN is the algorithm’s uncertainty metric, the conformity score. This score is the percentage of nearest neighbors belonging to the predicted class. Conformity follows from the framework of conformal prediction Shafer and Vovk (2008) and estimates how much the training data supports a classification decision.
The conformity score uses the representations at each neural network layer, and therefore, a prediction only receives high conformity if it largely agrees with the training data at all representation levels. This mechanism defends against adversarial examples Szegedy et al. (2014), as it is difficult to construct a perturbation which changes the neighbors at every layer. Consequently, conformity is a better uncertainty metric for both regular examples and out-of-domain examples such as noisy or adversarial inputs, making it suitable for interpreting models.
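A minimal sketch of the lookup, assuming layer representations have already been extracted from a trained network; the class and method names are our own, and Euclidean distance stands in for whatever metric (or locality-sensitive hashing scheme) a real implementation would use:

```python
import numpy as np

class DkNNSketch:
    """k-nearest-neighbor prediction over saved per-layer representations."""

    def __init__(self, train_reps, train_labels, k=5):
        # train_reps: one [n_train, d_layer] array per layer
        self.train_reps = train_reps
        self.train_labels = np.asarray(train_labels)
        self.k = k

    def neighbor_labels(self, test_reps):
        """Labels of the k nearest training points at every layer."""
        labels = []
        for layer_reps, rep in zip(self.train_reps, test_reps):
            dists = np.linalg.norm(layer_reps - rep, axis=1)
            labels.extend(self.train_labels[np.argsort(dists)[: self.k]])
        return np.array(labels)

    def predict(self, test_reps):
        """Predicted class and its conformity: the fraction of neighbors,
        across all layers, that share the predicted label."""
        labels = self.neighbor_labels(test_reps)
        classes, counts = np.unique(labels, return_counts=True)
        pred = classes[np.argmax(counts)]
        conformity = counts.max() / len(labels)
        return pred, conformity
```

A prediction only attains conformity near 1.0 when the neighbors agree at every layer, which is the property that makes the score hard to fool with small perturbations.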
The DkNN algorithm requires fixed-size vector representations. To reach a fixed-size representation for text classification, we take either the final hidden state of a recurrent neural network or use max pooling across time Collobert and Weston (2008). We consider deep architectures of these two forms, using each layer's representation as the features for DkNN.
Using conformity, we generate interpretations through a modified version of leave-one-out Li et al. (2016). After removing a word, rather than observing the drop in confidence, we instead measure the drop in conformity. Formally, we modify the classifier in Equation 1 to output probabilities based on conformity. We refer to this method as conformity leave-one-out.
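The swap from confidence to conformity is mechanical; a minimal sketch, where `conformity` is any callable (e.g., backed by a trained DkNN model) returning the conformity of the predicted class:

```python
def conformity_leave_one_out(words, conformity):
    """Word importance as the drop in conformity when that word is removed."""
    base = conformity(words)
    return [base - conformity(words[:i] + words[i + 1:])
            for i in range(len(words))]
```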
Interpretability should not come at the cost of performance—before investigating how interpretable DkNN is, we first evaluate its accuracy. We experiment with six text classification tasks and two models, verifying that DkNN achieves accuracy comparable to regular classifiers.
We consider six common text classification tasks: binary sentiment analysis using Stanford Sentiment Treebank (Socher et al., 2013, sst) and Customer Reviews (Hu and Liu, 2004, cr), topic classification using TREC Li and Roth (2002), opinion polarity (Wiebe et al., 2005, mpqa), and subjectivity/objectivity (Pang and Lee, 2004, subj). Additionally, we consider natural language inference with snli Bowman et al. (2015). We experiment with BiLSTM and CNN models.
Our CNN architecture resembles Kim (2014). We use convolutional filters of size three, four, and five, with max-pooling over time Collobert and Weston (2008). The filters are followed by three fully-connected layers. We fine-tune GloVe embeddings Pennington et al. (2014) of each word. For DkNN, we use the activations from the convolution layer and the three fully-connected layers.
Our architecture uses a bidirectional LSTM Graves and Schmidhuber (2005), with the final hidden state forming the fixed-size representation. We use three LSTM layers, followed by two fully-connected layers. We fine-tune GloVe embeddings of each word. For DkNN, we use the final activations of the three recurrent layers and the two fully-connected layers.
Unlike the other tasks, which consist of a single input sentence, snli has two inputs, a premise and a hypothesis. Following Conneau et al. (2017), we use the same model to encode the two inputs, generating a representation u for the premise and v for the hypothesis. We concatenate these two representations along with their element-wise product and element-wise absolute difference, arriving at a final representation [u; v; u ∗ v; |u − v|]. This vector passes through two fully-connected layers for classification. For DkNN, we use the activations of the two fully-connected layers.
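The combination step can be written directly from the description above (a transcription of the Conneau et al. (2017) feature vector, not the authors' code):

```python
import numpy as np

def nli_representation(u, v):
    """Joint premise/hypothesis representation: concatenation of the two
    encodings, their element-wise product, and their absolute difference."""
    return np.concatenate([u, v, u * v, np.abs(u - v)])
```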
DkNN achieves comparable accuracy on the five classification tasks (Table 1). On snli, the BiLSTM achieves an accuracy of 81.2% with a softmax classifier and 81.0% with DkNN.
We focus on the sst dataset for generating interpretations. Due to the lack of standard interpretation evaluation metrics Doshi-Velez and Kim (2017), we use qualitative evaluations Smilkov et al. (2017); Sundararajan et al. (2017); Li et al. (2016), performing quantitative experiments where possible to examine the distinction between the interpretation methods.
We compare our method (Conformity leave-one-out) against two baselines: leave-one-out using regular confidence (Confidence leave-one-out, see Section 2.1) and the gradient with respect to the input (Gradient, see Section 2.2). To create saliency maps, we normalize each word's importance by dividing it by the total importance of the words in the sentence. We display unknown words in angle brackets. Table 2 shows sst interpretation examples for the BiLSTM model, and further examples are shown on a supplementary website: https://sites.google.com/view/language-dknn/
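The normalization step can be sketched as follows; dividing by the sum of absolute importances is our assumption for handling negative scores:

```python
def normalize_importance(importances):
    """Scale word importances so their absolute values sum to one."""
    total = sum(abs(v) for v in importances)
    return [v / total for v in importances]
```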
Conformity leave-one-out assigns concentrated importance values to a small number of input words. In contrast, the baseline methods assign non-zero importance values to numerous words, many of which are irrelevant. For instance, in all three examples of Table 2, both baselines highlight almost half of the input, including words such as “fiction” and “clash”. We suspect model confidence is oversensitive to these unimportant input changes, causing the baseline interpretations to highlight unimportant words. On the other hand, the conformity score better separates word importance, generating clearer interpretations.
The tendency for confidence-based approaches to assign importance to many words holds for the entire test set. We compute the average number of words whose normalized importance value exceeds a fixed threshold (i.e., words that receive a visible highlight). Of the 20.23 words in an average sst test example, gradient highlights 5.32 words, confidence leave-one-out highlights 5.79 words, and conformity leave-one-out highlights 3.65 words.
The second, and related, observation for confidence-based approaches is a bias towards selecting word importance based on a word’s inherent sentiment, rather than its meaning in context. For example, see “clash”, “terribly”, and “unfaithful” in Table 2. The removal of these words causes a small change in the model confidence. When using DkNN, the conformity score indicates that the model’s uncertainty has not risen without these input words and leave-one-out does not assign them any importance.
We characterize our interpretation method as higher precision, but slightly lower recall, than confidence-based methods. Conformity leave-one-out rarely assigns high importance to words that do not align with human perception of sentiment. However, there are cases when our method does not assign significant importance to any word. This occurs when the input is highly redundant, for example, a positive movie review that describes the sentiment in four distinct ways. In these cases, leaving out a single sentiment word has little effect on the conformity, as the model's representation remains supported by the other redundant features. Confidence-based interpretations, which interpret models using the linear units that produce class scores, achieve higher recall by responding to every change in the input in a certain direction, but may have lower precision.
In the second example of Table 2, the word "terribly" is assigned a negative importance value, disregarding its positive meaning in context. To examine whether this is a stand-alone example or a more general pattern of uninterpretable behavior, we calculate the importance value of the word "terribly" in other positive examples. For each occurrence of the word "great" in positive validation examples, we paraphrase it to "awesome", "wonderful", or "impressive", and add the word "terribly" in front of it. This process yields a collection of positive examples. For each of these examples, we compute the importance value of each input word and rank the words from most negative to most positive (the most negative word has a rank of 1). We compare the average rank of "terribly" across the three methods: 7.9 for conformity leave-one-out, 1.68 for confidence leave-one-out, and 1.1 for gradient. The baseline methods consistently rank "terribly" as the most negative word, ignoring its meaning in context. This echoes our suspicion: DkNN generates interpretations with higher precision because conformity is robust to irrelevant input changes.
We use conformity leave-one-out to interpret a model trained on snli, a dataset known to contain annotation artifacts. We demonstrate that our interpretation method can help identify when models exploit dataset biases.
Recent studies Gururangan et al. (2018); Poliak et al. (2018) identify annotation artifacts in snli. Superficial patterns exist in the input which strongly correlate with certain labels, making it possible for models to “game” the task: obtain high accuracy without true understanding. For instance, the hypothesis of an entailment example is often a general paraphrase of the premise, using words such as “outside” instead of “playing in a park”. Contradiction examples often contain negation words or non-action verbs like “sleeping”. Models trained solely on the hypothesis can learn these patterns and reach accuracies considerably higher than the majority baseline.
These studies indicate that the snli task can be gamed. We look to confirm that some artifacts are indeed exploited by normally trained models that use full input pairs. We create saliency maps for examples in the validation set using conformity leave-one-out. Table 3 shows samples, and more can be found on the supplementary website. We use blue highlights to indicate words which positively support the model's predicted class, and red to indicate words that support a different class. The first example is a randomly sampled baseline, showing how the words "swims" and "pool" support the model's prediction of contradiction. The other examples are selected because they contain terms identified as artifacts. In the second example, conformity leave-one-out assigns extremely high word importance to "sleeping", disregarding the other words necessary to predict contradiction (i.e., the neutral class is still possible if "pets" is replaced with "people"). In the final two hypotheses, the interpretation method diagnoses the model failure, assigning high importance to "wearing", rather than focusing positively on the shirt color.
To explore this further, we analyze the hypotheses in each snli class which contain one of the top five artifacts identified by Gururangan et al. (2018). For each of these examples, we compute the importance value for each input word using both confidence and conformity leave-one-out. We then rank the words from most important for the prediction to least important (a score of 1 indicates highest importance) and report the average rank for the artifacts in Table 4. We sort the words by their Pointwise Mutual Information with the correct label as provided by Gururangan et al. (2018). The word "nobody" particularly stands out: it is the most important input word every time it appears in a contradiction example.
For most of the artifacts, conformity leave-one-out assigns them a high importance, often ranking the artifacts as the most important input word. Confidence leave-one-out correlates less strongly with the known artifacts, frequently ranking them as low as the fifth or sixth most important word. Given the high correlation between conformity leave-one-out and the manually identified artifacts, this interpretation method may serve as a technique to identify undesirable biases a model has learned.
Table 3 (excerpt; premises only): (1) Contradiction: "a young boy reaches for and touches the propeller of a vintage aircraft." (2) Entailment: "a brown dog and a black dog at the edge of the ocean with a wave under them; boats are on the water in the background." (3) "man in a blue shirt standing in front of a structure painted with geometric designs."
We connect the improvements made by conformity leave-one-out to model confidence issues, compare alternative interpretation improvements, and discuss further features of DkNN.
Many existing feature attribution methods rely on estimates of model uncertainty: both the input gradient and confidence leave-one-out rely on prediction confidence, while our method relies on DkNN conformity. Interpretation quality is thus determined by the reliability of the uncertainty estimate. For instance, past work shows that relying on neural network confidence can lead to unreasonable interpretations Kindermans et al. (2017); Ghorbani et al. (2017); Feng et al. (2018). Independent of interpretability, Guo et al. (2017) show that neural network confidence is unreasonably high: on held-out examples, it far exceeds empirical accuracy. This is further exemplified by the high confidence predictions produced on inputs that are adversarial Szegedy et al. (2014) or contain solely noise Goodfellow et al. (2015).
We attribute one interpretation failure to neural network confidence issues. Guo et al. (2017) study overconfidence and propose a calibration procedure using Platt scaling, which adjusts the temperature parameter of the softmax function to align confidence with accuracy on a held-out dataset. However, this calibration is not input dependent: the same temperature rescales the confidence of full-length examples and of examples with words left out. Hence, selecting influential words will remain difficult.
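A minimal illustration for the binary case: temperature scaling divides every logit gap by one global constant, a monotone transformation, so the ranking of confidences across candidate word removals is unchanged (the logit gaps below are hypothetical):

```python
import numpy as np

def calibrated_confidence(logit_gaps, temperature):
    """Binary confidence after temperature scaling: sigmoid(gap / T)."""
    gaps = np.asarray(logit_gaps, dtype=float)
    return 1.0 / (1.0 + np.exp(-gaps / temperature))
```

Because the ordering is preserved, the word whose removal most reduces the calibrated confidence is the same word as before calibration.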
To verify this, we create an interpretation baseline using temperature scaling. The results corroborate the intuition: calibrating the confidence of leave-one-out does not improve interpretations. Qualitatively, the calibrated interpretation results remain comparable to confidence leave-one-out. Furthermore, calibrating the DkNN conformity score as in Papernot and McDaniel (2018) did not improve interpretability compared to the uncalibrated conformity score.
SmoothGrad Smilkov et al. (2017) and Integrated Gradients Sundararajan et al. (2017) both aggregate gradient values over multiple backpropagation passes to eliminate local noise or satisfy interpretation axioms. This line of work does not address model confidence and is orthogonal to our DkNN approach.
Retrieval-Augmented Convolutional Neural Networks Zhao and Cho (2018) are similar to DkNN: they augment model predictions with an information retrieval system that searches over network activations from the training data.
Retrieval-Augmented models and DkNN can both select influential training examples for a test prediction. In particular, the training data activations which are closest to the test point’s activations are influential according to the model. These training examples can provide interpretations as a form of analogy Caruana et al. (1999), an intuitive explanation for both machine learning experts and non-experts Klein (1989); Kim et al. (2014); Koh and Liang (2017); Wallace and Boyd-Graber (2018). However, unlike in computer vision where training data selection using DkNN yielded interpretable examples Papernot and McDaniel (2018), our experiments did not find human interpretable data points for sst or snli.
Model confidence is important for real-world applications: it signals how much one should trust a neural network's predictions. Unfortunately, users may be misled when a model outputs highly confident predictions on rubbish examples Goodfellow et al. (2015); Nguyen et al. (2015) or adversarial examples Szegedy et al. (2014). Recent work decides when to trust a neural network model Ribeiro et al. (2016); Doshi-Velez and Kim (2017); Jiang et al. (2018), for instance, by analyzing local linear model approximations Ribeiro et al. (2016) or by flagging rare network activations using kernel density estimation Jiang et al. (2018). The DkNN conformity score is a trust metric that helps defend against image adversarial examples Papernot and McDaniel (2018). Future work should study whether this robustness extends to interpretations.
A robust estimate of model uncertainty is critical for determining feature importance. The DkNN conformity score is one such uncertainty metric, and it leads to higher precision interpretations. DkNN is only a test-time improvement, however: the model is still trained using maximum likelihood. Combining nearest neighbor and maximum likelihood objectives during training may further improve model accuracy and interpretability. Moreover, other uncertainty estimators do not require test-time modifications, for example, Bayesian neural networks Gal et al. (2016).
Similar to other nlp interpretation methods Sundararajan et al. (2017); Li et al. (2016), conformity leave-one-out works when a model’s representation has a fixed size. For other nlp tasks, such as structured prediction (e.g., translation and parsing) or span prediction (e.g., extractive summarization and reading comprehension), models output a variable number of predictions and our interpretation approach will not suffice. Developing interpretation techniques for these types of models is a necessary area for future work.
We apply DkNN to neural models for text classification. This provides a better estimate of model uncertainty—conformity—which we combine with leave-one-out. This overcomes issues stemming from neural network confidence, leading to higher precision interpretations. Most interestingly, our interpretations are supported by the training data, providing insights into the representations learned by a model.
Feng was supported under subcontract to Raytheon BBN Technologies by DARPA award HR0011-15-C-0113. JBG is supported by NSF Grant IIS1652666. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. The authors would like to thank the members of the CLIP lab at the University of Maryland and the anonymous reviewers for their feedback.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.
Yarin Gal. 2016. Uncertainty in Deep Learning. Ph.D. thesis, University of Oxford.