Modern machine learning models can be difficult to probe and understand after they have been trained. This is a major problem for the field, with consequences for trustworthiness, diagnostics, debugging, robustness and a range of other engineering and human interaction issues surrounding the deployment of a model.
There have been several lines of attack on this problem. One involves changing the model design or training process so as to enhance interpretability. This can involve retreating to simpler models and/or incorporating strong regularizers that effectively simplify a complex model. In both cases, however, there is a possible loss of prediction accuracy. Models can also be changed in more sophisticated ways to enhance interpretability; for example, attention-based methods have yielded deep models for vision and language tasks that improve interpretability at no loss to prediction accuracy [1, 37, 12, 6, 38, 36, 35].
Another approach treats interpretability as a separate problem from prediction. Given a predictive model, an interpretation method yields, for each instance to which the model is applied, a vector of importance scores associated with the underlying features. Within this general framework, methods can be classified as being model-agnostic or model-aware. Model-aware methods require additional assumptions, or are specific to a certain class of models [31, 2, 30, 15, 34, 10]. Model-agnostic methods can be applied in a black-box manner to arbitrary models [26, 3, 22, 33, 8, 21].
While the generality of the stand-alone approach to interpretation is appealing, current methods provide little opportunity to leverage prior knowledge about what constitutes a satisfying interpretation in a given domain. The use of such prior knowledge has been studied most notably in the setting of natural-language processing (NLP), where there is an ongoing effort to incorporate linguistic structure (syntactic, semantic and pragmatic) in machine learning models. Such structure can be brought to bear in the model, the interpretation of a model, or both. For example, Socher et al. [32] introduced a recursive deep model to understand and leverage compositionality in tasks such as sentiment detection. Lei et al. [19] proposed to use a combination of two modular components, a generator and an encoder, to explicitly generate rationales and make predictions for NLP tasks.
Compositionality, or the rules used to construct a sentence from its constituent expressions, is an important property of natural language. While current interpretation methods fall short of quantifying compositionality directly, there has been a growing interest in investigating the manner in which existing deep models capture the interactions between constituent expressions that are critical for successful prediction [20, 19, 21, 10]. However, existing approaches often lack a systematic treatment in quantifying interactions, and the generality to be applied to arbitrary models.
In the current paper, we focus on the model-agnostic interpretation of NLP models. Our approach quantifies the importance of words by leveraging the syntactic structure of linguistic data, as represented by constituency-based parse trees. In particular, we develop the LS-Tree value, a procedure that provides instance-wise importance scores for a model by minimizing the sum of squared residuals at every node of a parse tree for the sentence in consideration. We provide theoretical support for this by relating it to the Banzhaf value in coalitional game theory [4].
Our framework also provides a seedbed for studying compositionality in natural language. Based on the LS-Tree value, we develop a novel method for quantifying interactions between sibling nodes on a parse tree captured by the target model, by exploiting Cook’s distance in linear regression [7]. We show that the proposed algorithm can be used to analyze several aspects of widely-used NLP models, including nonlinearity, the ability to capture adversative relations, and overfitting. In particular, we carry out a series of experiments studying four models: a linear model with Bag-of-Words features, a convolutional neural network, an LSTM [14], and the recently proposed BERT model [9].
2 Least squares on parse trees
For simplicity, we restrict ourselves to classification. Assume a model maps a sentence of $d$ words, $x = (x_1, \dots, x_d)$, to a vector of class probabilities. We use $f$ to denote the function that maps an input sentence $x$ to the log probability score of a selected class. Let $2^{[d]}$ denote the powerset of $[d] := \{1, \dots, d\}$. The parse tree maps the sentence to a collection of subsets, denoted as $\mathcal{C} \subseteq 2^{[d]}$, where each subset $S \in \mathcal{C}$ contains the indices of words corresponding to one node in the parse tree. See Figure 1 for an example. By abuse of notation, we use $f(x_S)$ to denote the output of the model evaluated on the words with indices in $S$, with the rest of the words replaced by zero paddings or some reference placeholder. We call $v: \mathcal{C} \to \mathbb{R}$, defined by $v(S) := f(x_S)$, a characteristic function, which captures the importance of each word subset to the prediction.
We seek the optimal linear function on the Boolean hypercube to approximate the characteristic function on $\mathcal{C}$, and use the coefficients as importance scores assigned to each word. Concretely, we solve the following least squares problem:
$$\min_{\psi \in \mathbb{R}^d} \sum_{S \in \mathcal{C}} \Big[ v(S) - \sum_{i \in S} \psi_i \Big]^2, \tag{1}$$
where the $i$-th component of the optimal $\psi$ is the importance score of the word with index $i$. We name the map from $(\mathcal{C}, v)$ to the solution of Equation (1) the LS-Tree value, because it results from least squares (LS) on parse trees, and can be considered as a value in coalitional game theory.
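The least squares problem on parse-tree nodes can be illustrated with a minimal sketch. It assumes NumPy; the helper name `ls_tree_value`, the toy bracketing ((w0 w1) w2) and the node values are hypothetical stand-ins for a real parser and real model evaluations, not taken from the paper.

```python
import numpy as np

def ls_tree_value(node_subsets, node_values, num_words):
    """Least squares fit of node values by sums of per-word scores.

    node_subsets: list of sets of word indices, one per parse-tree node.
    node_values:  model evaluation on each node (made-up numbers below).
    Returns the minimum-norm least squares solution, one score per word.
    """
    X = np.zeros((len(node_subsets), num_words))
    for row, subset in enumerate(node_subsets):
        X[row, sorted(subset)] = 1.0  # Boolean indicator of the words in the node
    y = np.asarray(node_values, dtype=float)
    scores, *_ = np.linalg.lstsq(X, y, rcond=None)
    return scores

# Toy parse tree ((w0 w1) w2): three leaves, one inner node, and the root.
nodes = [{0}, {1}, {2}, {0, 1}, {0, 1, 2}]
values = [0.2, -1.0, 0.5, -0.3, 0.4]  # hypothetical model outputs
scores = ls_tree_value(nodes, values, num_words=3)

# Sanity check: for a purely additive model the word weights are recovered.
weights = np.array([1.0, -2.0, 3.0])
additive_values = [weights[sorted(s)].sum() for s in nodes]
recovered = ls_tree_value(nodes, additive_values, num_words=3)
```

Because every leaf contributes its own row, the design matrix in this toy case has full column rank and the solution is unique; `np.linalg.lstsq` is used so that rank-deficient cases fall back to the minimum-norm solution.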
3 Connection to coalitional game theory
In this section, we give an interpretation of the LS-Tree value from the perspective of coalitional game theory.
Model interpretation has been studied using tools from coalitional game theory [33, 8, 22, 5]. We build on this line of research by considering a restriction on coalitions induced by the syntactic structure of the input.
Let $\mathcal{C} \subseteq 2^{[d]}$ be the collection of word subsets constructed from the parse tree. Taking each word as a player, we can define a coalitional game between words in a sentence as a pair $(\mathcal{C}, v)$, where $\mathcal{C}$ enforces restrictions on coalition among players and $v: \mathcal{C} \to \mathbb{R}$ with $v(\emptyset) = 0$ is the characteristic function defined by the model evaluated on each coalition. A value is a mapping that associates a $d$-dimensional payoff vector $\phi(\mathcal{C}, v)$ to each game $(\mathcal{C}, v)$, each entry corresponding to a word. The value provides rules which give allocations to each player for any game.
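The problem of defining a fair value in the setting of full coalition (when $\mathcal{C} = 2^{[d]}$) has been studied extensively in coalitional game theory [28, 4]. One popular value is the Banzhaf value introduced by Banzhaf III [4]. For each $i \in [d]$ it defines the value:
$$\phi_i(v) = \frac{1}{2^{d-1}} \sum_{S \subseteq [d] \setminus \{i\}} \big[ v(S \cup \{i\}) - v(S) \big].$$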
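The Banzhaf value can be characterized as the unique value that satisfies the following four properties [24]:
i) Symmetry: If $v(S \cup \{i\}) = v(S \cup \{j\})$ for all $S \subseteq [d] \setminus \{i, j\}$, we have $\phi_i(v) = \phi_j(v)$.
ii) Dummy player property: If $v(S \cup \{i\}) = v(S) + v(\{i\})$ for all $S \subseteq [d] \setminus \{i\}$, we have $\phi_i(v) = v(\{i\})$.
iii) Marginal contributions: For any two characteristic functions $v, w$ such that $v(S \cup \{i\}) - v(S) = w(S \cup \{i\}) - w(S)$ for any $S \subseteq [d] \setminus \{i\}$, we have $\phi_i(v) = \phi_i(w)$.
iv) 2-Efficiency: If players $i$ and $j$ merge into a new player $p$, then $\phi_p(v') = \phi_i(v) + \phi_j(v)$, where $v'(S) = v(S)$ if $p \notin S$ and $v'(S) = v((S \setminus \{p\}) \cup \{i, j\})$ otherwise, for any $S \subseteq ([d] \setminus \{i, j\}) \cup \{p\}$.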
These properties are natural for allocation of importance to prediction in model interpretation. Symmetry states that two features have the same allocation if their marginal contributions to feature subsets are the same. The dummy property states that a feature is allocated the same amount as the contribution of itself alone if its marginal contribution always equals the model evaluation on its own. The linear model yields such an example. Marginal contributions states that a feature which has the same marginal contribution between two models for any word subset has the same amount of allocation. 2-Efficiency states that allocation of importance is immune to artificial merging of two features.
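As a small illustration of the Banzhaf value and of the dummy player property, the following sketch enumerates all coalitions by brute force. The function name `banzhaf_value` and the two toy games are assumptions made for illustration; the enumeration is exponential in the number of players and is only meant for very short inputs.

```python
from itertools import combinations

def banzhaf_value(v, d):
    """Brute-force Banzhaf value of each of d players for characteristic function v."""
    values = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                S = frozenset(subset)
                total += v(S | {i}) - v(S)  # marginal contribution of i to S
        values.append(total / 2 ** (d - 1))
    return values

# Additive game: every player is a dummy, so each receives its own weight.
weights = [1.0, -2.0, 3.0]
additive = banzhaf_value(lambda S: sum(weights[j] for j in S), 3)

# Game with one interaction: worth 1 only when players 0 and 1 are both present.
pair = banzhaf_value(lambda S: 1.0 if {0, 1} <= S else 0.0, 3)
```

In the second game players 0 and 1 have identical marginal contributions and receive the same allocation, in line with the symmetry property, while player 2 never changes the worth of a coalition and receives zero.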
To employ game-theoretic concepts such as the Banzhaf value in the interpretation of NLP models, we need to recognize that arbitrary combinations of words are not likely to be accepted as valid interpretations by humans. We might wish to start with a set of combinations that are likely to be interpretable by humans, and can be obtained via human-interpretable data, and then define the worth of other combinations of words via extrapolation. It turns out that the LS-Tree value as defined in the previous section can be interpreted as exactly such an extrapolation, where each node of the parse tree represents an interpretable word combination:
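Theorem 1. Suppose a value $\phi$ coincides with the Banzhaf value for any game of full coalition, and for every game $(\mathcal{C}, v)$ with restricted coalition, it is consistent under the addition of an arbitrary subset $S \notin \mathcal{C}$:
$$\phi(\mathcal{C} \cup \{S\}, v^{+}) = \phi(\mathcal{C}, v),$$
where $v^{+}$ is defined as $v^{+}(T) = v(T)$ for $T \in \mathcal{C}$ and $v^{+}(S) = \sum_{i \in S} \phi_i(\mathcal{C}, v)$. Then $\phi$ coincides with the LS-Tree value.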
Proof. It was shown in Hammer and Holzman [13] that the Banzhaf value assigns to each player the corresponding coefficient in the best linear approximation of $v$. That is,
Following the proof of Theorem 3.3 in Katsev [16] (Footnote 1: The original theorem is established for the solution to Problem (3) with an efficiency constraint, but the same proof applies to the unconstrained version.), it follows directly that the value defined by Equation (3) is the unique value that coincides with the Banzhaf value under full coalition and is consistent under the addition of an arbitrary subset:
Taking , the theorem is established. ∎
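The consistency condition used in the theorem can be checked numerically for the least squares fit on a toy tree. This is a sketch under stated assumptions: NumPy is assumed, and the helper `fit`, the toy nodes and the node values are hypothetical. It adds one extra word subset whose worth is set to the sum of the fitted scores of its words and refits.

```python
import numpy as np

def fit(node_subsets, node_values, num_words):
    """Least squares scores for the given word subsets and their values."""
    X = np.zeros((len(node_subsets), num_words))
    for row, subset in enumerate(node_subsets):
        X[row, sorted(subset)] = 1.0
    y = np.asarray(node_values, dtype=float)
    solution, *_ = np.linalg.lstsq(X, y, rcond=None)
    return solution

nodes = [{0}, {1}, {2}, {0, 1}, {0, 1, 2}]
values = [0.2, -1.0, 0.5, -0.3, 0.4]  # hypothetical model outputs
before = fit(nodes, values, 3)

# Add a subset that is not a tree node, valued by extrapolating the fitted scores.
extra = {1, 2}
extra_value = before[sorted(extra)].sum()
after = fit(nodes + [extra], values + [extra_value], 3)
```

The new row has zero residual at the old solution, so the old solution still minimizes the enlarged sum of squares, and `after` agrees with `before` up to floating point error.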
4 Detecting interactions
We aim to detect and quantify interactions between words in a sentence that have been captured by the target model. While there are exponentially many possible interactions between arbitrary words, we restrict ourselves to the ones permitted by the structure of language. Concretely, we focus on interactions between siblings, or nodes with a common parent, in the parse tree. As an example, node in Figure 1 represents interaction between “is,” “not” and “heartwarming or entertaining.”
We define interaction as deviation of composition from linearity in a given sentence. As a result, all non-leaf nodes in the tree are expected to admit zero interaction for a linear model. The above definition suggests that interaction can be quantified by studying how the inclusion of a common parent representing the interaction affects the coefficients of the linear approximation of the model.
Cook’s distance is a classic metric in linear regression that captures the influence of a data point [7]. It is defined as a constant multiple of the squared distance between coefficients after a data point is removed, where the distance metric is defined by the data matrix $X$:
$$D_i = \mathrm{Const} \cdot (\hat{\beta}_{(i)} - \hat{\beta})^\top X^\top X (\hat{\beta}_{(i)} - \hat{\beta}),$$
where $\hat{\beta}_{(i)}$ and $\hat{\beta}$ are the least squares estimate with the $i$-th data point deleted and the original least squares estimate respectively. A larger Cook’s distance indicates a larger influence of the corresponding data point.
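For reference, Cook's distance in an ordinary regression can be computed by explicit refitting and compared with the standard closed form based on leverages. The sketch below assumes NumPy, uses synthetic data, and takes the usual normalizing constant 1/(p s^2); the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.normal(size=(n, p))                      # synthetic design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)          # full least squares estimate
residuals = y - X @ beta
s2 = residuals @ residuals / (n - p)              # residual variance estimate

def cooks_distance(i):
    """Cook's distance of point i, obtained by deleting row i and refitting."""
    keep = np.arange(n) != i
    beta_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    diff = beta_i - beta
    return diff @ (X.T @ X) @ diff / (p * s2)

distances = np.array([cooks_distance(i) for i in range(n)])

# Closed form using the leverages h_ii (diagonal of the hat matrix).
leverage = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
closed_form = residuals**2 * leverage / (p * s2 * (1.0 - leverage) ** 2)
```

The two computations agree, which is the same rank-one structure that makes deletion of a single row cheap to evaluate.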
In our setting, the data matrix $X$ is a Boolean matrix where each row corresponds to a node in the tree, and an entry is one if and only if the word of the corresponding index lies in the subtree of the node. To capture the interaction of a non-leaf node (corresponding to some $S \in \mathcal{C}$), it does not suffice to only delete the corresponding row, because all of its ancestor nodes contain the segment represented by the node as well. To deal with this issue, we compute the distance between the least squares estimate with the rows corresponding to the node and all of its ancestors deleted, and the least squares estimate with only the rows corresponding to the ancestors deleted:
$$D_S = \mathrm{Const} \cdot (\hat{\psi}^{-S} - \hat{\psi}^{+S})^\top X^\top X (\hat{\psi}^{-S} - \hat{\psi}^{+S}),$$
where $\hat{\psi}^{-S}$ and $\hat{\psi}^{+S}$ denote the estimates with all its ancestors, including and excluding node $S$, deleted (Footnote 2: Only entries corresponding to words within node $S$ will differ between $\hat{\psi}^{-S}$ and $\hat{\psi}^{+S}$, but we retain the remaining entries for notational simplicity.). Cook’s distance no longer has its statistical meaning here, as the normality assumption of the linear model no longer holds. A natural choice is the Euclidean distance $\|\hat{\psi}^{+S} - \hat{\psi}^{-S}\|_2$, which was also introduced by Cook [7]. One drawback of the Euclidean distance is that it is unable to capture the direction of interaction. When this is an issue, we may use a signed distance: $\sum_{i=1}^{d} (\hat{\psi}^{+S}_i - \hat{\psi}^{-S}_i)$, which sums up the influence of introducing the extra row on every coefficient of the linear model. We call the scores defined by the Euclidean distance and the signed distance the absolute and signed LS-Tree interaction scores respectively, as they are constructed from the LS-Tree value.
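The definition of the interaction scores can be sketched directly by refitting with rows deleted. This naive version is for illustration only and is not the efficient procedure developed next; it assumes NumPy, treats the ancestors of a node as the nodes that strictly contain it, and uses hypothetical helper names and toy values.

```python
import numpy as np

def fit(rows, targets, num_words):
    """Minimum-norm least squares scores for the given node rows."""
    X = np.zeros((len(rows), num_words))
    for r, subset in enumerate(rows):
        X[r, sorted(subset)] = 1.0
    y = np.asarray(targets, dtype=float)
    solution, *_ = np.linalg.lstsq(X, y, rcond=None)
    return solution

def interaction_scores(nodes, values, num_words, node):
    """Return (absolute, signed) interaction scores of `node` by refitting."""
    node = frozenset(node)
    # Delete ancestors: in a tree these are the nodes that strictly contain `node`.
    kept = [(frozenset(s), val) for s, val in zip(nodes, values)
            if not frozenset(s) > node]
    with_node = fit([s for s, _ in kept], [val for _, val in kept], num_words)
    rest = [(s, val) for s, val in kept if s != node]
    without_node = fit([s for s, _ in rest], [val for _, val in rest], num_words)
    diff = with_node - without_node
    return float(np.linalg.norm(diff)), float(diff.sum())

nodes = [{0}, {1}, {2}, {0, 1}, {0, 1, 2}]
values = [0.2, -1.0, 0.5, -0.3, 0.4]  # hypothetical model outputs
inner_abs, inner_signed = interaction_scores(nodes, values, 3, {0, 1})
leaf_abs, leaf_signed = interaction_scores(nodes, values, 3, {2})
```

In this toy example the signed score of the leaf `{2}` equals its own value 0.5, because after deleting the leaf row and its ancestors no remaining row involves that word and the minimum-norm fit sets its coefficient to zero.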
We propose an iterative algorithm to efficiently compute the interaction of each node on a tree with nodes. As a first step, model evaluations are performed, one evaluation for each node. For a node , we denote as the set of its children, and the data matrices excluding the ancestors of , further excluding and including itself respectively, and the row corresponding to node . The interaction score of each is a function of . Denote . For each non-leaf node , is of full rank and thus invertible. We show how and can be computed from and . In fact, with an application of the Sherman-Morrison formula , we have
Rearranging the terms in Equation (5), we have
With another application of the Sherman-Morrison formula, we have
For leaf nodes, the entry of the estimate excluding the leaf that corresponds to the leaf's word is set to zero, with the remaining entries equal to those of the estimate including the leaf. This is a result of taking the minimal Euclidean norm solution of Problem (1), obtained from the pseudoinverse of the data matrix. Consequently, the (signed) interaction score of a leaf equals the model evaluation on the leaf alone.
We summarize the derivation in Algorithm 1, which traverses on the parse tree from root to leaves in a top-down fashion to compute the interaction scores of each node. As the number of nodes in a parse tree is linear in the number of words, Algorithm 1 is of complexity , plus the complexity of parsing the sentence, which is in our experiments, and model evaluations. Figure 1 shows how Algorithm 1 assigns signed interaction scores to a given example.
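The rank-one update that underlies this kind of recursion can be sketched with the general Sherman-Morrison identity. This is an illustration of the identity itself, not a reproduction of the paper's Algorithm 1 or its equations; NumPy is assumed, and the function name `remove_row_inverse` and the toy matrix are hypothetical.

```python
import numpy as np

def remove_row_inverse(gram_inverse, row):
    """Inverse of (G - row row^T) computed from the inverse of G (Sherman-Morrison).

    gram_inverse: inverse of the symmetric Gram matrix G = X^T X.
    row:          the row of X being deleted.
    """
    u = gram_inverse @ row
    return gram_inverse + np.outer(u, u) / (1.0 - row @ u)

# Boolean node matrix of the toy tree ((w0 w1) w2); the last row is the root.
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)

gram_inverse = np.linalg.inv(X.T @ X)
updated = remove_row_inverse(gram_inverse, X[4])   # delete the root row
direct = np.linalg.inv(X[:4].T @ X[:4])            # recompute from scratch
```

Each update costs a matrix-vector product and an outer product rather than a fresh inversion, which is the reason for propagating inverses along the tree instead of refitting at every node.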
5 Experiments
We carry out experiments to analyze the performance of four different models: Bag of Words (BoW), Word-based Convolutional Neural Network (CNN) [17], bidirectional Long Short-Term Memory network (LSTM), and Bidirectional Encoder Representations from Transformers (BERT) [9], across three sentiment data sets of different sizes: Stanford Sentiment Treebank (SST) [32], IMDB Movie reviews [23] and Yelp reviews Polarity [39]. For an instance with multiple sentences, we parse each sentence separately, and introduce an extra node as the common parent of all roots. Interactions between sentences are not considered in our experiments.
BoW fits a linear model on the Bag-of-Words features. Both CNN and LSTM use a 300-dimensional GloVe word embedding [25]. The CNN is composed of three 100-dimensional 1D convolutional layers with kernel sizes 3, 4 and 5 respectively, concatenated and fed into a max-pooling layer followed by a hidden dense layer. The LSTM uses a bidirectional LSTM layer with 128 units for each direction. BERT pre-trains a deep bidirectional Transformer on a large corpus of text by jointly conditioning on both left and right context in all layers. It has achieved state-of-the-art performance on a large suite of sentence-level and token-level tasks. See Table 1 for a summary of data sets and the accuracies of the four models.
We use the Stanford constituency parser [11, 27, 40, 41] for all the experiments. It is a transition-based parser that is faster than chart-based parsers yet achieves comparable accuracy, by employing a set of shift-reduce operations and making use of non-local features.
5.1 Deviation from linearity
We quantify the deviation of three nonlinear models from a linear model via the proposed LS-Tree value and interaction scores, both for specific instances and on a data set.
The LS-Tree value can be interpreted as supplying the coefficients of the best linear model used to approximate the target model locally for each instance. The correlation between the LS-Tree value and the global linear model with Bag of Words (BoW) features can be used as a measure of nonlinearity of the target model at the instance. Table 3 shows three examples in SST, correctly classified by both BERT and BoW. BERT has low and high correlations with linear models at the first and second examples in Table 3 respectively. In particular, the top keywords, as ranked by the LS-Tree value, are different between two models.
|Data Set||Classes||Train Samples||Test Samples||Avg. Length||BoW||CNN||LSTM||BERT|
The average of correlation with BoW across instances can be used as a measure of nonlinearity on a certain data set. The average correlation of BoW, CNN, LSTM and BERT with a linear model is shown in Table 2, which indicates that BERT is the most nonlinear model among the four. CNN is more nonlinear than LSTM on IMDB but comparably nonlinear on SST and Yelp.
|Even if you do n’t think kissinger’s any more guilty of criminal activity than most contemporary statesmen, he’d sure make a courtroom trial great fun to watch.||Even if you don’t think kissinger’s any more guilty of criminal activity than most contemporary statesmen, he’d sure make a courtroom trial great fun to watch.||Positive||0.173||11|
|The problem with this film is that it lacks focus.||The problem with this film is that it lacks focus.||Negative||0.939||1|
|Funny but perilously slight.||Funny but perilously slight.||Positive||0.938||4|
Correlation alone may not suffice to capture the nonlinearity of a model. For example, the third sentence in Table 3 has a relatively high correlation, but the bottom left parse tree in Figure 2 indicates that the top interaction ranked by the signed interaction score is the node combining “funny” with “but perilously slight”. This indicates that the BERT model has captured the adversative conjunction, which BoW is not capable of capturing. The ability to capture nodes closer to the top of a parse tree is an indication of nonlinearity of the model. To quantify this ability, we define the depth of a node in the parse tree as the maximum distance of the node from the bottom:
$$\mathrm{Depth}(S) = \begin{cases} 1 & \text{if } S \text{ is a leaf}, \\ 1 + \max_{T \in \mathrm{children}(S)} \mathrm{Depth}(T) & \text{otherwise}. \end{cases}$$
For a linear model, all non-leaf nodes have zero interaction, and thus the top ranked nodes are of depth 1, until all leaves with positive weights are enumerated. The higher the depth of top-ranked nodes, the more nonlinear a model is at a specific instance.
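The depth of a node, with leaves at depth 1 as stated above, can be computed recursively. The sketch below represents a parse tree as nested tuples of words; this representation and the bracketing of the example sentence are assumptions made for illustration and need not match the output of the parser used in the experiments.

```python
def depth(tree):
    """Depth of a node: 1 for a leaf (a single word), else 1 + the deepest child."""
    if isinstance(tree, str):
        return 1
    return 1 + max(depth(child) for child in tree)

# Hypothetical bracketing of "Funny but perilously slight".
tree = ("Funny", ("but", ("perilously", "slight")))
root_depth = depth(tree)
phrase_depth = depth(("perilously", "slight"))
```

Under this bracketing the root, which combines "Funny" with "but perilously slight", has depth 4, while the phrase "perilously slight" has depth 2.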
The average depth of the top nodes ranked by interaction scores across instances can be used as a measure of the nonlinearity of the model on that data set. Figure 3 compares the average depths across BoW, CNN, LSTM and BERT on the three data sets, with top words selected. BoW is used as a baseline whose non-leaf nodes have zero interaction scores. We use the absolute interaction scores here to capture all interactions, regardless of whether they act in the same direction as the prediction or the opposite one. BERT is still the most capable of capturing deeper interactions, followed by CNN and LSTM. CNN turns out to be a more nonlinear model than LSTM on Yelp, which was not captured by correlation.
|Dataset||Model||Avg. Score||not||but||yet||though||although||even though||whereas||except||despite||in spite of|
|… He said he couldn’t help. We had to walk while the snow blew in our faces. When we were almost there, we saw the shuttle pull out with the smoking shuttle driver in it, driving in the opposite direction, away from us. I can not believe how rude they were.||during the time that||0.000(0.338)||0.781(0.300)||1.761(0.839)||0.062(0.092)|
|… I ordered a cappuccino. It tasted like milk and no coffee. I was exceptionally disappointed. So while the place has a great reputation, even they can screw it up if they don t pay attention to detail, and at this level they should never screw it up. I had a better cup at Martys Market for crying out loud!||whereas (indicating a contrast)||0.000(0.338)||1.142(0.300)||2.155(0.839)||2.167(0.092)|
|Usually asking the server what is her favorite dish gets you a pretty good recommendation, but in this case, it was crap! The smoked brisket had that discoloration that happens to meat when it’s been sitting out for a while. And it wasn’t even tender!! Am I asking for too much?||a period of time||0.000(0.338)||0.206(0.300)||0.465(0.839)||0.082 (0.092)|
5.2 Adversative relations
Adversative words are those which express opposition or contrast. They often play an important role in determining the sentiment of an instance, by reversing the sentiment of a preceding or succeeding word, phrase or statement. We focus on four types of adversative words: negation that reverses the sentiment of a phrase or word (e.g., “not”), adversative coordinating conjunctions that express opposition or contrast between two statements (e.g., “but” and “yet”), subordinating conjunctions indicating adversative relationship (e.g., “though,” “although,” “even though,” and “whereas”), prepositions that precede and govern nouns adversatively (e.g., “except,” “despite” and “in spite of”).
In most cases, adversative words only function if they interact with their preceding or succeeding companion. In order to verify whether models are able to capture the adversative relationship, we examine the LS-Tree interaction scores of the parent nodes of these words.
We extract all instances that contain any of the above adversative words. Then for each word in an instance, we compute the interaction score of the corresponding node with the word alone, and that of its parent node. A high interaction score on the node with the adversative word alone indicates the model inappropriately attributes to the word itself a negative or positive sentiment. A high interaction score on the parent node indicates the model captures the interaction of the adversative word with its preceding or succeeding component. To compare across different models, we further compute the average interaction score of a generic node across all instances, and report the ratio of average interaction scores of specific nodes to the average score of a generic node for respective models.
Table 4 reports the results on three data sets. We observe the ability of capturing adversative relation for different models varies across data sets. BERT takes the lead in capturing adversative relations on SST and IMDB, perhaps with the help of BERT’s pre-training process on a large corpus, but CNN and LSTM catch up with BERT on Yelp, which has a larger amount of training data. On the other hand, all models assign a high score on nodes with adversative words alone. This perhaps results from the uneven distribution of adversative words like “not” among the positive and negative classes. An additional observation is that BERT has the highest score for a generic node on average across three data sets, indicating that BERT is the most sensitive to words and interactions on average.
Some words have different meanings in different contexts. It is interesting to investigate whether a model can distinguish the same word under different contexts. The word “while” is such an example. Table 5 shows three Yelp reviews that include “while.” It can be observed that the scores of the parent nodes of “while” are higher than average when “while” carries an adversative meaning, but lower otherwise. This observation holds across CNN, LSTM and BERT, with the sharpest distinction on BERT.
Figure 5: The three panels in the first row plot the training and test loss of CNN, LSTM and BERT respectively. The panels in the second row plot the corresponding average variance of interaction scores across instances over the training and test sets. The panels in the third row show p-values of permutation tests with randomly selected instances in the training and test sets respectively.
5.3 Detecting overfitting
Overfitting happens when a model captures sampling noise in training data, while failing to capture underlying relationships between the inputs and outputs. Overfitting can be a problem in modern machine learning models like deep neural networks, due to their expressive nature. To mitigate overfitting, one often splits the initial training set into a training and a validation set, and uses the latter to obtain an estimate of the generalization performance [18]. This leads to a waste of training data, depriving the model of potential opportunities to learn from the labelled validation data. We observe that the LS-Tree interaction scores can be used to construct a diagnostic for overfitting, one which is solely computed with unlabelled data.
Figure 4 shows the histograms of absolute interaction scores on small subsets of training and test data of SST, for an overfitted BERT model. The scores are more spread out on test data than those on training data. In fact, we have observed that this phenomenon holds true on average across instances for an overfitted model. In particular, interaction scores of test instances have a larger variance on average than those of training instances when the model is overfitted, but comparable otherwise. The observation can also be generalized to other types of neural networks, including CNN and LSTM. We show in Figure 5 the average variance on training and test sets for CNN, LSTM and BERT models against training epochs, together with the loss curves. We observe that overfitting occurs when the average variances on the training and test sets diverge.
The observation suggests we may use the difference of average variances of interaction scores between training and test sets as a diagnostic for overfitting. In particular, a permutation test can be carried out under the null hypothesis of equal average variance. The resulting p-values are plotted against the number of training epochs in the third row of Figure 5. It can be observed that the p-values fall below the significance level when overfitting occurs, which suggests using the rejection of the null hypothesis as an early stopping criterion in training.
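A permutation test of this kind can be sketched in a few lines of standard-library Python. The inputs are assumed to be per-instance variances of interaction scores, already computed for a sample of training and test instances; the function name, the two-sided statistic, the number of permutations and the toy numbers are all assumptions, not the settings used in the experiments.

```python
import random
from statistics import mean

def permutation_p_value(train_variances, test_variances,
                        num_permutations=999, seed=0):
    """Two-sided permutation test for equal average variance in the two groups."""
    rng = random.Random(seed)
    observed = abs(mean(test_variances) - mean(train_variances))
    pooled = list(train_variances) + list(test_variances)
    n_train = len(train_variances)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # random reassignment of instances to the two groups
        stat = abs(mean(pooled[n_train:]) - mean(pooled[:n_train]))
        if stat >= observed:
            hits += 1
    # Add-one correction so that the estimated p-value is never exactly zero.
    return (1 + hits) / (1 + num_permutations)

# Hypothetical per-instance variances.
similar = permutation_p_value([0.2, 0.3, 0.25, 0.35], [0.2, 0.3, 0.25, 0.35])
separated = permutation_p_value([0.1] * 20, [1.0] * 20)
```

When the two groups are indistinguishable the p-value is large and the null hypothesis is retained; when the test variances are clearly larger the p-value becomes small, which under the diagnostic described above would be read as a sign of overfitting.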
We have proposed the LS-Tree value as a fundamental quantity for interpreting NLP models. This value leverages a constituency-based parser so that syntactic structure can play a role in determining interpretations. We have also presented an algorithm based on the LS-Tree value for detecting interactions between siblings of a parse tree. To the best of our knowledge, this is the first model-interpretation algorithm to quantify the interaction between words for arbitrary NLP models. We have applied the proposed algorithm to the problem of assessing the nonlinearity of common neural network models and the effect of adversative relations on the models. We have presented a permutation test based on the LS-Tree interaction scores as a diagnostic for overfitting.
References
[1] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
[2] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One, 10(7):e0130140, 2015.
[3] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11:1803–1831, 2010.
[4] John F Banzhaf III. Weighted voting doesn’t work: A mathematical analysis. Rutgers L. Rev., 19:317, 1964.
[5] Jianbo Chen, Le Song, Martin J. Wainwright, and Michael I. Jordan. L-Shapley and C-Shapley: Efficient model interpretation for structured data. In International Conference on Learning Representations, 2019.
[6] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.
[7] R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 19(1):15–18, 1977.
[8] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 598–617. IEEE, 2016.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] F. Godin, K. Demuynck, J. Dambre, W. De Neve, and T. Demeester. Explaining character-aware neural networks for word-level prediction: Do they discover linguistic rules? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3275–3284. ACL, 2018.
[11] Yoav Goldberg and Joakim Nivre. A dynamic oracle for arc-eager dependency parsing. Proceedings of COLING 2012, pages 959–976, 2012.
[12] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[13] Peter L Hammer and Ron Holzman. Approximations of pseudo-boolean functions; applications to game theory. Zeitschrift für Operations Research, 36(1):3–21, 1992.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. In International Conference on Learning Representations, 2016.
[16] Ilya Katsev. The least square values for games with restricted cooperation. In Game Theory and Management, page 117, 2011.
[17] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. ACL, 2014.
[18] Selmer C Larson. The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22(1):45, 1931.
[19] Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 107–117. ACL, 2016.
[20] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. In Proceedings of NAACL-HLT, pages 681–691, 2016a.
[21] Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016b.
[22] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
[23] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 142–150. ACL, 2011.
[24] Andrzej S Nowak. On an axiomatization of the Banzhaf value without the additivity axiom. International Journal of Game Theory, 26(1):137–141, 1997.
[25] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. ACL, 2014.
[26] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[27] Kenji Sagae and Alon Lavie. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132. ACL, 2005.
[28] Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
[29] Jack Sherman and Winifred J Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.
[30] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR, 06–11 Aug 2017.
[31] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[32] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642. ACL, 2013.
[33] Erik Štrumbelj and Igor Kononenko. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11:1–18, 2010.
[34] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328, 2017.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[36] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
[37] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[38] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
[39] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
[40] Yue Zhang and Stephen Clark. Transition-based parsing of the Chinese treebank using a global discriminative model. In Proceedings of the 11th International Conference on Parsing Technologies, pages 162–171. ACL, 2009.
[41] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 434–443, 2013.