LS-Tree: Model Interpretation When the Data Are Linguistic

02/11/2019
by Jianbo Chen, et al.

We study the problem of interpreting trained classification models in the setting of linguistic data sets. Leveraging a parse tree, we propose to assign least-squares based importance scores to each word of an instance by exploiting syntactic constituency structure. We establish an axiomatic characterization of these importance scores by relating them to the Banzhaf value in coalitional game theory. Based on these importance scores, we develop a principled method for detecting and quantifying interactions between words in a sentence. We demonstrate that the proposed method can aid in interpretability and diagnostics for several widely-used language models.


1 Introduction

Modern machine learning models can be difficult to probe and understand after they have been trained. This is a major problem for the field, with consequences for trustworthiness, diagnostics, debugging, robustness and a range of other engineering and human interaction issues surrounding the deployment of a model.

There have been several lines of attack on this problem. One involves changing the model design or training process so as to enhance interpretability. This can involve retreating to simpler models and/or incorporating strong regularizers that effectively simplify a complex model. In both cases, however, there is a possible loss of prediction accuracy. Models can also be changed in more sophisticated ways to enhance interpretability; for example, attention-based methods have yielded deep models for vision and language tasks that improve interpretability at no loss to prediction accuracy [1, 37, 12, 6, 38, 36, 35].

Another approach treats interpretability as a problem separate from prediction. Given a predictive model, an interpretation method yields, for each instance to which the model is applied, a vector of importance scores associated with the underlying features. Within this general framework, methods can be classified as model-agnostic or model-aware. Model-aware methods require additional assumptions, or are specific to a certain class of models [31, 2, 30, 15, 34, 10]. Model-agnostic methods can be applied in a black-box manner to arbitrary models [26, 3, 22, 33, 8, 21].

While the generality of the stand-alone approach to interpretation is appealing, current methods provide little opportunity to leverage prior knowledge about what constitutes a satisfying interpretation in a given domain. Such prior knowledge has figured most notably in natural-language processing (NLP), where there is an ongoing effort to incorporate linguistic structure (syntactic, semantic, and pragmatic) into machine learning models. Such structure can be brought to bear in the model, in the interpretation of a model, or in both. For example, Socher et al. [32] introduced a recursive deep model to understand and leverage compositionality in tasks such as sentiment detection, and Lei et al. [19] proposed a combination of two modular components, a generator and an encoder, to explicitly generate rationales and make predictions for NLP tasks.

Compositionality, the rules by which a sentence is constructed from its constituent expressions, is an important property of natural language. While current interpretation methods fall short of quantifying compositionality directly, there has been growing interest in investigating the manner in which existing deep models capture the interactions between constituent expressions that are critical for successful prediction [20, 19, 21, 10]. However, existing approaches often lack both a systematic treatment of how interactions are quantified and the generality to be applied to arbitrary models.

In the current paper, we focus on the model-agnostic interpretation of NLP models. Our approach quantifies the importance of words by leveraging the syntactic structure of linguistic data, as represented by constituency-based parse trees. In particular, we develop the LS-Tree value, a procedure that provides instance-wise importance scores for a model by minimizing the sum of squared residuals at every node of a parse tree for the sentence under consideration. We provide theoretical support for this procedure by relating it to the Banzhaf value in coalitional game theory [4].

Our framework also provides a seedbed for studying compositionality in natural language. Based on the LS-Tree value, we develop a novel method for quantifying the interactions between sibling nodes of a parse tree that are captured by the target model, exploiting Cook's distance in linear regression [7]. We show that the proposed algorithm can be used to analyze several aspects of widely used NLP models, including nonlinearity, the ability to capture adversative relations, and overfitting. In particular, we carry out a series of experiments studying four models: a linear model with Bag-of-Words features, a convolutional neural network [17], an LSTM [14], and the recently proposed BERT model [9].

2 Least squares on parse trees

For simplicity, we restrict ourselves to classification. Assume a model maps a sentence to a vector of class probabilities. We use $f$ to denote the function that maps an input sentence $x = (x_1, x_2, \dots, x_d)$ of $d$ words to the log-probability score of a selected class. Let $2^{[d]}$ denote the power set of $[d] = \{1, 2, \dots, d\}$. The parse tree maps the sentence to a collection of subsets $\mathcal{P} \subseteq 2^{[d]}$, where each subset $S \in \mathcal{P}$ contains the indices of the words corresponding to one node of the parse tree. See Figure 1 for an example. By abuse of notation, we use $f(S)$ to denote the output of the model evaluated on the words with indices in $S$, with the rest of the words replaced by zero paddings or some reference placeholder. We call the function $v: \mathcal{P} \to \mathbb{R}$ defined by $v(S) = f(S)$ a characteristic function; it captures the importance of each word subset to the prediction.

Figure 1: An example parse tree. Top left shows how each node corresponds to a word subset. Color indicates the direction and strength of interaction as assigned by Algorithm 1. Red is used for the direction of positive class, and blue otherwise.

We seek the optimal linear function on the Boolean hypercube to approximate the characteristic function $v$ on $\mathcal{P}$, and use its coefficients as importance scores assigned to each word. Concretely, we solve the following least squares problem:

$$\min_{\psi \in \mathbb{R}^d} \ \sum_{S \in \mathcal{P}} \Big( v(S) - \sum_{i \in S} \psi_i \Big)^2, \tag{1}$$

where component $\psi_i^*$ of the optimal solution $\psi^*$ is the importance score of the word with index $i$. We name the map from $(\mathcal{P}, v)$ to the solution of Equation (1) the LS-Tree value, because it results from least squares (LS) on parse trees, and can be considered as a value in coalitional game theory.
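To make the construction concrete, the following is a minimal sketch (ours, not the authors' code) of how the LS-Tree value can be computed once the model evaluations $v(S)$ have been collected for every parse-tree node; the toy inputs are hypothetical:

```python
import numpy as np

def ls_tree_value(subsets, v, d):
    """Solve Problem (1): least squares of v(S) against sum_{i in S} psi_i.

    subsets -- list of index sets, one per parse-tree node
    v       -- dict mapping frozenset(S) to the model evaluation f(S)
    d       -- number of words in the sentence
    """
    A = np.zeros((len(subsets), d))  # Boolean node-word incidence matrix
    y = np.zeros(len(subsets))
    for row, S in enumerate(subsets):
        A[row, list(S)] = 1.0
        y[row] = v[frozenset(S)]
    # lstsq returns the minimum-norm solution when A is rank-deficient
    psi, *_ = np.linalg.lstsq(A, y, rcond=None)
    return psi

# Toy two-word sentence "not good" with nodes {0}, {1}, and {0, 1}.
v = {frozenset({0}): 0.1,       # "not" alone
     frozenset({1}): 1.2,       # "good" alone
     frozenset({0, 1}): -0.8}   # "not good": the pair flips the score
print(ls_tree_value([{0}, {1}, {0, 1}], v, d=2))  # -> [-0.6  0.5]
```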

3 Connection to coalitional game theory

In this section, we give an interpretation of the LS-Tree value from the perspective of coalitional game theory.

Model interpretation has been studied using tools from coalitional game theory [33, 8, 22, 5]. We build on this line of research by considering a restriction on coalitions induced by the syntactic structure of the input.

Let $\mathcal{P} \subseteq 2^{[d]}$ be the collection of word subsets constructed from the parse tree. Taking each word as a player, we can define a coalitional game between the words of a sentence as a pair $(\mathcal{P}, v)$, where $\mathcal{P}$ enforces restrictions on coalitions among players and $v: \mathcal{P} \to \mathbb{R}$ with $v(\emptyset) = 0$ is the characteristic function defined by the model evaluated on each coalition. A value is a mapping $\phi$ that associates a $d$-dimensional payoff vector $\phi(\mathcal{P}, v) \in \mathbb{R}^d$ with each game $(\mathcal{P}, v)$, each entry corresponding to a word. The value provides rules that give allocations to each player for any game.

The problem of defining a fair value in the setting of full coalition (when $\mathcal{P} = 2^{[d]}$) has been studied extensively in coalitional game theory [28, 4]. One popular value is the Banzhaf value, introduced by Banzhaf III [4]. For each $i \in [d]$ it defines the value

$$\phi_i^{\mathrm{Bz}}(v) = \frac{1}{2^{d-1}} \sum_{S \subseteq [d] \setminus \{i\}} \big[ v(S \cup \{i\}) - v(S) \big].$$

The Banzhaf value can be characterized as the unique value that satisfies the following four properties [24]:

i) Symmetry: If $v(S \cup \{i\}) = v(S \cup \{j\})$ for all $S \subseteq [d] \setminus \{i, j\}$, we have $\phi_i(v) = \phi_j(v)$.

ii) Dummy player property: If $v(S \cup \{i\}) = v(S) + v(\{i\})$ for all $S \subseteq [d] \setminus \{i\}$, we have $\phi_i(v) = v(\{i\})$.

iii) Marginal contributions: For any two characteristic functions $v$ and $w$ such that $v(S \cup \{i\}) - v(S) = w(S \cup \{i\}) - w(S)$ for any $S \subseteq [d] \setminus \{i\}$, we have $\phi_i(v) = \phi_i(w)$.

iv) 2-Efficiency: If players $i$ and $j$ merge into a new player $p$, then $\phi_p(v_{ij}) = \phi_i(v) + \phi_j(v)$, where $v_{ij}(S) = v(S)$ if $p \notin S$ and $v_{ij}(S) = v\big((S \setminus \{p\}) \cup \{i, j\}\big)$ otherwise, for any $S$.

These properties are natural for the allocation of importance in model interpretation. Symmetry states that two features receive the same allocation if their marginal contributions to feature subsets are the same. The dummy player property states that a feature is allocated exactly the contribution of the feature alone if its marginal contribution always equals the model evaluation on the feature by itself; a linear model yields such an example. Marginal contributions states that a feature whose marginal contribution is the same under two models for every word subset receives the same allocation under both. 2-Efficiency states that the allocation of importance is immune to artificial merging of two features.
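For intuition, the Banzhaf value under full coalition can be computed by brute force (exponential in $d$, so only feasible for very short sentences); a self-contained sketch:

```python
from itertools import combinations

def banzhaf(v, d):
    """Banzhaf value: average marginal contribution of player i over
    all 2^{d-1} subsets S that do not contain i."""
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                phi[i] += v(S | {i}) - v(S)
        phi[i] /= 2 ** (d - 1)
    return phi

# Example: player 0 is a veto player; the coalition is worthless without it.
v = lambda S: float(0 in S)
print(banzhaf(v, d=3))  # -> [1.0, 0.0, 0.0]
```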

To employ game-theoretic concepts such as the Banzhaf value in the interpretation of NLP models, we need to recognize that arbitrary combinations of words are not likely to be accepted as valid interpretations by humans. We might instead wish to start with a set of word combinations that are likely to be interpretable by humans, such as those supplied by a parse of the sentence, and then define the worth of other combinations of words via extrapolation. It turns out that the LS-Tree value defined in the previous section can be interpreted as exactly such an extrapolation, where each node of the parse tree represents an interpretable word combination:

Theorem 1.

Suppose a value $\phi$ coincides with the Banzhaf value for any game of full coalition and, for every game $(\mathcal{P}, v)$ with restricted coalition, is consistent under the addition of an arbitrary subset $T \in 2^{[d]} \setminus \mathcal{P}$:

$$\phi(\mathcal{P}, v) = \phi\big(\mathcal{P} \cup \{T\}, \bar v\big), \tag{2}$$

where $\bar v$ is defined as $\bar v(S) = v(S)$ for $S \in \mathcal{P}$ and $\bar v(T) = \sum_{i \in T} \phi_i(\mathcal{P}, v)$. Then $\phi$ coincides with the LS-Tree value.

Proof.

It was shown in Hammer and Holzman [13] that the Banzhaf value assigns to each player the corresponding linear coefficient of the best linear approximation of $v$ over the full power set. That is, $\phi_i^{\mathrm{Bz}}(v) = \psi_i^*$, where $(\psi_0^*, \psi^*)$ minimizes $\sum_{S \subseteq [d]} \big( v(S) - \psi_0 - \sum_{i \in S} \psi_i \big)^2$. Following the proof of Theorem 3.3 in Katsev [16],¹ it follows directly that $\phi(\mathcal{P}, v)$, as defined by Equation (3), is the unique value that coincides with the Banzhaf value under full coalition and is consistent under the addition of an arbitrary subset:

$$\phi(\mathcal{P}, v) = \operatorname*{arg\,min}_{\psi \in \mathbb{R}^d} \ \sum_{S \in \mathcal{P}} \Big( v(S) - \sum_{i \in S} \psi_i \Big)^2. \tag{3}$$

Taking $\mathcal{P}$ to be the collection of subsets induced by the parse tree, the theorem is established. ∎

¹The original theorem is established for the solution to Problem (3) with the efficiency constraint $\sum_{i \in [d]} \psi_i = v([d])$, but the same proof goes through for the unconstrained version.

4 Detecting interactions

We aim to detect and quantify the interactions between words in a sentence that have been captured by the target model. While there are exponentially many possible interactions between arbitrary words, we restrict ourselves to those permitted by the structure of language. Concretely, we focus on interactions between siblings, that is, nodes with a common parent, in the parse tree. As an example, a node in Figure 1 represents the interaction between “is,” “not,” and “heartwarming or entertaining.”

We define interaction as the deviation of a composition from linearity in a given sentence. As a result, all non-leaf nodes in the tree are expected to admit zero interaction under a linear model. This definition suggests that interaction can be quantified by studying how the inclusion of a common parent, the node representing the interaction, affects the coefficients of the linear approximation of the model.

Cook’s distance is a classic metric in linear regression that captures the influence of a data point [7]. It is defined as a constant multiple of the squared distance between the coefficient vectors before and after a data point is removed, where the distance metric is induced by the data matrix $X$:

$$D_j = \frac{\big( \hat\psi_{(j)} - \hat\psi \big)^T X^T X \big( \hat\psi_{(j)} - \hat\psi \big)}{p\, s^2},$$

where $\hat\psi_{(j)}$ and $\hat\psi$ are the least squares estimate with the $j$th data point deleted and the original least squares estimate, respectively, $p$ is the number of coefficients, and $s^2$ is the estimated error variance. A larger Cook’s distance indicates a larger influence of the corresponding data point.

In our setting, the data matrix $A$ is a Boolean matrix in which each row corresponds to a node of the tree, and an entry is one if and only if the word of the corresponding index lies in the subtree of the node. To capture the interaction at a non-leaf node (corresponding to some $S \in \mathcal{P}$), it does not suffice to delete only the corresponding row, because all of the node's ancestors contain the segment represented by the node as well. To deal with this issue, we compute the distance between the least squares estimate with the rows corresponding to the node and all of its ancestors deleted, and the least squares estimate with only the rows corresponding to the ancestors deleted:

$$D_S = d\big( \tilde\psi_S, \hat\psi_S \big), \tag{4}$$

where $\tilde\psi_S$ and $\hat\psi_S$ denote the estimates with all of the ancestors of $S$, excluding and including node $S$ itself, deleted.² Cook’s distance no longer has its statistical meaning here, as the normality assumption of the linear model no longer holds. A natural choice of $d$ is the Euclidean distance $d_2(\tilde\psi_S, \hat\psi_S) = \|\tilde\psi_S - \hat\psi_S\|_2$, which was also introduced by Cook [7]. One drawback of the Euclidean distance is that it is unable to capture the direction of an interaction. When this is an issue, we may use the signed distance $d_s(\tilde\psi_S, \hat\psi_S) = \sum_{i=1}^d \big( \tilde\psi_{S,i} - \hat\psi_{S,i} \big)$, which sums the influence of introducing the extra row on every coefficient of the linear model. We call the scores defined by $d_2$ and $d_s$ the absolute and signed LS-Tree interaction scores, respectively, as they are constructed from the LS-Tree value.

²Only entries corresponding to words within node $S$ differ between $\tilde\psi_S$ and $\hat\psi_S$, but we retain the remaining entries for notational simplicity.
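Both scores can be computed naively by re-solving the least squares problem with the relevant rows removed, which is a useful reference implementation before the fast recursion described next; a sketch (ours), where `parents[j]` gives the row index of node `j`'s parent:

```python
import numpy as np

def interaction_scores(A, y, parents):
    """Naive LS-Tree interaction scores by repeated least squares.

    A       -- (n, d) Boolean design matrix, one row per parse-tree node
    y       -- (n,) model evaluations v(S)
    parents -- parents[j] is the parent row index of node j (None for root)
    Returns per-node (absolute, signed) scores.
    """
    n = A.shape[0]
    abs_score, signed_score = np.zeros(n), np.zeros(n)
    for j in range(n):
        anc, p = set(), parents[j]
        while p is not None:              # collect all ancestors of node j
            anc.add(p)
            p = parents[p]
        with_j = [r for r in range(n) if r not in anc]   # drop ancestors only
        without_j = [r for r in with_j if r != j]        # also drop node j
        psi_with, *_ = np.linalg.lstsq(A[with_j], y[with_j], rcond=None)
        psi_without, *_ = np.linalg.lstsq(A[without_j], y[without_j], rcond=None)
        diff = psi_with - psi_without
        abs_score[j] = np.linalg.norm(diff)   # d_2, the absolute score
        signed_score[j] = diff.sum()          # d_s, the signed score
    return abs_score, signed_score
```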

We propose an iterative algorithm to efficiently compute the interaction score of each node of a tree with $n$ nodes. As a first step, $n$ model evaluations are performed, one evaluation per node. For a node $S$, we denote by $\mathcal{C}(S)$ the set of its children, by $A_S$ and $\bar A_S$ the data matrices excluding the ancestors of $S$, further excluding and including $S$ itself respectively, and by $x_S$ the row corresponding to node $S$. The interaction score of each node $S$ is a function of the pair $(\tilde\psi_S, \hat\psi_S)$. Denote $M_S = (A_S^T A_S)^{-1}$ and $\bar M_S = (\bar A_S^T \bar A_S)^{-1}$; observe that $\bar A_S = A_P$ for the parent $P$ of $S$, so $\bar M_S$ and $\tilde\psi_S$ are available from the recursion at the parent. For each non-leaf node $S$, $A_S^T A_S$ is of full rank and thus invertible. We show how $M_S$ and $\hat\psi_S$ can be computed from $\bar M_S$ and $\tilde\psi_S$. In fact, with an application of the Sherman-Morrison formula [29], we have

$$\bar M_S = \big( A_S^T A_S + x_S x_S^T \big)^{-1} = M_S - \frac{M_S x_S x_S^T M_S}{1 + x_S^T M_S x_S}. \tag{5}$$

Rearranging the terms in Equation (5), we have

$$M_S = \bar M_S + \frac{\bar M_S x_S x_S^T \bar M_S}{1 - x_S^T \bar M_S x_S}. \tag{6}$$

With another application of the Sherman-Morrison formula, we have

$$\hat\psi_S = \tilde\psi_S - \frac{\bar M_S x_S \big( v(S) - x_S^T \tilde\psi_S \big)}{1 - x_S^T \bar M_S x_S}. \tag{7}$$

For leaf nodes, the entry of $\hat\psi_S$ corresponding to the leaf's word is set to zero, with the remaining entries equal to those of $\tilde\psi_S$. This follows from the minimal-Euclidean-norm solution of Problem (1), obtained from the pseudoinverse of $A_S$: once the leaf and its ancestors are deleted, no remaining row involves the leaf's word. Consequently, the (signed) interaction score of a leaf equals the model evaluation on the leaf alone.

We summarize the derivation in Algorithm 1, which traverses the parse tree from root to leaves in a top-down fashion to compute the interaction score of each node. The number of nodes in a parse tree is linear in the number of words $d$, and each rank-one update costs $O(d^2)$, so Algorithm 1 has complexity $O(d^3)$, plus the complexity of parsing the sentence, which is linear in $d$ in our experiments, and $O(d)$ model evaluations. Figure 1 shows how Algorithm 1 assigns signed interaction scores to a given example.

Input: model $f$; sentence $x$.
Output: LS-Tree value $\psi$; interaction scores $D$.
function Detect-Interaction($f$, $x$)
     Find the parse tree $T$ of $x$.
     Find the collection $\mathcal{P}$ of subsets corresponding to the parse tree.
     for each node $S$ in $\mathcal{P}$ do
          Compute the model evaluation $v(S)$ for the corresponding subset $S$.
     end for
     Compute the LS-Tree value $\psi$ for the words via least squares.
     Find the root $R$ of $T$.
     Recursion($v$, $R$, $\bar M_R$, $\psi$, $D$)
end function
Algorithm 1 LS-Tree Interaction Detection

Input: evaluations $v$; node $S$; $\bar M_S$; $\tilde\psi_S$; scores $D$.
function Recursion($v$, $S$, $\bar M_S$, $\tilde\psi_S$, $D$)
     if $S$ is not a leaf then
          Compute $M_S$ and $\hat\psi_S$ via Equation (6) and Equation (7).
          Assign $D_S$ with $d_2(\tilde\psi_S, \hat\psi_S)$ or $d_s(\tilde\psi_S, \hat\psi_S)$.
          for each child $C$ in $\mathcal{C}(S)$ do
               Recursion($v$, $C$, $M_S$, $\hat\psi_S$, $D$)
          end for
     else
          Assign $D_S$ with $|v(S)|$ or $v(S)$.
     end if
end function
Algorithm 2 Recursion
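The two rank-one updates at the heart of the recursion are compact; the following sketch implements our reconstruction of Equations (6) and (7), with `M_bar` standing for $\bar M_S$ and `x`, `y` for the deleted row $x_S$ and its evaluation $v(S)$:

```python
import numpy as np

def downdate_inverse(M_bar, x):
    """Equation (6): (A_S^T A_S)^{-1} after deleting row x from the design,
    given M_bar = (A_bar^T A_bar)^{-1} computed with the row included."""
    Mx = M_bar @ x
    return M_bar + np.outer(Mx, Mx) / (1.0 - x @ Mx)

def downdate_estimate(M_bar, psi_bar, x, y):
    """Equation (7): least squares estimate after deleting the observation
    (x, y), given the estimate psi_bar computed with it included."""
    Mx = M_bar @ x
    return psi_bar - Mx * (y - x @ psi_bar) / (1.0 - x @ Mx)
```

Each call costs $O(d^2)$, which is what gives the top-down traversal its $O(d^3)$ total complexity.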

5 Experiments

We carry out experiments to analyze the performance of four different models: Bag of Words (BoW), word-based Convolutional Neural Network (CNN) [17], bidirectional Long Short-Term Memory network (LSTM) [14], and Bidirectional Encoder Representations from Transformers (BERT) [9], across three sentiment data sets of different sizes: Stanford Sentiment Treebank (SST) [32], IMDB Movie Reviews [23], and Yelp Reviews Polarity [39]. For an instance with multiple sentences, we parse each sentence separately and introduce an extra node as the common parent of all roots. Interactions between sentences are not considered in our experiments.

BoW fits a linear model on Bag-of-Words features. Both the CNN and the LSTM use 300-dimensional GloVe word embeddings [25]. The CNN is composed of three 1D convolutional layers with 100 filters each and kernel sizes 3, 4, and 5 respectively, concatenated and fed into a max-pooling layer followed by a hidden dense layer. The LSTM uses a bidirectional LSTM layer with 128 units in each direction. BERT pre-trains a deep bidirectional Transformer [35] on a large corpus of text by jointly conditioning on both left and right context in all layers; it has achieved state-of-the-art performance on a large suite of sentence-level and token-level tasks. See Table 1 for a summary of the data sets and the accuracies of the four models.

We use the Stanford constituency parser [11, 27, 40, 41] for all the experiments. It is a transition-based parser that is faster than chart-based parsers yet achieves comparable accuracy, by employing a set of shift-reduce operations and making use of non-local features.

5.1 Deviation from linearity

We quantify the deviation of three nonlinear models from a linear model via the proposed LS-Tree value and interaction scores, both for specific instances and on a data set.

The LS-Tree value can be interpreted as supplying the coefficients of the best linear model approximating the target model locally at each instance. The correlation between the LS-Tree value and the coefficients of the global linear model on Bag-of-Words (BoW) features can therefore be used as a measure of the nonlinearity of the target model at that instance. Table 3 shows three examples from SST, all correctly classified by both BERT and BoW. BERT has low and high correlation with the linear model on the first and second examples of Table 3, respectively. In particular, the top keywords, as ranked by the LS-Tree value, differ between the two models.

Data Set   Classes  Train Samples  Test Samples  Avg. Length  BoW   CNN   LSTM  BERT
SST [32]   2        6,920          872           19.7         0.82  0.85  0.85  0.93
IMDB [23]  2        25,000         25,000        325.6        0.94  0.90  0.88  0.93
Yelp [39]  2        560,000        38,000        136.2        0.94  0.95  0.96  0.97
Table 1: Statistics of the three data sets, together with the test accuracy (as a fraction) of each of the four models.
       BoW    CNN    LSTM   BERT
SST    1.000  0.591  0.580  0.465
IMDB   1.000  0.442  0.552  0.321
Yelp   1.000  0.683  0.684  0.476
Table 2: Average correlation of the LS-Tree values with the linear coefficients, comparable across different models on the same data set.
Figure 2: Visualization of parse trees of examples in Table 3. Nodes are colorized based on the signed interaction scores, red for the direction of positive class, and blue otherwise.

The average correlation with BoW across instances can be used as a measure of nonlinearity on a given data set. The average correlations of BoW, CNN, LSTM, and BERT with the linear model are shown in Table 2, which indicates that BERT is the most nonlinear of the four models. CNN is more nonlinear than LSTM on IMDB, but comparably nonlinear on SST and Yelp.

Sentence  Category  Correlation  Depth
“Even if you don’t think kissinger’s any more guilty of criminal activity than most contemporary statesmen, he’d sure make a courtroom trial great fun to watch.”  Positive  0.173  11
“The problem with this film is that it lacks focus.”  Negative  0.939  1
“Funny but perilously slight.”  Positive  0.938  4
Table 3: Examples from SST correctly classified by both BERT and BoW, with the correlation with the linear model and the depth of the top node listed. The top two words, ranked by the LS-Tree value (BERT) and by the linear coefficients (BoW), are colorized.
Figure 3: Average depth of top nodes as the number of the selected top nodes varies.

Correlation alone may not suffice to capture the nonlinearity of a model. For example, the third sentence of Table 3 has a relatively high correlation, but the bottom-left parse tree in Figure 2 indicates that the top interaction ranked by the signed interaction score is the node combining “funny” with “but perilously slight.” This indicates that the BERT model has captured the adversative conjunction, which BoW is not capable of. The ability to place top-ranked interactions at nodes closer to the top of a parse tree is an indication of the nonlinearity of the model. To quantify this ability, we define the depth of a node in the parse tree as the maximum distance of the node from the bottom:

$$\mathrm{depth}(S) = \begin{cases} 1, & \text{if } S \text{ is a leaf}, \\ 1 + \max_{C \in \mathcal{C}(S)} \mathrm{depth}(C), & \text{otherwise}. \end{cases}$$

For a linear model, all non-leaf nodes have zero interaction, and thus the top-ranked nodes are of depth 1 until all leaves with positive weights have been enumerated. The higher the depth of the top-ranked nodes, the more nonlinear the model is at the given instance.
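Equivalently, the depth can be computed by a short recursion over the tree (a minimal sketch; `children` maps a node id to the ids of its children):

```python
def depth(node, children):
    """Depth of a parse-tree node: a leaf has depth 1, and an internal
    node sits one level above its deepest child."""
    kids = children.get(node, [])
    return 1 if not kids else 1 + max(depth(c, children) for c in kids)

# Root 0 with children 1 and 2; node 2 has leaf children 3 and 4.
print(depth(0, {0: [1, 2], 2: [3, 4]}))  # -> 3
```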

The average depth of the top nodes ranked by interaction scores, averaged across instances, can be used as a measure of the nonlinearity of the model on a data set. Figure 3 compares these average depths for BoW, CNN, LSTM, and BERT on the three data sets as the number of selected top nodes varies. BoW serves as a baseline whose non-leaf nodes have zero interaction scores. We use the absolute interaction scores here, so as to capture all interactions, whether in the same or the opposite direction of the prediction. BERT is again the most capable of capturing deeper interactions, followed by CNN and LSTM. CNN turns out to be a more nonlinear model than LSTM on Yelp, which was not captured by correlation.

Dataset Model Avg. Score not but yet though although even though whereas except despite in spite of
SST BoW 0.153 0.000(6.318) 0.000(0.079) 0.000(2.005) 0.000(0.865) 0.000(2.222) 0.000(0.000) -(-) 0.000(4.280) 0.000(3.519) 0.000(0.000)
CNN 0.634 1.673(4.592) 1.694(1.444) 0.568(0.959) 0.213(0.735) 0.915(0.462) 0.626(0.407) -(-) 0.948(1.175) 1.452(4.270) 2.119(1.943)
LSTM 0.790 1.746(2.580) 1.502(0.453) 1.449(2.368) 1.153(1.094) 0.338(0.197) 1.794(0.998) -(-) 2.353(3.835) 1.256(1.818) 0.590(0.624)
BERT 1.238 1.714(4.383) 2.148(1.760) 1.669(3.120) 1.525(3.268) 1.741(3.256) 1.885(2.092) -(-) 1.156(3.331) 1.160(2.998) 0.864(2.352)
IMDB BoW 0.038 0.000(2.683) 0.000(0.263) 0.000(2.210) 0.000(1.473) 0.000(1.710) 0.000(0.000) 0.000(3.604) 0.000(1.342) 0.000(0.132) -(-)
CNN 0.424 1.050(0.819) 3.442(0.021) 1.689(0.295) 0.922(0.085) 1.036(0.071) 1.175(0.467) 0.469(1.064) 1.590(4.067) 0.363(0.434) -(-)
LSTM 0.126 0.960(3.087) 2.222(0.524) 1.500(0.238) 0.611(0.087) 0.492(1.270) 0.944(0.683) 1.222(3.865) 1.294(4.008) 0.286(0.508) -(-)
BERT 1.159 1.616(2.057) 3.390(1.800) 1.644(1.152) 1.371(2.061) 1.735(2.123) 1.457(1.557) 0.285(0.430) 1.421(2.060) 1.518(2.241) -(-)
Yelp BoW 0.035 0.000(8.488) 0.000(1.015) 0.000(3.553) 0.000(1.664) 0.000(1.128) 0.000(0.000) 0.000(0.536) 0.000(0.367) 0.000(1.213) -(-)
CNN 0.161 2.287(3.467) 2.454(0.932) 0.516(0.043) 0.988(0.435) 0.889(0.075) 0.789(0.621) 0.286(0.671) 0.522(2.529) 0.423(0.889) -(-)
LSTM 0.224 2.173(5.950) 1.712(1.676) 0.988(2.065) 0.984(1.310) 0.706(1.194) 0.559(0.483) 1.395(1.793) 0.344(1.408) 0.514(1.153) -(-)
BERT 0.746 1.384(2.106) 2.448(0.658) 0.781(0.184) 1.336(0.953) 0.596(0.615) 1.019(0.880) 0.095(0.162) 0.331(0.074) 1.041(0.414) -(-)
Table 4: Ratio of interaction scores of specific nodes to the average interaction score of a generic node. Ratios in parentheses are for nodes with adversative words alone. Ratios without parentheses are for their parent nodes where the adversative relation takes place.
Sentence Meaning BoW CNN LSTM BERT
… He said he couldn’t help. We had to walk while the snow blew in our faces. When we were almost there, we saw the shuttle pull out with the smoking shuttle driver in it, driving in the opposite direction, away from us. I can not believe how rude they were. during the time that 0.000(0.338) 0.781(0.300) 1.761(0.839) 0.062(0.092)
… I ordered a cappuccino. It tasted like milk and no coffee. I was exceptionally disappointed. So while the place has a great reputation, even they can screw it up if they don t pay attention to detail, and at this level they should never screw it up. I had a better cup at Martys Market for crying out loud! whereas (indicating a contrast) 0.000(0.338) 1.142(0.300) 2.155(0.839) 2.167(0.092)
Usually asking the server what is her favorite dish gets you a pretty good recommendation, but in this case, it was crap! The smoked brisket had that discoloration that happens to meat when it’s been sitting out for a while. And it wasn’t even tender!! Am I asking for too much? a period of time 0.000(0.338) 0.206(0.300) 0.465(0.839) 0.082(0.092)
Table 5: The word “while” in different contexts, together with the ratio of interaction scores of “while” nodes and their parents to the average score of a generic node. Ratios in parentheses are for nodes with “while” alone. Scores without parentheses are for their parents.

5.2 Adversative relations

Adversative words are those that express opposition or contrast. They often play an important role in determining the sentiment of an instance, by reversing the sentiment of a preceding or succeeding word, phrase, or statement. We focus on four types of adversative words: negation, which reverses the sentiment of a phrase or word (e.g., “not”); adversative coordinating conjunctions, which express opposition or contrast between two statements (e.g., “but” and “yet”); subordinating conjunctions indicating an adversative relationship (e.g., “though,” “although,” “even though,” and “whereas”); and prepositions that precede and govern nouns adversatively (e.g., “except,” “despite,” and “in spite of”).

In most cases, adversative words only function if they interact with their preceding or succeeding companion. In order to verify whether models are able to capture the adversative relationship, we examine the LS-Tree interaction scores of the parent nodes of these words.

We extract all instances that contain any of the above adversative words. Then, for each adversative word in an instance, we compute the interaction score of the node containing the word alone, and that of its parent node. A high interaction score on the node containing the adversative word alone indicates that the model inappropriately attributes a negative or positive sentiment to the word itself. A high interaction score on the parent node indicates that the model captures the interaction of the adversative word with its preceding or succeeding companion. To compare across models, we further compute the average interaction score of a generic node across all instances, and report the ratio of the average interaction score of each specific kind of node to the average score of a generic node for the respective models, as sketched below.
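The ratio computation can be sketched as follows; the per-node record layout (`tokens`, `parent`, `score`) is a hypothetical data structure for exposition, not the authors' code:

```python
def adversative_ratios(instances, word="not"):
    """Ratios of the kind reported in Table 4: average interaction score of
    parent nodes (and of word-alone nodes, shown there in parentheses),
    each divided by the average score of a generic node."""
    word_scores, parent_scores, all_scores = [], [], []
    for nodes in instances:               # one list of node records per instance
        for node in nodes:
            all_scores.append(node["score"])
            if node["tokens"] == [word]:  # node spans the adversative word alone
                word_scores.append(node["score"])
                if node["parent"] is not None:
                    parent_scores.append(nodes[node["parent"]]["score"])
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    generic = mean(all_scores)            # average score of a generic node
    return mean(parent_scores) / generic, mean(word_scores) / generic
```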

Table 4 reports the results on the three data sets. The ability to capture adversative relations varies across data sets for the different models. BERT takes the lead on SST and IMDB, perhaps helped by its pre-training on a large corpus, while CNN and LSTM catch up with BERT on Yelp, which has a larger amount of training data. On the other hand, all models assign high scores to nodes containing adversative words alone. This perhaps results from the uneven distribution of adversative words like “not” between the positive and negative classes. An additional observation is that BERT has the highest average score for a generic node across the three data sets, indicating that BERT is the most sensitive to words and interactions on average.

Some words have different meanings in different contexts, and it is interesting to investigate whether a model can distinguish the same word across contexts. The word “while” is such an example. Table 5 shows three Yelp reviews that include “while.” The scores of the parent nodes of “while” are higher than average when “while” carries an adversative meaning, and lower otherwise. This observation holds across CNN, LSTM, and BERT, with the sharpest distinction for BERT.

Figure 4: Histograms of interaction scores on training and test data for BERT on SST.
Figure 5: The figures in the first row plot the training and test loss of CNN, LSTM, and BERT respectively. The figures in the second row plot the corresponding average variance of interaction scores across instances in the training and test sets. The figures in the third row show p-values of permutation tests with randomly selected instances from the training and test sets.

5.3 Detecting overfitting

Overfitting happens when a model captures sampling noise in the training data while failing to capture the underlying relationship between inputs and outputs. Overfitting can be a problem for modern machine learning models such as deep neural networks, due to their expressive nature. To mitigate overfitting, one often splits the initial training set into a training set and a validation set, and uses the latter to obtain an estimate of generalization performance [18]. This wastes training data, depriving the model of the opportunity to learn from the labelled validation data. We observe that the LS-Tree interaction scores can be used to construct a diagnostic for overfitting that is computed solely from unlabelled data.

Figure 4 shows histograms of absolute interaction scores on small subsets of the training and test data of SST, for an overfitted BERT model. The scores are more spread out on the test data than on the training data. In fact, we have observed that this phenomenon holds on average across instances for an overfitted model: interaction scores of test instances have a larger variance on average than those of training instances when the model is overfitted, but a comparable variance otherwise. The observation also generalizes to other types of neural networks, including CNN and LSTM. Figure 5 shows the average variance on the training and test sets for the CNN, LSTM, and BERT models against the number of training epochs, together with the loss curves. We observe that overfitting occurs when the variances on the training and test sets diverge.

This observation suggests using the difference in the average variance of interaction scores between the training and test sets as a diagnostic for overfitting. In particular, a permutation test can be carried out under the null hypothesis of equal average variance. The resulting p-values are plotted against the number of training epochs in the third row of Figure 5. The p-values fall below the significance level when overfitting occurs, suggesting rejection of the null hypothesis as an early-stopping criterion for training.
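A minimal sketch of such a permutation test, assuming `train_var` and `test_var` are NumPy arrays of per-instance interaction-score variances:

```python
import numpy as np

def permutation_test(train_var, test_var, n_iter=1000, seed=0):
    """Two-sample permutation test for equal mean variance, using the
    absolute difference of sample means as the test statistic."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([train_var, test_var])
    n = len(train_var)
    observed = abs(train_var.mean() - test_var.mean())
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)               # random relabeling of the pooled sample
        hits += abs(pooled[:n].mean() - pooled[n:].mean()) >= observed
    return (hits + 1) / (n_iter + 1)      # p-value with add-one smoothing
```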

6 Discussion

We have proposed the LS-Tree value as a fundamental quantity for interpreting NLP models. This value leverages a constituency-based parser so that syntactic structure can play a role in determining interpretations. We have also presented an algorithm based on the LS-Tree value for detecting interactions between siblings of a parse tree. To the best of our knowledge, this is the first model-interpretation algorithm to quantify the interaction between words for arbitrary NLP models. We have applied the proposed algorithm to the problem of assessing the nonlinearity of common neural network models and the effect of adversative relations on the models. We have presented a permutation test based on the LS-Tree interaction scores as a diagnostic for overfitting.

References

  • Ba et al. [2014] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
  • Bach et al. [2015] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One, 10(7):e0130140, 2015.
  • Baehrens et al. [2010] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11:1803–1831, 2010.
  • Banzhaf III [1964] John F Banzhaf III. Weighted voting doesn’t work: A mathematical analysis. Rutgers L. Rev., 19:317, 1964.
  • Chen et al. [2019] Jianbo Chen, Le Song, Martin J. Wainwright, and Michael I. Jordan. L-shapley and C-shapley: Efficient model interpretation for structured data. In International Conference on Learning Representations, 2019.
  • Chen et al. [2015] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.
  • Cook [1977] R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 19(1):15–18, 1977.
  • Datta et al. [2016] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 598–617. IEEE, 2016.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Godin et al. [2018] F. Godin, K. Demuynck, J. Dambre, W. De Neve, and T. Demeester. Explaining character-aware neural networks for word-level prediction: Do they discover linguistic rules? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3275–3284. ACL, 2018.
  • Goldberg and Nivre [2012] Yoav Goldberg and Joakim Nivre. A dynamic oracle for arc-eager dependency parsing. Proceedings of COLING 2012, pages 959–976, 2012.
  • Gregor et al. [2015] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
  • Hammer and Holzman [1992] Peter L Hammer and Ron Holzman. Approximations of pseudo-boolean functions; applications to game theory. Zeitschrift für Operations Research, 36(1):3–21, 1992.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Karpathy et al. [2016] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. In International Conference on Learning Representations, 2016.
  • Katsev [2011] Ilya Katsev. The least square values for games with restricted cooperation. In Game Theory and Management, page 117, 2011.
  • Kim [2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. ACL, 2014.
  • Larson [1931] Selmer C Larson. The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22(1):45, 1931.
  • Lei et al. [2016] Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 107–117. ACL, 2016.
  • Li et al. [2016a] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. In Proceedings of NAACL-HLT, pages 681–691, 2016a.
  • Li et al. [2016b] Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016b.
  • Lundberg and Lee [2017] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
  • Maas et al. [2011] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 142–150. ACL, 2011.
  • Nowak [1997] Andrzej S Nowak. On an axiomatization of the Banzhaf value without the additivity axiom. International Journal of Game Theory, 26(1):137–141, 1997.
  • Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. ACL, 2014.
  • Ribeiro et al. [2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
  • Sagae and Lavie [2005] Kenji Sagae and Alon Lavie. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132. ACL, 2005.
  • Shapley [1953] Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
  • Sherman and Morrison [1950] Jack Sherman and Winifred J Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.
  • Shrikumar et al. [2017] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR, 06–11 Aug 2017.
  • Simonyan et al. [2014] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
  • Socher et al. [2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642. ACL, 2013.
  • Štrumbelj and Kononenko [2010] Erik Štrumbelj and Igor Kononenko. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11:1–18, 2010.
  • Sundararajan et al. [2017] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328, 2017.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • Xu and Saenko [2016] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
  • Xu et al. [2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
  • Yang et al. [2016] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
  • Zhang et al. [2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
  • Zhang and Clark [2009] Yue Zhang and Stephen Clark. Transition-based parsing of the chinese treebank using a global discriminative model. In Proceedings of the 11th International Conference on Parsing Technologies, pages 162–171. ACL, 2009.
  • Zhu et al. [2013] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 434–443, 2013.