Traditional rule-based NLP techniques can capture syntactic structures, while statistical NLP techniques, such as n-gram models, can heuristically integrate semantics of a natural language. Modern RNN-based models such as Long Short-Term Memory (LSTM) models are tasked with incorporating both semantic features from the statistical associations in their training corpus, and structural features generalized from the same.
Despite evidence that LSTMs can capture syntactic rules in artificial languages (gers2001context), it is unclear whether they are as capable in natural languages (linzen2016assessing; lakretz2019emergence) in the context of rules such as subject-verb number agreement, especially when not supervised for the particular feature. The uncertainty derives from this central question: does an LSTM language model’s apparent performance in subject-verb number agreement derive from statistical heuristics (like n-gram models) or from generalized knowledge (like rule-based models)?
Recent work has begun addressing this question (linzen2016assessing) in the context of language models: models tasked with modeling the likelihood of the next word following a sequence of words as expected in a natural language (see Figure 1, bottom). Subject-verb number agreement dictates that the verb associated with a given subject should match its number (e.g., in Figure 1, the verb “run” should match with the subject “boys”). Giulianelli2018 showed that the subject grammatical number is associated with various gates in an LSTM, and lakretz2019emergence showed that ablation (disabling activation) of an LSTM model at certain locations can reduce its accuracy at scoring verbs of the correct grammatical number.
Influence offers an alternative means of exploring properties like number agreement. We say an input is influential on an outcome when changing just the input and nothing else induces a change on the outcome. In English grammar, the number of a subject is influential on the number of its verb, in that changing the number of that subject while keeping all other elements of a sentence fixed would necessitate a change in the number of the verb. Algorithmic transparency literature offers formal definitions for empirically quantifying notions of influence for systems in general (datta2016algorithmic) and for deep neural networks specifically (leino2018influence; sundararajan2017axiomatic).
The mere fact that subject number is influential on verb number as output by an LSTM model is sufficient to conclude that it incorporates the agreement concept in some way but does not indicate whether it operates as a statistical heuristic or as a generalized rule. We address this question with influence paths, which decompose influence into a set of paths across the gates and neurons of an LSTM model. The approach has several elements:
Define an input parameter to vary the concept-specific quantity under study (e.g., the grammatical number of a particular noun, bottom-left node in Figure 1) and a concept-specific output feature on which to measure the parameter’s effect (e.g., number agreement with the parameterized noun, bottom-right node in Figure 1).
Apply a gradient-based influence method to quantify the influence of the concept parameter on the concept output feature; as per the chain rule, decompose the influence into model-path-specific quantities.
Inspect and characterize the distribution of influence across the model paths.
The paths demonstrate where relevant state information necessitated by the concept is kept, how it gets there, how it ends up being used to affect the model’s output, and how and where related concepts interfere.
Our approach is state-agnostic in that it does not require a priori an assumption about how or if the concept will be implemented by the LSTM. This differs from works on diagnostic classifiers where a representation of the concept is assumed to exist in the network’s latent space. The approach is also time-aware in that paths travel through cells/gates/neurons at different stages of an RNN evaluation. This differs from previous ablation-based techniques, which localize the number by clearing neurons at some position in an RNN for all time steps.
Our contributions are as follows:
We introduce influence paths, a causal account of the use of concepts of interest as carried by paths across gates and neurons of an RNN.
We demonstrate, using influence paths, that in a multi-layer LSTM language model, the concept of subject-verb number agreement is concentrated primarily on a single path (the red path in Figure 1), despite a variety of surrounding and intervening contexts.
We show that attractors (intervening nouns of opposite number to the subject) do not diminish the contribution of the primary subject-verb path, but rather contribute their own influence in the opposite direction along the equivalent primary attractor-verb path (the blue path in the figure). This can lead to incorrect number prediction if an attractor’s contribution overcomes the subject’s.
We corroborate and elaborate on existing results localizing subject number to the same two neurons which, in our results, lie on the primary path. We further extend and generalize prior compression/ablation results with a new path-focused compression test which verifies our localization conclusions.
Our results point to generalized knowledge as the answer to the central question. The number agreement concept is heavily centralized to the primary path despite the varieties of contexts. Further, the primary path’s contribution is undiminished even amongst interfering contexts; number errors are not attributable to lack of the general number concept but rather to sufficiently influential contexts pushing the result in the opposite direction.
Long short-term memory networks (LSTMs) (hochreiter1997long) have proven to be effective for sequence modeling tasks such as language modeling, and empirically, this architecture has been found to be optimal compared to other second-order RNNs (greff2017lstm). LSTMs utilize several types of gates and internal states including forget gates ($f$), input gates ($i$), output gates ($o$), cell states ($c$), candidate cell states ($\tilde{c}$), and hidden states ($h$). Each gate is designed to carry out a certain function, or to fix a certain drawback of the vanilla RNN architecture. For example, the forget gate is supposed to determine how much information from the previous cell state to retain or “forget”, helping to fix the vanishing gradient problem (hochreiter1998vanishing).
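As a concrete reference for the gates named above, the following is a minimal sketch of one step of a standard LSTM cell in NumPy. The stacked parameter layout (`W`, `U`, `b`), the toy sizes, and the function names are assumptions made for illustration; they are not taken from the language model studied in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the parameters for f, i, o, c~ (hypothetical layout)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # pre-activations, shape (4H,)
    f = sigmoid(z[0:H])                   # forget gate, values in (0, 1)
    i = sigmoid(z[H:2 * H])               # input gate, values in (0, 1)
    o = sigmoid(z[2 * H:3 * H])           # output gate, values in (0, 1)
    c_tilde = np.tanh(z[3 * H:4 * H])     # candidate cell state, values in (-1, 1)
    c = f * c_prev + i * c_tilde          # new cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c, {"f": f, "i": i, "o": o, "c_tilde": c_tilde}

rng = np.random.default_rng(0)
D, H = 3, 4                               # toy embedding and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c, gates = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
```

Note that the cell-state update is element-wise in `f`, `i`, and `c_tilde`, which is what makes the direct connection between cell states of neighbouring time steps sparse at the neuron level.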
Number Agreement in Language Models
The number agreement (NA) task, as described by linzen2016assessing, is an evaluation of a language model’s ability to properly match the verb’s grammatical number with its subject. This evaluation is performed on sentences specifically designed for the exercise, with zero or more words between the subject and the main verb, termed the context. The task for sentences with non-empty contexts will be referred to as long-term number agreement.
“Human-level” performance for this task can be achieved with a 2-layer LSTM language model (Gulordavaa), indicating that the language model incorporates grammatical number despite being trained only for the more general word prediction task. Attempts to explain or localize the number concept within the model include lakretz2019emergence, where ablation of neurons is applied to locate specific neurons where such information is stored; and Giulianelli2018 and hupkes2018visualisation, where diagnostic classifiers are trained on gate activations to predict the number of the subject, revealing at which gates or timesteps the number concept exhibits itself. These works also look at the special cases involving attractors, i.e., intervening nouns with grammatical number opposite to that of the subject (deemed instead helpful nouns if their number agrees with the subject), such as the word “tree” in Figure 1. Both frameworks provide explanations as to why attractors lower the performance of NA tasks. However, they tend to focus on the activation patterns of gates or neurons without justifying their causal relationships with the concept of grammatical number, and do not explicitly identify the exact temporal trajectory of how the number of the subject influences the number of the verb.
Other relevant studies that look inside RNN models to locate specific linguistic concepts include visualization techniques such as (karpathy2015visualizing), and explanations for supervised tasks involving LSTMs such as sentiment analysis (murdoch2018beyond).
Attribution methods quantitatively measure the contribution of each of a function’s individual inputs to its output. Gradient-based attribution methods compute the gradient of a model with respect to its inputs to describe how important each input is towards the output predictions. These methods have been applied to assist in explaining deep neural networks, predominantly in the image domain (leino2018influence; sundararajan2017axiomatic; bach2015pixel; simonyan13saliency). Some such methods are also axiomatically justified to provide a causal link between inputs (or intermediate neurons) and the output.
As a starting point in this work, we consider Integrated Gradients (IG) (sundararajan2017axiomatic). Given a baseline, $\mathbf{x}_0$, the attribution for each input at point, $\mathbf{x}$, is the path integral taken from the baseline to $\mathbf{x}$ of the gradients of the model’s output with respect to its inputs. The baseline establishes a neutral point from which to make a counterfactual comparison; the attribution of a feature can be interpreted as the share of the model’s output that is due to that feature deviating from its baseline value. By integrating the gradients along the linear interpolation from the baseline to $\mathbf{x}$, IG ensures that the attribution given to each feature is sensitive to effects exhibited by the gradient at any point between the baseline and instance $\mathbf{x}$.
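For reference, the IG attribution of the $i$-th input feature can be written out explicitly. In the expression below, $F$ denotes the model output, $\mathbf{x}_0$ the baseline, $\mathbf{x}$ the input point, and $\alpha$ the interpolation coefficient; these symbol names are chosen here for illustration and follow the usual presentation of IG rather than notation fixed by this paper.

```latex
\mathrm{IG}_i(\mathbf{x}) \;=\; (x_i - x_{0,i}) \int_{0}^{1}
  \frac{\partial F\big(\mathbf{x}_0 + \alpha(\mathbf{x} - \mathbf{x}_0)\big)}{\partial x_i}\, d\alpha
```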
leino2018influence generalize IG to better focus attribution on concepts other than just model outputs, by use of a quantity of interest (QoI) and a distribution of interest (DoI). Their measure, Distributional Influence, is given by Definition 1. The QoI is a function of the model’s output expressing a particular output behavior of the model to calculate influence for; in IG, this is fixed as the model’s output. The DoI specifies a distribution over which the influence should faithfully summarize the model’s behavior; the influences are found by taking an expected value over DoI.
Definition 1 (Distributional Influence).
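With quantity of interest, $q$, and distribution of interest, $D$, the influence, $\chi$, of the inputs on the quantity of interest is:
$$\chi \stackrel{\text{def}}{=} \mathbb{E}_{\mathbf{x} \sim D}\left[\frac{\partial q}{\partial \mathbf{x}}(\mathbf{x})\right]$$
The directed path integral used by IG can be implemented by setting the DoI to a uniform distribution over the line from the baseline to $\mathbf{x}$: $D = \text{Uniform}(\overline{\mathbf{x}_0 \mathbf{x}})$, for baseline, $\mathbf{x}_0$, and then multiplying by $\mathbf{x} - \mathbf{x}_0$. Conceptually, by multiplying by $\mathbf{x} - \mathbf{x}_0$, we are measuring the attribution, i.e., the contribution to the QoI, of $\mathbf{x} - \mathbf{x}_0$ by weighting its features by their influence. We use the framework of leino2018influence in this way to define our measure of attribution for NA tasks in Section 3.
Distributional Influence can be approximated by sampling according to the DoI. In particular, when using $D = \text{Uniform}(\overline{\mathbf{x}_0 \mathbf{x}})$ as noted above, Definition 1 can be computationally approximated with a sum of $n$ intervals as in IG:
$$\chi \approx \frac{1}{n} \sum_{i=1}^{n} \frac{\partial q}{\partial \mathbf{x}}\left(\mathbf{x}_0 + \frac{i}{n}(\mathbf{x} - \mathbf{x}_0)\right)$$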
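The approximation can be illustrated with a minimal sketch on a toy quantity of interest whose gradient is known in closed form. The function `q`, its gradient `grad_q`, and the helper `distributional_influence` are hypothetical names introduced only for this example, and the sketch evaluates the gradient at interval midpoints, which is one of several possible Riemann-sum conventions.

```python
import numpy as np

def q(x):
    # Toy quantity of interest (hypothetical): a smooth scalar function of two inputs.
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def grad_q(x):
    # Closed-form gradient of the toy q above.
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

def distributional_influence(grad, x0, x, n=1000):
    """Average the gradient over n midpoints of the segment from x0 to x."""
    alphas = (np.arange(n) + 0.5) / n
    points = x0 + alphas[:, None] * (x - x0)
    return np.mean([grad(p) for p in points], axis=0)

x0 = np.zeros(2)                       # baseline
x = np.array([1.0, 2.0])               # instance
chi = distributional_influence(grad_q, x0, x)
attribution = (x - x0) * chi           # weight each feature's deviation by its influence
```

For this toy function the attributions sum to `q(x) - q(x0)`, which is the completeness property that IG-style attributions are expected to satisfy.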
Other related works include fiacco2019deep, which employs the concept of neuron paths based on cofiring of neurons instead of influence, also on different NLP tasks from ours.
Our method for computing influence paths begins with modeling a relevant concept, such as grammatical number, in the influence framework of leino2018influence (Definition 1) by defining a quantity of interest that corresponds to the grammatical number of the verb, and defining a component of the input embedding that isolates the subject’s grammatical number (Section 3.1). We then decompose the influence measure along the relevant structures of LSTM (gates or neurons) as per standard calculus identities to obtain a definition for influence paths (Section 3.2).
3.1 Measuring Number Agreement
For the NA task, we view the initial fragment containing the subject as the input, and the word distribution at the position of its corresponding verb as the output.
Formally, each instance in this task is a sequence of $d$-dimensional word embedding vectors, $\mathbf{w} \stackrel{\text{def}}{=} \langle w_i \rangle_i$, containing the subject and the corresponding verb, potentially with intervening words in between. We assume the subject is at position $t$ and the verb at position $t+n$. The output score of a word, $w$, at position $i$ will be written $f_i(w)$. If $w$ has a grammatical number, we write $w^+$ and $w^-$ to designate $w$ with its original number and the equivalent word with the opposite number, respectively.
Quantity of Interest
We instrument the output score with a QoI measuring the agreement of the output’s grammatical number to that of the subject:
Definition 2 (Number Agreement Measure).
Given a sentence, $\mathbf{w}$, with verb, $w_{t+n}$, whose correct form (w.r.t. grammatical number) is $w^+_{t+n}$, the quantity of interest, $q$, measures the correctness of the grammatical number of the verb:
$$q(\mathbf{w}) \stackrel{\text{def}}{=} f_{t+n}\left(w^+_{t+n}\right) - f_{t+n}\left(w^-_{t+n}\right)$$
In plain English, $q$ captures the weight that the model assigns to the correct form of $w_{t+n}$ as opposed to the weight it places on the incorrect form. Note that the number agreement concept could have reasonably been measured using a different quantity of interest. For example, one alternative is to consider the scores of all vocabulary words of the correct number and of the incorrect number in the positive and negative terms, respectively. However, in our preliminary experiments, this alternative did not result in meaningful changes to the results reported in later sections.
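A minimal sketch of this quantity of interest, assuming the model exposes a vector of output scores at the verb position indexed by a vocabulary mapping. The vocabulary, the score values, and the function name `number_agreement_qoi` are made up for illustration.

```python
import numpy as np

def number_agreement_qoi(scores, vocab, correct_form, incorrect_form):
    """Score of the correctly numbered verb minus score of the incorrectly numbered one."""
    return float(scores[vocab[correct_form]] - scores[vocab[incorrect_form]])

# Hypothetical toy vocabulary and output scores at the verb position.
vocab = {"run": 0, "runs": 1, "tree": 2}
scores = np.array([2.0, 0.5, -1.0])

q = number_agreement_qoi(scores, vocab, correct_form="run", incorrect_form="runs")
prediction_is_correct = q > 0   # the model prefers the correctly numbered verb
```

Under this reading, a sentence counts as correctly handled in the NA task exactly when the quantity is positive.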
Distribution of Interest
We also define a component of the embedding of the subject that captures its grammatical number, and a distribution over the inputs that allows us to sensitively measure the influence of this concept on our chosen quantity of interest. Let $w^0$ be the word embedding midway between its numbered variants, i.e., $w^0 = \frac{w^+ + w^-}{2}$. Though this vector will typically not correspond to any English word, we interpret it as a number-neutral version of $w$. Various works show that linear arithmetic on word embeddings of this sort preserves meaningful word semantics, as demonstrated in analogy parallelograms (mikolov2013distributed). Finally, given a sentence, $\mathbf{w}$, let $\mathbf{w}^0_t$ be the sentence $\mathbf{w}$, except with the word embedding $w_t$ replaced with its neutral form $w^0_t$. We see that $\mathbf{w} - \mathbf{w}^0_t$ captures the part of the input corresponding to the grammatical number of the subject, $w_t$.
Definition 3 (Grammatical Number Distribution).
Given a singular (or plural) noun, $w_t$, in a sentence, $\mathbf{w}$, the distribution density of sentences, $D_t$, exercising the noun’s singularity (or plurality) linearly interpolates between the neutral sentence, $\mathbf{w}^0_t$, and the given sentence, $\mathbf{w}$:
$$D_t \stackrel{\text{def}}{=} \text{Uniform}\left(\overline{\mathbf{w}^0_t \mathbf{w}}\right)$$
If $w_t$ is singular, our counterfactual sentences span $w_t$ from its number-neutral form $w^0_t$ all the way to its singular form $w_t = w^+_t$. We thus call this distribution a singularity distribution. Were $w_t$ plural instead, we would refer to the distribution as a plurality distribution. Using this distribution of sentences as our DoI thus allows us to measure the influence of $\mathbf{w} - \mathbf{w}^0_t$ (the grammatical number of a noun at position $t$) on our quantity of interest sensitively (in the sense of the axiom of sensitivity that sundararajan2017axiomatic define for IG).
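The construction of the neutral sentence and of samples from the distribution of interest can be sketched as follows. The array layout (one row per word), the random toy embeddings, and the helper names `neutral_embedding` and `sample_doi` are assumptions made for this illustration only.

```python
import numpy as np

def neutral_embedding(w_original, w_opposite):
    """Midpoint between a word's embedding and that of its opposite-number form."""
    return 0.5 * (w_original + w_opposite)

def sample_doi(sentence, t, w_opposite, n, rng):
    """Draw n sentences uniformly on the segment from the neutral sentence to the given one.

    sentence: array of shape (T, d), one embedding per word; only position t varies.
    """
    neutral = sentence.copy()
    neutral[t] = neutral_embedding(sentence[t], w_opposite)
    alphas = rng.uniform(0.0, 1.0, size=n)
    return neutral[None] + alphas[:, None, None] * (sentence - neutral)[None]

rng = np.random.default_rng(0)
sentence = rng.normal(size=(5, 8))        # 5 words, 8-dim embeddings (toy values)
w_opposite = rng.normal(size=8)           # embedding of the opposite-number noun (toy values)
samples = sample_doi(sentence, t=1, w_opposite=w_opposite, n=16, rng=rng)
```

Every sampled sentence agrees with the original at all positions except the noun under study, so any change in the quantity of interest across samples is attributable to that noun's grammatical number.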
Subject-Verb Number Agreement
Putting things together, we define our attribution measure.
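Definition 4 (Subject-Verb Number Agreement Attribution).
With the QoI, $q$, of Definition 2 and the DoI, $D_t$, of Definition 3 (where $\mathbf{w}^0_t$ is the sentence $\mathbf{w}$ with the noun at position $t$ replaced by its number-neutral form), the attribution, $\alpha$, of a noun’s grammatical number on subject-verb number agreement is:
$$\alpha(\mathbf{w}) \stackrel{\text{def}}{=} \left(\mathbf{w} - \mathbf{w}^0_t\right) \odot \mathbb{E}_{\mathbf{w}' \sim D_t}\left[\frac{\partial q}{\partial \mathbf{w}}(\mathbf{w}')\right]$$
Essentially, the attribution measure weights the features of the subject’s grammatical number by their Distributional Influence, $\chi$. Because $D_t$ is a uniform distribution over the line segment between $\mathbf{w}^0_t$ and $\mathbf{w}$, as with IG, the attribution can be interpreted as each feature’s net contribution to the change in the QoI, $q(\mathbf{w}) - q(\mathbf{w}^0_t)$, as $\sum_i \alpha(\mathbf{w})_i = q(\mathbf{w}) - q(\mathbf{w}^0_t)$ (i.e., Definition 4 satisfies the axiom that sundararajan2017axiomatic term completeness).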
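The completeness property follows from the fundamental theorem of calculus applied along the interpolation segment. A short derivation, writing $\alpha$ for the attribution, $q$ for the quantity of interest, $\mathbf{w}^0_t$ for the number-neutral sentence, $i$ for an index over all embedding coordinates, and $\beta$ for the interpolation coefficient (the last two being symbols introduced here for illustration), and assuming $q$ is differentiable along the segment:

```latex
\begin{aligned}
\sum_i \alpha(\mathbf{w})_i
 &= \sum_i \left(\mathbf{w} - \mathbf{w}^0_t\right)_i
    \int_0^1 \frac{\partial q}{\partial \mathbf{w}_i}
    \big(\mathbf{w}^0_t + \beta(\mathbf{w} - \mathbf{w}^0_t)\big)\, d\beta \\
 &= \int_0^1 \frac{d}{d\beta}\,
    q\big(\mathbf{w}^0_t + \beta(\mathbf{w} - \mathbf{w}^0_t)\big)\, d\beta \\
 &= q(\mathbf{w}) - q(\mathbf{w}^0_t)
\end{aligned}
```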
In Figure 1, for instance, this definition measures the attribution from the plurality of the subject (“boys”), towards the model’s prediction of the correctly numbered verb (“run”) versus the incorrectly numbered verb (“runs”). Later in this paper we will also investigate the attribution of intervening nouns on this same quantity. We expect the input attribution to be positive for all subjects and helpful nouns, and negative for attractors, which can be verified by the columns of Table 1 (the details of this experiment are introduced in Section 4).
3.2 Influence Paths
Input attribution as defined by IG (sundararajan2017axiomatic) provides a way of explaining a model by highlighting the input dimensions with large attribution towards the output. Distributional Influence (leino2018influence) with a carefully chosen QoI and DoI (Definition 4) further focuses the influence on the concept at hand, grammatical number agreement. Neither, however, demonstrates how these measures are conveyed by the inner workings of a model. In this section we define a decomposition of the influence into paths of a model, thereby assigning attribution not just to inputs, but also to the internal structures of a given model.
We first define arbitrary deep learning models as computational graphs, as in Definition 5. We then use this graph abstraction to define a notion of influence for a path through the graph. We posit that any natural path decomposition should satisfy the following conservation property: the sum of the influence of each path from the input to the output should equal the influence of the input on the QoI. We then observe that the chain rule from calculus offers one such natural decomposition, yielding Definition 6.
Definition 5 (Model).
A model is an acyclic graph with a set of nodes, edges, and activation functions associated with each node. The output of a node, $n$, on input $\mathbf{x}$ is $n(\mathbf{x}) \stackrel{\text{def}}{=} f_n(n_1(\mathbf{x}), \cdots, n_m(\mathbf{x}))$ where $n_1, \cdots, n_m$ are $n$’s predecessors and $f_n$ is its activation function. If $n$ does not have predecessors (it is an input), its activation is $f_n(\mathbf{x})$. We assume that the domains and ranges of all activation functions are real vectors of arbitrary dimension.
We will write $n_1 \to n_2$ to denote an edge (i.e., $n_1$ is a direct predecessor of $n_2$), and $n_1 \to^* n_2$ to denote the set of all paths from $n_1$ to $n_2$. The partial derivative of the activation of $n_2$ with respect to the activation of $n_1$ will be written $\frac{\partial n_2}{\partial n_1}$.
This view of a computation model is an extension of network decompositions from attribution methods using the natural concept of “layers” or “slices” (dhamdhere2018important; leino2018influence; bach2015pixel). This decomposition can be tailored to the level of granularity we wish to expose. Moreover, in RNN models where no single and consistent “natural layer” can be found due to the variable-length inputs, a more general graph view provides the necessary versatility.
Definition 6 (Path Influence).
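Expanding Definition 4 using the chain rule, the influence of input node, $s$, on target node, $t$, in a model, $G$, is:
$$\chi_s \stackrel{\text{def}}{=} \mathbb{E}_{\mathbf{x} \sim D}\left[\frac{\partial t}{\partial s}(\mathbf{x})\right] = \mathbb{E}_{\mathbf{x} \sim D}\left[\sum_{p \in (s \to^* t)} \prod_{(n_1 \to n_2) \in p} \frac{\partial n_2}{\partial n_1}(\mathbf{x})\right] = \sum_{p \in (s \to^* t)} \mathbb{E}_{\mathbf{x} \sim D}\left[\prod_{(n_1 \to n_2) \in p} \frac{\partial n_2}{\partial n_1}(\mathbf{x})\right]$$
Each summand is the influence, $\chi^p_s$, of the corresponding path $p$.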
Note that the same LSTM can be modeled with different graphs to achieve a desired level of abstraction. We will use two particular levels of granularity: a coarse gate-level abstraction where nodes are LSTM gates, and a fine neuron-level abstraction where nodes are the vector elements of those gates. Though the choice of abstraction granularity has no effect on the represented model semantics, it has implications on graph paths and the scale of their individual contributions in a model.
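To make the path decomposition concrete, the sketch below applies it to a small scalar computational graph at a single input point (a point-mass DoI, so the expectation is trivial). The graph, its activation functions, and the helper names are invented for this illustration and are unrelated to the LSTM studied here. The sum of the per-path products of edge derivatives should match the total derivative of the target with respect to the input, which is the conservation property.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(s):
    """Toy graph: s -> a, s -> b, a -> b, a -> t, b -> t."""
    a = math.tanh(s)
    g = sigmoid(s)
    b = g * a                 # b depends on s (through g) and on a
    t = a + 3.0 * b
    return a, g, b, t

def edge_derivatives(s):
    """Partial derivative of each node with respect to each direct predecessor."""
    a, g, b, t = forward(s)
    return {
        ("s", "a"): 1.0 - a * a,
        ("s", "b"): a * g * (1.0 - g),
        ("a", "b"): g,
        ("a", "t"): 1.0,
        ("b", "t"): 3.0,
    }

def all_paths(edges, src, dst):
    """Enumerate all paths from src to dst in an acyclic graph."""
    if src == dst:
        return [[dst]]
    paths = []
    for (u, v) in edges:
        if u == src:
            paths += [[src] + rest for rest in all_paths(edges, v, dst)]
    return paths

def path_influences(s):
    """Influence of each path: the product of edge derivatives along it."""
    d = edge_derivatives(s)
    return {
        "->".join(p): math.prod(d[(u, v)] for u, v in zip(p, p[1:]))
        for p in all_paths(d, "s", "t")
    }

influences = path_influences(0.7)
total = sum(influences.values())
```

With a non-trivial DoI, the same per-path products would be averaged over sampled inputs before being compared across paths.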
Gate-level and Neuron-level Paths
We define the set of gate-level nodes to include: $\{f^l_t, i^l_t, o^l_t, c^l_t, \tilde{c}^l_t, h^l_t : t < T, l < L\}$, where $T$ is the number of time steps (words) and $L$ is the number of LSTM layers. The node set also includes an attribution-specific input node ($x_t$) and an output node (the QoI). An example of this is illustrated in Figure 2. We exclude intermediate calculations (the solid nodes of Figure 2), as their inclusion does not change the set of paths in a graph. We can also break down each vector node into scalar components and further decompose the gate-level model into a neuron-level one: $\{f^l_{t,j}, i^l_{t,j}, o^l_{t,j}, c^l_{t,j}, \tilde{c}^l_{t,j}, h^l_{t,j} : t < T, l < L, j < H\}$, where $H$ is the size of each gate vector. This decomposition results in an exponentially large number of paths. However, since many functions between gates in an LSTM are element-wise operations, neuron-level connections between many neighboring gates are sparse.
While the neuron-level path decomposition can theoretically be performed on the whole network, in practice we choose to specify a gate-level path first, then further decompose that path into neuron-level paths. We also collapse selected vector nodes, allowing us to further localize a concept on a neuron level while avoiding an explosion in the number of paths. The effect of this pipeline will be empirically justified in Section 4.
In this section we apply influence path decomposition to the NA task. We investigate major gate-level paths and their influence concentrations in Section 4.2. We further show the relations between these paths and the paths carrying grammatical number from intervening nouns (i.e., attractors and helpful nouns) in Section 4.3. In both we also investigate high-attribution neurons along primary paths, allowing us to compare our results to prior work.
4.1 Dataset and Model
We study the exact combination of language model and NA datasets used in the closely related prior work of lakretz2019emergence. The pre-trained language model of Gulordavaa and lakretz2019emergence is a 2-layer LSTM trained on Wikipedia articles. The number agreement datasets of lakretz2019emergence are several synthetically generated datasets varying in syntactic structures and in the number of nouns between the subject and verb.
For example, nounPP refers to sentences containing a noun subject followed by a prepositional phrase such as in Figure 1. Each NA task has subject number (and intervening noun number if present) realizations along singular (S) and plural (P) forms. In listings we denote subject number (S or P) first and additional noun (if any) number second. Details including the accuracy of the model on the NA tasks are summarized by lakretz2019emergence. Our evaluation replicates part of Table 2 in said work.
4.2 Decomposing Number Agreement
We begin with the attribution of subject number on its corresponding verb, as decomposed per Definition 6. Among all NA tasks, the gate-level path carrying the most attribution follows the same pattern, with differences only in the size of contexts. With indices $t$ and $t+n$ referring to the subject and verb respectively, this path, which we term the primary path of subject-verb number agreement, is as follows:
$$x_t \to \tilde{c}^0_t \to c^0_t \to h^0_t \to \tilde{c}^1_t \to c^1_t \to \cdots \to c^1_{t+n} \to h^1_{t+n} \to \text{QoI}$$
The primary path is represented by the red path in Figure 2. The influence first passes through the temporary cell state $\tilde{c}^0_t$, the only non-sigmoid gate, which is capable of storing more information than the sigmoid gates, since $\tilde{c} \in (-1, 1)$ while the sigmoid gates lie in $(0, 1)$. Then the path passes through $c^0_t$ and $h^0_t$, and similarly to $c^1_t$ through $\tilde{c}^1_t$, jumping from the first to the second layer. The path then stays at $c^1$, through the direct connections between cell states of neighbouring time steps, as though it is “stored” there without any interference from subsequent words. As a result, this path is intuitively the most efficient and simplest way for the model to encode and store a “number bit.”
The extent to which this path can be viewed as primary is measured by two metrics. The results across a subset of syntactic structures and number conditions mirroring those in lakretz2019emergence are shown in Table 1. We include 3 representative variations of the task. The metrics are:
$t$-value: probability that a given path has greater attribution than a uniformly sampled path on a uniformly sampled sentence.
Positive/Negative Share (Share): expected (over sentences) fraction of total positive (or negative) attribution assigned to the given positive (or negative) path.
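Both metrics can be estimated from a matrix of path attributions with one row per sentence and one column per path. The sketch below is one possible reading of the two definitions above: the toy attribution matrix and the function names are hypothetical, the comparison excludes the given path from the set of sampled paths (an assumption, since the definition does not say), and only the positive share is shown.

```python
import numpy as np

def greater_than_random_path(attr, p):
    """Fraction of (sentence, other path) pairs in which path p has greater attribution."""
    others = np.delete(attr, p, axis=1)
    return float(np.mean(attr[:, [p]] > others))

def positive_share(attr, p):
    """Mean over sentences of path p's fraction of the total positive attribution."""
    positive = np.clip(attr, 0.0, None)
    totals = positive.sum(axis=1)
    safe_totals = np.where(totals > 0, totals, 1.0)   # avoid division by zero
    return float(np.mean(positive[:, p] / safe_totals))

# Toy attributions: 2 sentences (rows) x 3 paths (columns).
attr = np.array([[3.0, 1.0, -1.0],
                 [2.0, 2.5, -0.5]])

rank_prob = greater_than_random_path(attr, p=0)
share = positive_share(attr, p=0)
```

The negative share would be computed symmetrically by clipping at zero from above instead of below.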
Per Table 1 (From Subject, Primary Path), we make our first main observation:
The same primary path consistently carries the largest amount of positive attribution across all contexts as compared to all other paths.
Even in the case of its smallest share (nounPPAdv), the 3% share is large when taking into account more than 40,000 paths in total. Sentences with singular subjects (top part of Table 1) have a slightly stronger concentration of attribution in the primary path than plural subjects (bottom part of Table 1), possibly due to English plural (infinitive) verb forms occurring more frequently than singular forms, thus less concentration of attribution is needed due to the “default signal” in place.
Table 1 (header only): Task | C | From Subject: Primary Path, Primary Neuron | From Intervening Noun: Primary Path, Primary Neuron
We further decompose the primary path into influence passing through each neuron. Since only connections between second layer cell states are sparse, we only decompose the segment of the primary path from $c^1_t$ to $c^1_{t+n}$, resulting in a total of 650 (the number of hidden units) neuron-level paths. (We leave the non-sparse decompositions for future work.) The path for neuron $j$, for example, is represented as:
$$x_t \to \tilde{c}^0_t \to c^0_t \to h^0_t \to \tilde{c}^1_t \to c^1_{t,j} \to \cdots \to c^1_{t+n,j} \to h^1_{t+n} \to \text{QoI}$$
To compare the attribution of an individual neuron with all other neurons, we employ a $t$-value similar to the one above, where each neuron-level path is compared against the other neuron-level paths.
The results of the neuron-level analysis are shown in Table 1 (From Subject, Primary Neuron). Out of the 650 neuron-level paths in the gate-level primary path, we discover two neurons with consistently the most attribution (neurons 125 and 337 of the second layer). This indicates the number concept is concentrated in only two neurons.
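The sparsity that makes this neuron-level decomposition tractable can be sketched directly. In a standard LSTM the cell state is updated element-wise, so the derivative of a cell-state neuron with respect to the same neuron at the previous step, along the direct connection, is simply that neuron's forget-gate activation. The function below is a simplified illustration at a single input point: its name, its arguments, and the toy numbers are assumptions, and the incoming and outgoing derivative vectors are taken as given rather than computed from a real model.

```python
import numpy as np

def neuron_path_influence(d_c_in, forget_gates, d_q_out):
    """Per-neuron influence along the direct cell-state connection.

    d_c_in:       shape (H,), derivative of each cell-state neuron at the subject
                  position along the gate-level path, already contracted with the input.
    forget_gates: shape (n, H), forget-gate activations at the n following time steps.
    d_q_out:      shape (H,), derivative of the QoI with respect to each cell-state
                  neuron at the verb position (through the hidden state).
    """
    carry = np.prod(forget_gates, axis=0)   # product of diagonal Jacobians diag(f_k)
    return d_c_in * carry * d_q_out

# Toy values for H = 2 neurons and n = 2 intervening steps.
d_c_in = np.array([2.0, 1.0])
forget_gates = np.array([[0.5, 1.0],
                         [0.5, 0.9]])
d_q_out = np.array([1.0, -1.0])

per_neuron = neuron_path_influence(d_c_in, forget_gates, d_q_out)
gate_level_total = per_neuron.sum()   # neuron-level paths sum to the gate-level path
```

Because each neuron only connects to itself along this segment, the number of neuron-level paths equals the number of hidden units rather than growing with the context length.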
Comparison with lakretz2019emergence
Not coincidentally, both neurons match the units found through ablation by lakretz2019emergence, who use the same model and dataset (their neurons 776 and 988 are neurons 125 and 337 of the second layer, respectively). This agreement verifies, to some extent, that the neurons found through influence paths are functionally important. However, the $t$-values shown in Table 1 show that both neurons 125 and 337 are influential regardless of the subject number, whereas lakretz2019emergence assign a subject number to each of these two neurons due to their disparate effect in lowering accuracy in ablation experiments. One possible reason is that the ablation mechanism used in lakretz2019emergence assumes that a “neutral number state” can be represented by zero-activations for all gates, while in reality the network may encode the neutral state differently for different gates.
Another major distinction of our analysis from lakretz2019emergence regards simple cases with no word between subjects and verbs. Unlike lakretz2019emergence, who claim that the two identified neurons are “long-term neurons”, we discover that these two neurons are also the only neurons important for short-term number agreement. This localization cannot be achieved by diagnostic classifiers used by lakretz2019emergence, indicating that the signal can be better uncovered using influence-based paths rather than association-based methods such as ablation.
4.3 Decomposing from Intervening Nouns
Next we focus on NA tasks with intervening nouns and make the following observation:
The primary subject-verb path still accounts for the largest positive attribution in contexts with either attractors or helpful nouns.
A slightly worse NA task performance (lakretz2019emergence) in cases of attractors (SP, PS) indicates that they interfere with prediction of the correct verb. In contrast, we also observe that helpful nouns (SS, PP) contribute positively to the correct verb number (although they should not from a grammar perspective).
Primary Path from the Intervening Noun
We adapt our number agreement concept (Definition 2) by focusing the DoI on the intervening noun, thereby allowing us to decompose its influence on the verb number not grammatically associated with it. In Table 1 (From Intervening Noun) we discover a similar primary path from the intervening noun:
Attribution towards verb number from intervening nouns follows the same primary path as the subject but is of lower magnitude and reflects either positive or negative attribution in cases of helpful nouns or attractors, respectively.
This disparity in magnitude is expected since the language model possibly identifies the subject as the head noun through prepositions such as “behind” in Figure 1, while still needing to track the number of the intervening noun in possible clausal structures. This need is weaker than the need to track the number of the subject, possibly because in English, intervening clauses are rarer than intervening non-clauses. Similar arguments can be made for neuron-level paths.
4.4 Model Compression
Though the primary paths are the highest contributors to NA tasks, it is possible that collections of associated non-primary paths account for more of the verb number concept. We gauge the extent to which the primary paths alone are responsible for the concept with compression/ablation experiments. We show that the computations relevant to a specific path alone are sufficient in maintaining performance for the NA task. We compress the model by specifying node sets to preserve, and intervene on the activations of all other nodes by setting their activations to constant expected values (average over all samples). We choose the expected values instead of full ablation (setting them to zero), as ablation would nullify the function of sigmoid gates. For example, to compress the model down to the red path in Figure 2, we only calculate the activations of the gates on that path for each sample, while setting the activations of all other nodes to their average values over all samples. In Table 2, we list variations of the compression schemes based on the following preserved node sets:
For example, column in Table 2 shows the accuracy when the compressed model only retains the primary path from both the subject and the intervening noun while the computations of all other paths are set to their expected values; while in , all paths but the paths in are kept.
We observe that the best compressed model is , where the primary path from the intervening noun is left out; it performs even better than the original model; the increase comes from the cases with attractors (PS, SP). This indicates that eliminating the primary path from the attractor improves the model. The next best models apart from are and , where primary paths are kept. Compressed models without the primary subject-verb path (, , ) have performances close to random guessing.
Accuracy under path-based model compression tests corroborates that primary paths account for most of the subject number agreement concept of the LSTM.
By comparing the SP and PS rows of , , , and , we observe the effect of attractors in misguiding the model into giving wrong predictions. Similarly, we see that helpful nouns (SS, PP) help guide the models to make more accurate predictions, though this is not grammatically justified.
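The compression scheme described in this section, preserving a chosen node set and fixing every other node to its average activation, can be illustrated on a toy network. The sketch below uses a two-layer feed-forward network rather than an LSTM, and the function name, weights, and preserved sets are all hypothetical; it is meant only to show the mean-replacement intervention, not to reproduce the experiments of Table 2.

```python
import numpy as np

def compressed_forward(X, W1, W2, keep):
    """Toy two-layer network; hidden units not in `keep` are fixed to their mean over samples."""
    hidden = np.tanh(X @ W1)                           # shape (samples, units)
    mean_act = hidden.mean(axis=0, keepdims=True)      # expected activation per unit
    keep_set = set(keep)
    mask = np.array([j in keep_set for j in range(hidden.shape[1])])
    compressed = np.where(mask, hidden, mean_act)      # preserve kept units, average the rest
    return compressed @ W2

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                            # 6 toy samples
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))

full = compressed_forward(X, W1, W2, keep=range(5))    # nothing compressed
only_two = compressed_forward(X, W1, W2, keep=[1, 3])  # preserve two hidden units
none_kept = compressed_forward(X, W1, W2, keep=[])     # every unit set to its mean
```

Replacing activations with their means rather than zeros keeps multiplicative gates near their typical operating point, which is the motivation given above for preferring expected values to full ablation.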
The combination of finely tuned attribution and gradient decomposition lets us investigate the handling of the grammatical number agreement concept attributed to paths across LSTM components. The concentration of attribution to a primary path and two primary cell state neurons and its persistence in a variety of short-term and long-term contexts, even with confounding attractors, demonstrates that the concept’s handling is, to a large degree, general and localized. Though the heuristic decision-making aspect of an LSTM is present in the large quantities of paths with non-zero influence, their overall contribution to the concept is insignificant as compared to the primary path. Node-based compression results further corroborate these conclusions.
We note, however, that our results are based on datasets exercising the agreement concept in contexts of a limited size. We speculate that the primary path’s attribution diminishes with the length of the context, which would suggest that at some context size, the handling of number will devolve to be mostly heuristic-like with no significant primary paths. Though our present datasets do not pose computational problems, the number of paths, at both the neuron and the gate level, is exponential with respect to context size. Investigating longer contexts, the diminishing dominance of the primary path, and the requisite algorithmic scalability requirements are elements of our ongoing work.
We also note that our method can be expanded to explore number agreement in more complicated sentences with clausal structures, or other syntactic/semantic signals such as coreference or gender agreement.
This work was developed with the support of NSF grant CNS-1704845 as well as by DARPA and the Air Force Research Laboratory under agreement number FA8750-15-2-0277. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of DARPA, the Air Force Research Laboratory, the National Science Foundation, or the U.S. Government. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this work.