Probing for the Usage of Grammatical Number

A central quest of probing is to uncover how pre-trained models encode a linguistic property within their representations. An encoding, however, might be spurious, i.e., the model might not rely on it when making predictions. In this paper, we try to find encodings that the model actually uses, introducing a usage-based probing setup. We first choose a behavioral task which cannot be solved without using the linguistic property. Then, we attempt to remove the property by intervening on the model's representations. We contend that, if an encoding is used by the model, its removal should harm the performance on the chosen behavioral task. As a case study, we focus on how BERT encodes grammatical number, and on how it uses this encoding to solve the number agreement task. Experimentally, we find that BERT relies on a linear encoding of grammatical number to produce the correct behavioral output. We also find that BERT uses a separate encoding of grammatical number for nouns and verbs. Finally, we identify in which layers information about grammatical number is transferred from a noun to its head verb.



1 Introduction

Pre-trained language models have enabled researchers to build models that achieve impressive performance on a wide array of natural language processing (NLP) tasks (devlin2018bert; liu; raffel2020exploring). How these models encode and use the linguistic information necessary to perform these tasks, however, remains a mystery. Over recent years, a number of works have tried to demystify the inner workings of various pre-trained language models (alain2016understanding; adi2017finegrained; elazar2021amnesic), but no comprehensive understanding of how the models work has emerged. Such analysis methods are typically termed probing, and are methodologically diverse.

In our assessment, most research in probing can be taxonomized into three distinct paradigms. In the first paradigm, diagnostic probing, researchers typically train a supervised classifier to predict a linguistic property from the models’ representations. High accuracy is then interpreted as an indication that the representations encode information about the property (alain2016understanding; adi2017finegrained; hupkes; conneau-etal-2018-cram). A second family of methods, behavioral probing, consists in observing a model’s behavior directly, typically studying the model’s predictions on hand-picked evaluation datasets (linzen-etal-2016-assessing; goldberg; warstadt-etal-2020-learning; ettinger-2020-bert). Finally, causal probing methods rely on interventions to evaluate how specific components impact a model’s predictions (giulianelli-etal-2018-hood; vig; elazar2021amnesic).

In this paper, we will investigate how linguistic properties are encoded in a model’s representations, where we use the term encoding to mean the subspace on which a model relies to extract—or decode—the information. While probing has been extensively used to investigate whether a linguistic property is encoded in a set of representations, it still cannot definitively answer whether a model actually uses a certain encoding. Diagnostic probes, for instance, may pick up on a spurious encoding of a linguistic property, i.e., an encoding that allows us to extract our target property from the representation, but which the model being probed may not actually use to make a prediction.

Combining the three paradigms above, we instead seek to find encodings that are actually used by a pre-trained model, which we term functional encodings. To that end, we take a usage-based perspective on probing. Under this perspective, a researcher first identifies a linguistic property to investigate (e.g., grammatical number), and selects a behavioral task which requires knowledge of this property (e.g., selecting a verb’s inflection which agrees in number with its subject). The researcher then performs a causal intervention with the goal of removing a specific encoding (of the linguistic property under consideration) from the model’s representations. If the encoding is a functional encoding, i.e., an encoding that the model indeed uses to make a prediction, then the intervention should prevent the model from solving the task.

Finally, once a functional encoding is discovered, we can use it to track how the property’s information flows through the model under investigation.

As a case study, we examine how BERT (devlin2018bert) uses grammatical number to solve a number agreement task. In English, grammatical number is a binary morpho-syntactic property: a word is plural or singular. In turn, subject–verb number agreement is a behavioral task; it inspects whether a model can predict the correct verbal inflection given its subject’s number. To solve the task, a model thus requires information about the grammatical number of the subject and the verb. Our goal is to find how the model encodes this information when using it to make predictions. In other words, we want to find the structure from which the model decodes number information when solving the task.

In our experiments, we make three findings. First, our experiments provide us with strong evidence that BERT relies on a linear functional encoding of grammatical number to solve the number agreement task. Second, we find that nouns and verbs do not have a shared functional encoding of number; in fact, BERT relies on disjoint sub-spaces to extract their information. Third, our usage-based perspective allows us to identify where number information (again, as used by our model to make predictions) is transferred from a noun to its head verb. Specifically, we find that this transfer occurs between BERT’s and layers, and that most of this information is passed indirectly through other tokens in the sentence.

2 Paradigms in Probing

A variety of approaches to probing have been proposed in the literature. In this paper, we taxonomize them into three paradigms: (i) diagnostic probing, (ii) behavioral probing, and (iii) causal probing.

Diagnostic Probing.

Traditionally, probing papers focus on training supervised models on top of fixed pre-trained representations (adi2017finegrained; maudslay-etal-2020-tale). The general assumption behind the work is that, if a probe achieves high accuracy, then the property of interest is encoded in the representations. Many researchers have expressed a preference for linear classifiers in probing (alain2016understanding; ettinger-etal-2016-probing; hewitt-manning-2019-structural), suggesting that a less complex classifier gives us more insight into the model. Others, however, called this criterion into question (tenney2018what; tenney-etal-2019-bert; voita-titov-2020-information; papadimitriou-etal-2021-deep; sinha2021masked; pimentel-etal-2020-pareto; pimentel-cotterell-2021-bayesian). Notably, hewitt-liang-2019-designing proposed that complex classifiers may learn to extract a property by themselves, and may thus not reflect any true pattern in the representations. Further, pimentel-etal-2020-information showed that, under a weak assumption, contextual representations encode as much information as the original sentences. Ergo, it is not clear what we can conclude from diagnostic probing alone.

Behavioral Probing.

Another probing paradigm analyzes the behavior of pre-trained models on carefully curated datasets. By avoiding the use of diagnostic probes, this paradigm does not fall prey to the criticism above: tasks are directly performed by the model, and thus must reflect the pre-trained models’ acuity. One notable example is linzen-etal-2016-assessing, who evaluate a language model’s syntactic ability via a careful analysis of a number agreement task. By controlling the evaluation, linzen-etal-2016-assessing could disentangle the model’s syntactic knowledge from a heuristic based on linear ordering. In a similar vein, a host of recent work makes use of carefully designed test sets to perform behavioral analysis (ribeiro-etal-2020-beyond; warstadt-etal-2020-learning; warstadt2020can; lovering2021predicting; newman2021refining). While behavioral probing often yields useful insights, the paradigm typically treats the model itself as a black box, thus failing to explain how individual components of the model work.

Causal Probing.

Finally, a third probing paradigm relies on causal interventions (vig; tucker-etal-2021-modified; ravfogel-etal-2021-counterfactual). In short, the researcher performs causal interventions that modify parts of the network during a forward pass (e.g., a layer’s hidden representations) to determine their function. For example, vig fix a neuron’s value while manipulating the model’s input to evaluate this neuron’s role in mediating gender bias. Relatedly, elazar2021amnesic propose a method to erase a target property from a model’s intermediate layers. They then analyze the effect of such interventions on a masked language model’s outputs.

3 Probing for Usage

Under our usage-based perspective, our goal is to find a functional encoding—i.e., an encoding that the model actually uses when making predictions. We achieve this by relying on a combination of the paradigms discussed in § 2. To this end, we first need a behavioral task that requires the model to use information about the target property. We then perform a causal intervention to try to remove this property’s encoding. We explain both these components in more detail now.

Behavioral Task.

We first require a behavioral task which can only be solved with information about the target property. The choice of task and target property are thus co-dependent. Further, we require our model to perform well on this task. On one hand, if the model cannot achieve high performance on the behavioral task, we cannot be sure the model encodes the target property, e.g., grammatical number, at all. On the other hand, if the model can perform the task, it must make use of the property.

Causal Intervention.

Our goal in this work is to answer a causal question: Can we identify a property’s functional encoding? We thus require a way to intervene in the model’s representations. If a model relies on an encoding to make predictions, removing it should harm the model’s performance on the behavioral task. It follows that, by measuring the impact of our interventions on the model’s behavioral output, we can assess whether our model was indeed decoding information from our targeted encoding.
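The logic of this section can be illustrated with a deliberately tiny sketch. Everything in it is hypothetical and chosen for illustration only: the toy `model` reads a single feature, the `erase` helper stands in for a causal intervention, and the four examples are made up. The point is only that removing a functional encoding lowers task accuracy, while removing a spurious one (a feature that correlates with the label but is never read by the model) does not.

```python
def accuracy(predict, examples):
    """Fraction of (input, label) pairs that `predict` gets right."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


def accuracy_drop(predict, predict_with_intervention, examples):
    """Behavioral cost of an intervention: accuracy before minus accuracy after."""
    return accuracy(predict, examples) - accuracy(predict_with_intervention, examples)


def model(x):
    """Toy model (an assumption for this sketch): it only ever reads feature 0."""
    return 1 if x[0] > 0 else 0


def erase(index):
    """Return a version of the toy model whose input has one feature zeroed out."""
    def intervened(x):
        x = list(x)
        x[index] = 0.0
        return model(x)
    return intervened


# Both features correlate with the label, but only feature 0 is used by the model.
examples = [((1.0, 1.0), 1), ((-1.0, -1.0), 0), ((2.0, 0.5), 1), ((-0.5, -2.0), 0)]

drop_functional = accuracy_drop(model, erase(0), examples)  # erasing the used feature
drop_spurious = accuracy_drop(model, erase(1), examples)    # erasing the unused feature
print(drop_functional, drop_spurious)
```

A diagnostic probe trained on feature 1 alone would reach perfect accuracy on these examples, which is exactly why extractability by itself cannot distinguish the two cases.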

4 Grammatical Number and its Usage

The empirical portion of this paper focuses on a study of how BERT encodes grammatical number in English. We choose number as our object of study because it is a well-understood morpho-syntactic property in English. Thus, we are able to formulate simple hypotheses about how BERT passes information about number when performing number agreement. We use the number agreement task of linzen-etal-2016-assessing as our behavioral task.

4.1 The Number Agreement Task

In English, a verb and its subject agree in grammatical number (CorbettGG2006A). Consider, for instance, the sentences:

(1) a. The boy goes to the movies.
    b. *The boy go to the movies.
    c. The boy that holds the keys goes to the movies.
    d. *The boy that holds the keys go to the movies.

In the above sentences, both (1a) and (1c) are grammatical, but (1b) and (1d) are not; this is because, in the latter two sentences, the verb go does not agree in number with its singular subject the boy.

The subject–verb number agreement task evaluates a model’s ability to predict the correct verbal inflection, measuring its preference for the grammatical sentence. In this task, the probed model is typically asked to predict the verb’s number given its context. The model is then considered successful if it assigns a larger probability to the correct verb inflection:

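  • context: The boy that holds the keys [MASK] to the movies.

  • success: $p(\text{goes} \mid \text{context}) > p(\text{go} \mid \text{context})$

  • failure: $p(\text{goes} \mid \text{context}) < p(\text{go} \mid \text{context})$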

In this setting, the subject is usually called the cue of the agreement, and the verb is called the target.
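The success criterion above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the paper: the `vocab` mapping and the scores in `toy_logits` are made up, and in practice the scores would be the masked language model's outputs at the [MASK] position.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def agreement_success(mask_logits, vocab, correct, incorrect):
    """True if the correct inflection is more probable than the incorrect one.

    mask_logits: one score per vocabulary item at the [MASK] position.
    vocab: hypothetical mapping from word to its index in mask_logits.
    """
    probs = softmax(mask_logits)
    return probs[vocab[correct]] > probs[vocab[incorrect]]


# Toy example with made-up scores (not real BERT outputs).
vocab = {"goes": 0, "go": 1, "keys": 2}
toy_logits = [2.0, 0.5, -1.0]
print(agreement_success(toy_logits, vocab, "goes", "go"))
```

Because only the two candidate inflections are compared, the outcome does not depend on how much probability the model gives to unrelated words.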

Examples similar to the above are often designed to study the impact of distractors (the word keys in the last two example sentences above) on the model’s ability to predict the correct verb form. Success on the task is usually taken as evidence that a model is able to track syntactic dependencies. In this regard, this phenomenon has been studied in a variety of settings to investigate the syntactic abilities of neural language models (gulordava-etal-2018-colorless; marvin-linzen-2018-targeted; newman2021refining; lasri). In this work, however, we do not use this task to make claims about the syntactic abilities of the model, as done by linzen-etal-2016-assessing. Instead, we employ it as a case study to investigate how BERT encodes and uses grammatical number.

4.2 Related Work on Grammatical Number

A number of studies have investigated how grammatical number is encoded in neural language models.[1] Most of this work, however, focuses on diagnostic probes (klafka-ettinger-2020-spying; torroba-hennigen-etal-2020-intrinsic). These studies are thus agnostic about whether the probed models actually use the encodings of number they discover. Some authors, however, do consider the relationship between how the model encodes grammatical number and its predictions. Notably, giulianelli-etal-2018-hood use a diagnostic probe to investigate how an LSTM encodes number in a subject–verb number agreement setting. Other approaches (lakretz-etal-2019-emergence; finlayson-etal-2021-causal) have been proposed to apply interventions at the neuron level and track their effect on number agreement. In this work, we look for functional encodings of grammatical number, i.e., encodings which are in fact used by our probed model when solving the task.

[1] We focus on grammatical number here. There is, however, also a vast literature investigating how BERT encodes number from a numeracy point of view (wallace-etal-2019-nlp; geva-etal-2020-injecting; spithourakis-riedel-2018-numeracy).

5 From Encoding to Usage

We discuss how to identify and remove an encoding from a set of contextual representations using diagnostic probing. Our use of diagnostic probing is thus twofold. For a model to rely on an encoding of our property when making predictions, the property must be encoded in its representations. We thus first use diagnostic probing to measure the amount of information a representation contains about the target linguistic property. In this sense, diagnostic probing serves to sanity-check our experiments—if we cannot extract information from the representations, there is no point in going forward with our analysis. Second, we make use of diagnostic probing in the context of amnesic probing (elazar2021amnesic), which allows us to determine whether this probe finds a functional or a spurious encoding of the target property.

5.1 Estimating Extractable Information

In this section, we discuss how to estimate the amount of extractable number information in our probed model’s representations. This is the probing perspective taken by pimentel-etal-2020-information and hewitt-etal-2021-conditional in their diagnostic probing analyses. The crux of our analysis relies on the fact that the encoding extracted by diagnostic probes is not necessarily the functional encoding used by our probed model. Nevertheless, for a model to use a property in its predictions, this property should at least be extractable, which is true due to the data processing inequality. In other words, extractability is a necessary, but not sufficient, condition for a property to be used by the model.

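We quantify the amount of extractable information in a set of representations in terms of a diagnostic probe’s $\mathcal{V}$-information (xu2020theory), where the $\mathcal{V}$-information is a direct measure of the amount of extractable information in a random variable. We compute the $\mathcal{V}$-information as:[2]

$$\mathrm{I}_{\mathcal{V}}(R \rightarrow N) = \mathrm{H}_{\mathcal{V}}(N) - \mathrm{H}_{\mathcal{V}}(N \mid R) \tag{1}$$

where $R$ and $N$ are, respectively, representation-valued and number-valued random variables, $\mathcal{V}$ is a variational family determined by our diagnostic probe, and the $\mathcal{V}$-entropies are defined as:

$$\mathrm{H}_{\mathcal{V}}(N) = \inf_{q \in \mathcal{V}} \mathbb{E}_{n \sim p(n)} \left[ -\log q(n) \right] \tag{2}$$

$$\mathrm{H}_{\mathcal{V}}(N \mid R) = \inf_{q \in \mathcal{V}} \mathbb{E}_{n, \mathbf{r} \sim p(n, \mathbf{r})} \left[ -\log q(n \mid \mathbf{r}) \right] \tag{3}$$

Further, if we denote our analyzed model’s (i.e., BERT’s) hidden representations as:

$$\mathbf{h}_{t,l} = \mathrm{BERT}_{l}(\text{sentence})_{t} \tag{4}$$

we define our linear diagnostic probe as:

$$p_{\boldsymbol{\theta}}(n_t = 1 \mid \mathbf{h}_{t,l}) = \sigma\left(\boldsymbol{\theta}^{\top} \mathbf{h}_{t,l} + b\right) \tag{5}$$

where $t$ is a sentence position and $l$ is a layer, $n_t$ is the binary number label associated with the word at position $t$, $\sigma$ is the sigmoid function, $\boldsymbol{\theta}$ is a real-valued column parameter vector and $b$ is a bias term. In this case, we can define our variational family as $\mathcal{V} = \{p_{\boldsymbol{\theta}}\}$, i.e., the set of all such linear probes.

[2] See App. B for a detailed description of $\mathcal{V}$-information.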
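To make the quantities above concrete, the following is a minimal, dependency-free sketch of how the extractable information could be estimated with a linear probe. It is an illustration under stated assumptions, not the authors' implementation: the representations are synthetic two-dimensional vectors rather than BERT states, the probe is fit with plain batch gradient descent, entropies are reported in bits, and the estimate is computed on the training data itself (a real experiment would use held-out data). The function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_prob(theta, b, h):
    """Probability of the positive label under the linear probe."""
    return sigmoid(sum(t * x for t, x in zip(theta, h)) + b)


def train_linear_probe(reps, labels, lr=0.5, epochs=200):
    """Fit theta and b by batch gradient descent on the cross-entropy."""
    d, n = len(reps[0]), len(reps)
    theta, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_theta, grad_b = [0.0] * d, 0.0
        for h, y in zip(reps, labels):
            err = probe_prob(theta, b, h) - y
            for j in range(d):
                grad_theta[j] += err * h[j] / n
            grad_b += err / n
        theta = [t - lr * g for t, g in zip(theta, grad_theta)]
        b -= lr * grad_b
    return theta, b


def cross_entropy_bits(probs, labels, eps=1e-12):
    """Average negative log-likelihood of binary labels, in bits."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log2(p) if y == 1 else math.log2(1.0 - p)
    return total / len(labels)


def v_information(reps, labels):
    """Estimate H_V(N) - H_V(N | R) for the family of linear probes."""
    # Without access to R, the best probe is a constant: the label frequency.
    prior = sum(labels) / len(labels)
    h_n = cross_entropy_bits([prior] * len(labels), labels)
    theta, b = train_linear_probe(reps, labels)
    h_n_given_r = cross_entropy_bits([probe_prob(theta, b, h) for h in reps], labels)
    return h_n - h_n_given_r


# Synthetic data: the first dimension carries the label, the second is noise.
random.seed(0)
labels = [i % 2 for i in range(200)]
reps = [[(1.0 if y else -1.0) + random.gauss(0, 0.3), random.gauss(0, 1)] for y in labels]
info = v_information(reps, labels)
print(f"estimated V-information: {info:.3f} bits (label entropy is 1 bit here)")
```

Dividing the estimate by the label entropy gives the fraction of uncertainty about number that the representations eliminate, which is a convenient way to compare layers.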

5.2 Intervening on the Representations

We now discuss how we perform a causal intervention to prevent the analyzed model from using a given encoding. The goal is to damage the model and make it “forget” a property’s information. This allows us to analyze whether that encoding actually influences the probed model’s predictions, i.e., whether this encoding is indeed functional. To this end, we employ amnesic probing (elazar2021amnesic).[3] In short, we first learn a linear diagnostic classifier, following eq. 5. We then compute the projector onto the kernel (or null) space of this linear transform $\boldsymbol{\theta}$, shown below:

$$\mathbf{P}_{\ker(\boldsymbol{\theta})} = \mathbf{I} - \frac{\boldsymbol{\theta} \boldsymbol{\theta}^{\top}}{\boldsymbol{\theta}^{\top} \boldsymbol{\theta}} \tag{6}$$

where $\mathbf{I}$ is the identity matrix and $\boldsymbol{\theta}$ is the probe’s parameter vector. By iterating this process, we store a set of parameter vectors $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_k$ and their associated projectors until we are unable to extract the property. The composition of these projectors makes it possible to remove all linearly extractable number information from the analyzed representations. We can then apply the resulting composition to the said representations to get a new set of vectors:

$$\tilde{\mathbf{h}}_{t,l} = \mathbf{P}_{\ker(\boldsymbol{\theta}_k)} \cdots \mathbf{P}_{\ker(\boldsymbol{\theta}_1)} \, \mathbf{h}_{t,l} \tag{7}$$

where $\mathbf{h}_{t,l}$ is the representation at sentence position $t$ and layer $l$.

[3] In particular, this intervention consists in applying iterative null-space projection to the representations, originally proposed by ravfogel-etal-2020-null. We note that ravfogel2022linear and ravfogel2022adversarial recently proposed two new methods to remove information from a set of representations.
After learning the projectors, we can measure how erasing a layer’s encoding impacts: (i) the subsequent layers, and (ii) our model’s performance on the number agreement task. Removing a functional encoding of grammatical number should cause a performance drop on the number agreement task. Further, looking at both (i) and (ii) allows us to make a connection between the amount of information we can extract from our probed model’s layers and its behavior. We are thus able to determine whether the encodings revealed by our diagnostic probes are valid from a usage-based perspective: are they actually used by the probed model on a task that requires them?[4]

[4] Our method differs from amnesic probing mostly in that all our analyses are based on a behavioral task which we know a priori to require the property we investigate.
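The null-space projection described in this section can be sketched as follows. This is a simplified illustration rather than the procedure used in the experiments: the probe directions are given by hand instead of being learned iteratively, and they are assumed to be mutually orthogonal (as is the case when each new probe is trained on representations that were already projected), so that composing the single-direction projectors removes every direction. The helper names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(h, theta):
    """Apply P = I - theta theta^T / (theta^T theta) to the vector h."""
    scale = dot(theta, h) / dot(theta, theta)
    return [x - scale * t for x, t in zip(h, theta)]


def apply_projectors(h, thetas):
    """Compose the null-space projectors of several probe directions."""
    for theta in thetas:
        h = project_out(h, theta)
    return h


# Toy example: two orthogonal probe directions in a 3-dimensional space.
thetas = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
h = [1.0, 2.0, 3.0]
h_erased = apply_projectors(h, thetas)
print(h_erased)
print([dot(theta, h_erased) for theta in thetas])
```

After the projection, a linear probe along any of the removed directions sees a constant input, so it can no longer separate singular from plural along those directions.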

6 Experimental Setup


We perform our analysis on the number agreement dataset of linzen-etal-2016-assessing, which consists of sentences extracted from Wikipedia. In this dataset, each sentence has been labeled with the position of the cue and target, along with their grammatical number. We assume here that this dataset is representative of the number agreement task; this may not be true in general, however.


In our experiments, we probe BERT (devlin2018bert).[5] Specifically, BERT is a bidirectional transformer model with 12 layers, trained using a masked language modeling objective. As BERT has been shown to perform well on this dataset (goldberg), we already know that our probed model passes our first requirement; BERT does use number information in its predictions.

[5] We focus on bert-base-uncased, as implemented in the transformers library (wolf-etal-2020-transformers).

Distinguishing Nouns and Verbs.

While number is a morpho-syntactic property common to nouns and verbs, we do not know a priori if BERT relies on a single subspace to encode number in their representations. Though it is possible for BERT to use the same encoding, it is equally plausible that each part of speech would get its own number encoding. This leads us to perform our analyses using independent sets of representations for nouns and verbs, as well as a mixed set which merges both of them. Further, verbs are masked when performing the number agreement task, so their representations differ from those of unmasked verbs. Ergo, we analyze both unmasked and masked tokens at the target verb’s position, which for simplicity we call verbs and masked verbs, respectively. This leaves us with four probed categories: nouns, verbs, masked verbs, and mixed.

Figure 1: The amount of $\mathcal{V}$-information BERT representations hold about grammatical number, as estimated with linear diagnostic probes.
Figure 2: Cosine similarities between the learned parameter vectors of our diagnostic probes. The matrices display similarities between different layers for a given word category (top), and across categories (bottom).

7 Experiments and Results

In our experiments, we focus on answering two questions: (i) How is number information encoded in BERT’s representations? and (ii) How is number information transferred from a noun to its head verb for the model to use it on the behavioral task? We answer question (i) under both extractability and usage-based perspectives. In § 7.1, we present our sanity-check experiments that demonstrate that grammatical number is indeed linearly extractable from BERT’s representations. In § 7.2 and § 7.3, we use our causal interventions: we identify BERT’s functional encodings of number; and analyze whether these functional encodings are shared across parts of speech. Finally, in § 7.4 and § 7.5 we investigate question (ii), taking a closer look at the layers in which information is passed.

7.1 What do diagnostic probes say about number?

Fig. 1 presents diagnostic probing results in all four of our analyzed settings.[6] A priori, we expect that verbs’ and nouns’ representations should already contain a large amount of $\mathcal{V}$-information about their grammatical number at the type-level. As expected, we see that the $\mathcal{V}$-information is near its maximum for both verbs and nouns in all layers; this means that nearly 100% of the uncertainty about grammatical number is eliminated given BERT’s representations. Further, the mixed category results also reach a maximal $\mathcal{V}$-information, which indicates that it is possible to extract information linearly about both categories at the same time. On the other hand, the $\mathcal{V}$-information of masked verbs is 0 at the non-contextual layer and it progressively grows as we get to the upper layers.[7] Non-contextual layers thus contain no information about the number of a masked verb, as the mask token contains no information about its replaced verb’s number. As we go to BERT’s deeper layers, the $\mathcal{V}$-information steadily rises, with nearly all of the original uncertainty eliminated in the mid layers. This suggests that masked verbs’ representations acquire number information in the first 7 layers.

[6] We further present accuracy results in App. A.

[7] We note that, in Fig. 1, layer 0 corresponds to the non-contextual representations (i.e., the word embeddings before being summed to BERT’s position embeddings).

However, from these results alone we cannot confirm whether the encoding that nouns and verbs use for number is shared or disjoint. We thus inspect the encoding found by our diagnostic probes, evaluating the cosine similarity between their learned parameters (ignoring the probes’ bias terms here). If there is a single shared encoding across categories, these cosine similarities should be high. If not, they should be roughly zero. Fig. 2 (left) shows that nouns and verbs might encode number along different directions. Specifically, noun representations on the first 6 layers seem to have a rather opposite encoding from verbs, while the later layers are mostly orthogonal. Further, while masked verbs and verbs do not seem to share an encoding in the first few layers, they are strongly aligned from layer 6 on (Fig. 2; center).
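The comparison of probe directions described above comes down to a cosine similarity between parameter vectors. The sketch below uses made-up two-dimensional vectors in place of learned probe parameters; the variable names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two probe parameter vectors (biases ignored)."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)


# Made-up directions standing in for learned probe parameters.
theta_noun = [1.0, 0.0]
theta_verb_shared = [2.0, 0.0]     # same direction: a shared encoding
theta_verb_disjoint = [0.0, 3.0]   # orthogonal direction: a disjoint encoding
theta_verb_opposite = [-1.0, 0.0]  # opposite direction

print(cosine_similarity(theta_noun, theta_verb_shared))
print(cosine_similarity(theta_noun, theta_verb_disjoint))
print(cosine_similarity(theta_noun, theta_verb_opposite))
```

Values near 1 indicate a shared direction, values near 0 indicate orthogonal directions, and values near -1 indicate that the two probes read the same axis with opposite signs.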

We now know that there are encodings from which we can extract number from nouns and verbs, and that these encodings appear to be largely disjoint. However, we still do not know whether the encoding is spurious or functional.

[Figure 3 panels, cue position: Information loss (measured at the target) after erasing nouns’ number information at the cue position. NA performance drop after erasing number at the cue position.]

[Figure 3 panels, target position: Information loss (measured at the target) after erasing masked verbs’ number at the target position. NA performance drop after erasing number at the target position.]

Figure 3: Effect of our causal interventions on information recovery in subsequent layers (triangular matrices) and on the number agreement task (bar charts). Information loss is measured at the target position by a diagnostic probe; we display the probing accuracy drop compared to when no intervention was performed. The legend in the bar charts indicates what category the amnesic projectors have been trained on. Majority represents the difference in performance between BERT and a trivial baseline which always guesses the majority label.

7.2 Does the model use these encodings?

The patterns previously observed suggest there is a linear encoding from which grammatical number can be extracted from BERT’s representations. We, however, cannot yet determine whether these encodings are actually those used by the model to make predictions. We now answer this question taking our proposed usage-based perspective, studying the impact of linearly removing number information at both the cue and target positions.[8] We evaluate the model’s change in behavior, as measured by its performance on the number agreement (NA) task.

[8] The number of dimensions removed by our amnesic projectors in each layer and category is presented in Tab. 1.

The triangular matrices in Fig. 3 show the decrease in how much information is extractable at the target position after the interventions are applied. The bar charts in Fig. 3 show BERT’s accuracy drops on the NA task (as measured at the output level). By comparing these results, we find a strong alignment between the information lost across layers and the damage caused to the performance on the task: irreversible information losses resulting from our intervention are mirrored by a performance decrease on the NA task. This alignment confirms that the model indeed uses the linear information erased by our probes. In other words, we have found the probed property’s functional encoding.

7.3 Does BERT use the same encoding for verbs and nouns?

We now return to the question of whether nouns and verbs share a functional encoding of number, or whether BERT encodes number differently for them. To answer this question, we investigate the impact of removing a category’s encoding from another category, e.g., applying an amnesic projector learned on verbs to a noun. In particular, we measure how these interventions decrease BERT’s performance in our behavioral task. The bar charts in Fig. 3 present these results.

We observe that each category’s projector has a different impact on performance depending on whether it is applied to the cue or the target. The cue-position bar chart in Fig. 3, for instance, shows that using the verb’s, or masked verb’s, projector to erase information at the cue’s (i.e., the noun’s) position does not hurt the model. It is similarly unimpactful (as shown in the target-position bar chart of Fig. 3) to use the noun’s projectors to erase a target’s (i.e., the masked verb’s) number information. Further, the projector learned on the mixed set of representations does affect the cue, but has little effect on the target. Together, these results confirm that BERT relies on rather distinct encodings of number information for nouns and verbs.[9]

[9] A potential criticism of amnesic probing is that it may remove more information than necessary. Cross-testing our amnesic probes, however, results in little effect on BERT’s behavior. It is thus likely that they are not overly harming our model. Further, we also run a control experiment proposed by elazar2021amnesic, removing random directions at each layer (instead of the ones found by our amnesic probes). These results are displayed in the appendix in Tab. 1.

These experiments allow us to make stronger claims about BERT’s encoding of number information. First, the fact that our interventions have a direct impact on BERT’s behavioral output confirms that the encoding we erase actually bears number information as used by the model when making predictions. Second, the observation from Fig. 2—that number information could be encoded orthogonally for nouns and verbs—is confirmed from a usage-based perspective. Indeed, using amnesic probes trained on nouns has no impact when applied to masked verbs, and amnesic probes trained on verbs have no impact when applied to nouns. These fine-grained differences in encoding may affect larger-scale probing studies if one’s goal is to understand the inner functioning of a model. Together, these results invite us to employ diagnostic probes more carefully, as the encoding found may not be actually used by the model.

7.4 Where does number erasure affect the model?

Once we have found which encoding the model uses, we can pinpoint at which layers the information is passed from the cue to the target. To that end, we observe how interventions applied in each layer affect performance. We know number information must be passed from the cue to the target’s representations—otherwise the model cannot solve the task. Therefore, applying causal interventions to remove number information should harm the model’s behavioral performance when applied to: (i) the cue’s representations before the transfer occurs; (ii) the target’s representations after the transfer occurred.

Interestingly, we observe that target interventions are only harmful after the 9th layer, while noun interventions only hurt up to the 8th layer (again, shown in Fig. 3). This suggests that the cue passes its number information in the first 8 layers, and that the target stops acquiring number information in the last three layers. While we see a clear stop in the transfer of information after layer 8, Fig. 3 shows that the previous layers’ contribution decreases slowly up to that layer. We thus conclude that information is passed in the layers before layer 8; however, we concede that our analysis alone makes it difficult to pinpoint exactly which layers.

7.5 Where does attention pruning affect number transfer?

Finally, in our last experiments, we complement our analysis by performing attention removal to investigate how and where information is transmitted from the cue to the target position. This causal intervention first serves the purpose of identifying the layers where information is transmitted. Further, we wish to understand whether information is passed directly, or through intermediary tokens. To this end, we look at the effect on NA performance after: (i) cutting direct attention from the target to the cue at specific layers, (ii) cutting attention from all tokens to the cue (as information could be first passed to intermediate tokens, which the target could attend to in subsequent layers).[10] Specifically, we perform these interventions in ranges of layers (from a first intervened layer up to a last intervened layer). We report number agreement accuracy drops in Fig. 4.[11]

[10] klafka-ettinger-2020-spying, for instance, showed that number information of a given token was distributed to neighboring tokens in the upper layers.

[11] We detail these interventions in App. C.
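The two attention-removal variants can be sketched on a toy matrix of attention scores. This is only one plausible way to cut attention, shown under stated assumptions: the scores are made up, a single attention head is considered, and the cut is implemented by setting the blocked score to negative infinity before the softmax so that the remaining weights renormalize. The exact procedure used in the experiments is the one detailed in App. C, and the function names here are hypothetical.

```python
import math


def masked_attention_weights(scores, blocked_positions):
    """Softmax over one query's scores, with blocked key positions forced to zero."""
    if len(blocked_positions) >= len(scores):
        raise ValueError("at least one key position must remain unblocked")
    masked = [float("-inf") if i in blocked_positions else s
              for i, s in enumerate(scores)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def cut_attention_to_cue(score_matrix, cue, query_positions):
    """Remove attention to the cue for the given query positions only.

    score_matrix[q][k] is the raw score of query position q for key position k.
    """
    return [
        masked_attention_weights(row, {cue} if q in query_positions else set())
        for q, row in enumerate(score_matrix)
    ]


# Toy scores for a 3-token sentence: position 0 is the cue, position 2 the target.
scores = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
direct_cut = cut_attention_to_cue(scores, cue=0, query_positions={2})   # variant (i)
full_cut = cut_attention_to_cue(scores, cue=0, query_positions={1, 2})  # variant (ii)
print(direct_cut[2], full_cut[1])
```

In variant (i) the intermediate token at position 1 can still read the cue and relay its number to the target in a later layer, which is the indirect path that variant (ii) closes. Whether the cue keeps attending to itself in variant (ii) is a design choice; here it does.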

The diagonals of this figure show that removing attention from a single layer has essentially no effect. Further, cutting attention from layers 6 to 10 suffices to observe a near-maximal effect for direct attention. Interestingly, it is at those layers that we see the transition between it being more harmful to apply amnesic projectors to the cue and it being more harmful to apply them to the target (in § 7.4). However, while those layers play a role in carrying number information to the target position, the drop is relatively modest when cutting only direct attention. Cutting attention from all tokens to the cue, in turn, has a significant effect on performance, and is maximal for layers 2 to 8. This first suggests that, while other clues in the sentence could indicate the target verb’s number (such as a noun’s determiner), the noun itself is the core source of number information. Further, this shows the target can get information from intermediate tokens, instead of number being passed exclusively through direct attention. (See App. E for further experiments.)

(a) Removing attention from the target to the cue only
(b) Removing attention from all tokens to the cue
Figure 4: Number agreement task performance drops after performing attention removal. The attention cut is performed on a range of layers. Rows and columns, respectively, represent the first and last intervened layer.

8 Discussion

8.1 Information Extractability and Usage

Following the maxim that correlation is not causation, carefully designed analyses should be performed to give strong evidence for a property being encoded in a model’s representations. We have shown that diagnostic probes can uncover encodings that are not necessarily useful to the model’s predictions. Indeed, we show that BERT decodes grammatical number from orthogonal subspaces for nouns and verbs, even though simple linear classifiers can separate a population of mixed vectors. This in turn raises the question of whether complexity alone, a feature much discussed in the literature (pimentel-etal-2020-pareto; voita-titov-2020-information), is enough to evaluate probes, as finding a simple encoding is not enough evidence that the encoding is actually useful to the model.

8.2 From Linguistic Properties to Encoding

Using a pipeline similar to ours, ravfogel-etal-2021-counterfactual recently investigated whether a model was solving the number agreement task in a manner that is linguistically plausible. In this paper, we have shown that even a relatively simple property’s encoding (i.e., grammatical number’s) can hide subtleties which only surface after carrying out a fine-grained analysis. Indeed, despite the fact that number is a single morpho-syntactic property common to nouns and verbs, we show that BERT uses separate representations for each category. This fine-grained difference in representation informs us that one should be cautious when choosing a property to probe in a given model. Indeed, the model could be representing the property in a subtler way than a researcher would initially expect.

8.3 Understanding BERT’s Inner Workings

Throughout this work, our results allow us to identify how number is encoded by our model, and where it is transferred across token positions, as confirmed by behavioral observations. Our results point towards number information being transmitted from cue to target up to the 8th layer. Our results also reveal that information transfer does not result from direct attention only, which confirms previous observations that information is distributed across neighboring tokens in the sentence (klafka-ettinger-2020-spying).

It is not easy to dissect the inner mechanisms which allow large pre-trained models to acquire their impressive abilities. However, identifying how information is encoded and where it is transferred across layers reduces the scope of where to look for answers. Further, with more reliable accounts of the encoding structures used by a model when decoding a property, we might be able to operationalize a larger set of probing questions. Given a better understanding of how BERT structures number information, for instance, we can now try to ask how it identifies the subject from which a verb should get its number. (The causal interventions on the training data proposed by wei-etal-2021-frequency, for instance, could be interesting for such an analysis.)

9 Conclusion

Our analysis allows us to track how a simple morpho-syntactic property, grammatical number, is encoded across BERT’s layers and where it is transferred between them before being used in the model’s predictions. Using carefully chosen causal interventions, we demonstrate that forgetting number information impacts both: (i) BERT’s behavior and (ii) how much information is extractable from BERT’s inner layers. Further, the effects of our interventions on these two, i.e., behavior and information extractability, line up satisfyingly, and reveal the encoding of number to be orthogonal for nouns and verbs. This finding is surprising given that number is a linguistic property common to both parts of speech. Finally, we are also able to identify the layers in which the transfer of information occurs, and find that the information is not passed only directly but also through intermediate tokens. Our ability to concretely evaluate our interventions’ impact is due to our focus on grammatical number and the number agreement task, which directly align probed information and behavioral performance.

Ethics Statement

The authors foresee no ethical concerns with the work presented in this paper.


Acknowledgements

We thank Josef Valvoda, the anonymous reviewers, and the meta-reviewer, for their invaluable feedback in improving this paper. Karim Lasri’s work is funded by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). Ryan Cotterell acknowledges support from the Swiss National Science Foundation (SNSF) as part of the “The Forgotten Role of Inductive Bias in Interpretability” project. Tiago Pimentel is supported by a Meta Research PhD Fellowship.


Appendix A Diagnostic Probing Cross-Evaluation

In addition to comparing the angles of our diagnostic probes trained on different categories, we performed cross-evaluation of our trained diagnostic probes. In this setting, we trained probes on one category and tested them on the others. Fig. 5 presents our cross-evaluation results. The performance of probes evaluated in one category, but trained on another, again suggests that BERT encodes number differently across lexical categories. Interestingly, in the lower layers, the probe tested on nouns (top-left) systematically guesses the wrong number when trained on verbs, and vice versa (top-right). This may be due to token ambiguity, as some singular nouns (e.g., “hit”) are also plural verbs. This is further evidence that the encoding might be different for nouns and verbs, though this analysis still cannot tell us whether this is true from our usage-based perspective. Additionally, the mixed results (Fig. 5; bottom-right) show that it is possible to linearly separate both nouns and verbs with a single linear classifier trained on both categories, reaching perfect performance on all other categories, including masked verbs (bottom-left).
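To make the cross-evaluation logic concrete, the following toy sketch trains a simple linear probe on one category and tests it on another. Everything in it is an assumption made for illustration: the synthetic data (number encoded along one axis for nouns and along an orthogonal axis for verbs), the mean-difference probe, and the helper names `fit_probe`, `accuracy`, and `toy_category`. It does not use BERT representations and does not reproduce the results in Fig. 5.

```python
import numpy as np


def fit_probe(x, y):
    """Mean-difference linear probe: returns (weight vector, bias)."""
    mean_plural = x[y == 1].mean(axis=0)
    mean_singular = x[y == 0].mean(axis=0)
    w = mean_plural - mean_singular
    b = -w @ (mean_plural + mean_singular) / 2  # boundary at the midpoint
    return w, b


def accuracy(probe, x, y):
    """Fraction of tokens whose number the probe predicts correctly."""
    w, b = probe
    return float(((x @ w + b > 0).astype(int) == y).mean())


def toy_category(axis, rng, n=400, dim=16):
    """Toy vectors whose number (0 = singular, 1 = plural) lives along one axis."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(scale=0.5, size=(n, dim))
    x[:, axis] += np.where(y == 1, 2.0, -2.0)
    return x, y


rng = np.random.default_rng(0)
nouns, verbs = toy_category(0, rng), toy_category(1, rng)
mixed = (np.vstack([nouns[0], verbs[0]]), np.concatenate([nouns[1], verbs[1]]))

noun_probe, mixed_probe = fit_probe(*nouns), fit_probe(*mixed)
print("noun probe on nouns: ", accuracy(noun_probe, *nouns))
print("noun probe on verbs: ", accuracy(noun_probe, *verbs))
print("mixed probe on verbs:", accuracy(mixed_probe, *verbs))
```

In this toy setting the probe trained on nouns transfers poorly to verbs, while a single probe trained on the mixed population separates both. This mirrors the point that linear separability of a mixed population does not imply a shared encoding.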

Figure 5: Probes cross-evaluation. Each plot corresponds to a test category, and colors correspond to the category used for training. Solid lines represent the percentage of majority-class (plural vs singular) tokens; dashed lines represent the percentage of majority-class tokens per lemma, averaged across lemmas.

Appendix B $\mathcal{V}$-information and mutual information

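While a probing classifier’s performance is often measured with accuracy metrics, in their analysis, pimentel-etal-2020-information defined probing as extracting a mutual information. Formally, we write

$$\mathrm{I}(R; N) = \mathrm{H}(N) - \mathrm{H}(N \mid R)$$

where $R$ and $N$ are, respectively, representation-valued and number-valued random variables. The mutual information, however, is a mostly theoretical value that is hard to approximate in practice.

To compute the $\mathcal{V}$-information instead, we must first define a variational family $\mathcal{V}$ of interest, which we define as the set of linear transformations representable by eq. 5. We can then compute the $\mathcal{V}$-information as:

$$\mathrm{I}_{\mathcal{V}}(R \rightarrow N) = \mathrm{H}_{\mathcal{V}}(N) - \mathrm{H}_{\mathcal{V}}(N \mid R)$$

where $\mathcal{V}$-entropies are defined as:

$$\mathrm{H}_{\mathcal{V}}(N) = \inf_{q \in \mathcal{V}} \mathbb{E}_{n \sim p(n)} \left[ -\log q(n) \right], \qquad \mathrm{H}_{\mathcal{V}}(N \mid R) = \inf_{q \in \mathcal{V}} \mathbb{E}_{(n, r) \sim p(n, r)} \left[ -\log q(n \mid r) \right]$$

This $\mathcal{V}$-information can vary in the range $[0, \mathrm{H}_{\mathcal{V}}(N)]$; thus a more interpretable value is the $\mathcal{V}$-uncertainty, which we define here as:

$$\mathrm{U}_{\mathcal{V}}(R \rightarrow N) = \frac{\mathrm{I}_{\mathcal{V}}(R \rightarrow N)}{\mathrm{H}_{\mathcal{V}}(N)}$$

We note that the $\mathcal{V}$-information lower-bounds the mutual information: $\mathrm{I}_{\mathcal{V}}(R \rightarrow N) \leq \mathrm{I}(R; N)$. It follows that, if we can extract some $\mathcal{V}$-information from a set of representations, they contain at least the same amount of information in the more classic sense of shannon1948mathematical.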
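As an illustration, the sketch below estimates these quantities from a probe's predicted probabilities. It rests on several assumptions: the $\mathcal{V}$-uncertainty is taken to be the $\mathcal{V}$-information normalised by the $\mathcal{V}$-entropy of the labels, the infimum over $\mathcal{V}$ is approximated by a single already-trained probe, the marginal entropy is estimated with the empirical label distribution, and all function names are hypothetical.

```python
import math
from collections import Counter


def v_entropy_marginal(labels):
    """Estimate H_V(N) with the best constant predictor (empirical label distribution)."""
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in Counter(labels).values())


def v_entropy_conditional(labels, probe_probs):
    """Estimate H_V(N | R) as the probe's average negative log-likelihood."""
    return -sum(math.log(p[y]) for y, p in zip(labels, probe_probs)) / len(labels)


def v_uncertainty(labels, probe_probs):
    """V-information divided by the marginal V-entropy (assumed normalisation)."""
    h_n = v_entropy_marginal(labels)
    h_n_given_r = v_entropy_conditional(labels, probe_probs)
    return (h_n - h_n_given_r) / h_n


# Toy labels (0 = singular, 1 = plural) and per-token probe outputs.
labels = [0, 1, 0, 1]
confident = [{0: 0.9, 1: 0.1}, {0: 0.1, 1: 0.9}, {0: 0.9, 1: 0.1}, {0: 0.1, 1: 0.9}]
uninformative = [{0: 0.5, 1: 0.5}] * 4

print("confident probe:    ", v_uncertainty(labels, confident))
print("uninformative probe:", v_uncertainty(labels, uninformative))
```

A probe that is no better than the label prior yields a value of zero, while a probe that predicts every label with certainty would yield one. In practice the estimate should be computed on held-out data.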

Appendix C Attention Intervention

Formally, let $A^{(\ell, h)} \in \mathbb{R}^{T \times T}$ be a model’s attention weights for a given layer $\ell$, a head $h$, and a sentence with length $T$. (Our analyzed model, BERT base, has 12 layers, and 12 attention heads in each layer.) Further, we define a binary mask matrix $M \in \{0, 1\}^{T \times T}$. We can now perform an intervention by masking the attention weights of all heads in a layer. Given a layer $\ell$:

$$\hat{A}^{(\ell, h)} = A^{(\ell, h)} \odot M \quad \text{for all heads } h$$

where $\odot$ represents an elementwise product between two matrices. Now assume a given sentence with cue position $c$, and with target position $t$. In our intervention (i), matrix $M$ is set to all 1’s except for $M_{t, c} = 0$; the target’s attention to the cue is thus set to zero. In intervention (ii), we set $M_{i, c} = 0$ for every position $i$ and other positions to 1, which removes all attention to the cue.
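The two masks can be sketched as follows. This is a minimal illustration on random toy attention weights, not our experimental code: the helper names `build_mask` and `mask_attention` are hypothetical, the indexing convention (rows are attending positions, columns are attended positions) is an assumption, and the masked weights are not renormalised here.

```python
import numpy as np


def build_mask(seq_len, cue, target=None):
    """Binary mask over (attending position, attended position) pairs.

    With a target, only the target-to-cue entry is zeroed (intervention i).
    Without one, the whole cue column is zeroed (intervention ii).
    """
    mask = np.ones((seq_len, seq_len))
    if target is None:
        mask[:, cue] = 0.0
    else:
        mask[target, cue] = 0.0
    return mask


def mask_attention(attn, mask):
    """Apply the same mask to every head; attn has shape (heads, seq_len, seq_len)."""
    return attn * mask[None, :, :]


rng = np.random.default_rng(0)
heads, seq_len, cue, target = 12, 6, 1, 4
logits = rng.normal(size=(heads, seq_len, seq_len))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # toy attention weights

direct = mask_attention(attn, build_mask(seq_len, cue, target))  # intervention (i)
indirect = mask_attention(attn, build_mask(seq_len, cue))        # intervention (ii)
print(direct[0, target, cue], indirect[0, :, cue])
```

To intervene on a range of layers, the same mask would be applied to the attention weights of every layer between the first and last intervened layer.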

Appendix D Removing random directions from representations

Removing directions from intermediate spaces could harm the model’s normal functioning independently from removing our targeted property. We thus run a control experiment proposed by elazar2021amnesic, removing random directions at each layer (as opposed to the specific directions found by our amnesic probes). This experiment allows us to verify that the observed information loss and decrease in performance do not only result from removing too many directions. To do so, we remove an equal number of random directions at each layer. The results are displayed in Tab. 1 and show that removing randomly chosen directions has little to no effect compared to our targeted causal interventions.
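A minimal sketch of such a control is given below, assuming that random directions are removed by projecting representations onto the orthogonal complement of their span. The helper name `random_projector` and the toy Gaussian representations are illustrative assumptions; the hidden size of 768 is that of BERT base, and 24 is the number of directions listed for layer 8 of the masked verbs in Tab. 1.

```python
import numpy as np


def random_projector(dim, n_directions, rng):
    """Projector onto the orthogonal complement of n random directions in R^dim."""
    directions = rng.normal(size=(dim, n_directions))
    basis, _ = np.linalg.qr(directions)  # orthonormal basis, shape (dim, n_directions)
    return np.eye(dim) - basis @ basis.T


rng = np.random.default_rng(0)
dim, n_directions = 768, 24
reps = rng.normal(size=(10, dim))  # toy stand-in for token representations

projector = random_projector(dim, n_directions, rng)
cleaned = reps @ projector  # representations with the random directions removed
print(cleaned.shape)
```

Matching the number of removed directions to the amnesic intervention at each layer keeps the two conditions comparable in terms of how much of the space is discarded.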

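| Category | Measure (per layer) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Masked Verbs | Number of Directions | 1 | 13 | 15 | 26 | 30 | 17 | 21 | 44 | 24 | 22 | 22 | 26 | 33 |
| Masked Verbs | Loss in Layers | 0.0 | 0.33 | 0.3 | 0.34 | 0.34 | 0.34 | 0.37 | 0.38 | 0.42 | 0.39 | 0.41 | 0.41 | 0.41 |
| Masked Verbs | Loss in Layers (Random) | 0.08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Masked Verbs | NA Performance Drop | 0.04 | 0.01 | 0.01 | 0.01 | 0.01 | 0.0 | 0.01 | 0.0 | 0.0 | 0.09 | 0.29 | 0.33 | 0.23 |
| Masked Verbs | NA Performance Drop (Random) | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.01 |
| Nouns | Number of Directions | 17 | 51 | 33 | 70 | 22 | 37 | 48 | 52 | 64 | 39 | 22 | 39 | 26 |
| Nouns | Loss in Layers | 0.49 | 0.37 | 0.39 | 0.38 | 0.37 | 0.38 | 0.43 | 0.4 | 0.43 | 0.4 | 0.37 | 0.41 | 0.4 |
| Nouns | Loss in Layers (Random) | 0.0 | 0.0 | 0.0 | 0.02 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 |
| Nouns | NA Performance Drop | 0.32 | 0.32 | 0.27 | 0.29 | 0.28 | 0.29 | 0.22 | 0.09 | 0.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| Nouns | NA Performance Drop (Random) | 0.06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |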
Table 1: Causal intervention results using either the default or random directions. For each category, we display the number of directions removed in each layer, the information loss resulting from amnesic interventions in each layer, and the effect on the NA task. We also display the loss in layers and the performance decrease on NA resulting from the removal of random directions as a control experiment.
(a) Cutting attention from the target to the cue only
(b) Cutting attention from all tokens to the cue
Figure 6: Agreement task performance drops resulting from attention interventions, as a function of linear distance between the cue and the target. The rows represent distances (from 1 to 15) and columns represent the intervened layers. Three conditions are tested: cutting attention only at the current layer (left), cutting attention starting from the current layer up to the last one (middle), and from the first layer to the current layer (right). The color map on the far right represents agreement scores without intervention for each linear distance.

Appendix E The effect of linear distance

Here, we test whether the linear distance between the cue and the target influences the effect of attention removal. Fig. 6 (left panels) shows that cutting attention from one layer has a negligible effect on performance regardless of distance, which is in line with results from the diagonals of Fig. 4. When cutting attention from several subsequent layers (Fig. 6, middle and right panels), we observe that the performance drop depends on the linear position, and decreases when the model is not faced with short-range agreement. This is not surprising as many of the attention maps attend to surrounding tokens (kovaleva-etal-2019-revealing). Extensive analysis targeting individual attention heads (instead of cutting all attention from a given layer) is necessary to examine both their contribution to the model’s successes, and their dependence on linear distance.