
A Latent-Variable Model for Intrinsic Probing

by   Karolina Stańczak, et al.

The success of pre-trained contextualized representations has prompted researchers to analyze them for the presence of linguistic information. Indeed, it is natural to assume that these pre-trained representations do encode some level of linguistic knowledge as they have brought about large empirical improvements on a wide variety of NLP tasks, which suggests they are learning true linguistic generalization. In this work, we focus on intrinsic probing, an analysis technique where the goal is not only to identify whether a representation encodes a linguistic attribute, but also to pinpoint where this attribute is encoded. We propose a novel latent-variable formulation for constructing intrinsic probes and derive a tractable variational approximation to the log-likelihood. Our results show that our model is versatile and yields tighter mutual information estimates than two intrinsic probes previously proposed in the literature. Finally, we find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.





1 Introduction

There have been considerable improvements to the quality of pre-trained contextualized representations in recent years (e.g., Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2020). These advances have sparked an interest in understanding what linguistic information may be lurking within the representations themselves (Poliak et al., 2018; Zhang & Bowman, 2018; Rogers et al., 2020, inter alia). One philosophy that has been proposed to extract this information is called probing, the task of training an external classifier to predict the linguistic property of interest directly from the representations. The hope of probing is that it sheds light onto how much linguistic knowledge is present in representations and, perhaps, how that information is structured. Probing has grown to be a fruitful area of research, with researchers probing for morphological (Tang et al., 2020; Ács et al., 2021), syntactic (Voita & Titov, 2020; Hall Maudslay et al., 2020; Ács et al., 2021), and semantic (Vulić et al., 2020; Tang et al., 2020) information.

Figure 1: The percentage overlap between the top-30 most informative number dimensions in BERT for the probed languages. Statistically significant overlap after Holm–Bonferroni family-wise error correction (Holm, 1979) is marked with an orange square.

In this paper, we focus on one type of probing known as intrinsic probing (Dalvi et al., 2019; Torroba Hennigen et al., 2020), a subset of which specifically aims to ascertain how information is structured within a representation. This means that we are not solely interested in determining whether a network encodes the tense of a verb, but also in pinpointing exactly which neurons in the network are responsible for encoding the property. Unfortunately, the naïve formulation of intrinsic probing requires one to analyze all possible combinations of neurons, which is intractable even for the smallest representations used in modern-day NLP. For example, analyzing all combinations of 768-dimensional BERT word representations would require us to train $2^{768}$ different probes, one for each combination of neurons, which far exceeds the estimated number of atoms in the observable universe.

To obviate this difficulty, we introduce a novel latent-variable probe for discriminative intrinsic probing. The core idea of this approach is that, instead of training a different probe for each combination of neurons, we introduce a subset-valued latent variable. We approximately marginalize over the latent subsets using variational inference. Training the probe in this manner results in a single set of parameters that works well across all possible subsets. We propose two variational families to model the posterior over the latent subset-valued random variables, both based on common sampling designs: Poisson sampling, which selects each neuron based on an independent Bernoulli trial, and conditional Poisson sampling, which first samples a fixed number of neurons from a uniform distribution and then a subset of neurons of that size (Lohr, 2019). Conditional Poisson sampling offers the modeler more control over the distribution over subset sizes; they may pick the parametric distribution themselves.

We compare both variants to the two main intrinsic probing approaches we are aware of in the literature (§ 5). To do so, we train probes for 29 morphosyntactic properties across 6 languages (English, Portuguese, Polish, Russian, Arabic, and Finnish) from the Universal Dependencies (UD; Nivre et al., 2017) treebanks. We show that, in general, both variants of our method yield tighter estimates of the mutual information, though the model based on conditional Poisson sampling yields slightly better performance. This suggests that they are better at quantifying the informational content encoded in m-BERT representations (Devlin et al., 2019). We make two typological findings when applying our probe. We show that there is a difference in how information is structured depending on the language, with certain language–attribute pairs requiring more dimensions to encode relevant information. We also analyze whether neural representations are able to learn cross-lingual abstractions from multilingual corpora. We find evidence in favor of this hypothesis, observing a strong overlap in the most informative dimensions, especially for number (Fig. 1). In an additional experiment, we show that our method supports training deeper probes (§ 5.4), though the advantages of non-linear probes over their linear counterparts are modest.

2 Intrinsic Probing

The success behind pre-trained contextual representations such as BERT (Devlin et al., 2019) suggests that they may offer a continuous analogue of the discrete structures in language, such as the morphosyntactic attributes number, case, or tense. Intrinsic probing aims to recognize the parts of a network (assuming they exist) which encode such structures. In this paper, we will operate exclusively at the level of the neuron—in the case of BERT, this is one component of the 768-dimensional vector the model outputs. However, our approach can easily generalize to other settings, e.g., the layers in a transformer or filters of a convolutional neural network. Identifying individual neurons responsible for encoding linguistic features of interest has previously been shown to increase model transparency (Bau et al., 2019). In fact, knowledge about which neurons encode certain properties has also been employed to mitigate potential biases (Vig et al., 2020), for controllable text generation (Bau et al., 2019), and to analyze the linguistic capabilities of language models (Lakretz et al., 2019).

To formally describe our intrinsic probing framework, we first introduce some notation. We define $\Pi$ to be the set of values that some property of interest can take, e.g., the singular and plural values for the morphosyntactic number attribute. Let $\{(\pi^{(n)}, \mathbf{h}^{(n)})\}_{n=1}^{N}$ be a dataset of label–representation pairs, where $\pi^{(n)} \in \Pi$ is a linguistic property and $\mathbf{h}^{(n)} \in \mathbb{R}^{d}$ is a representation. Additionally, let $\mathcal{D}$ be the set of all neurons in a representation; in our setup, it is an integer range $\{1, \ldots, d\}$. In the case of BERT, we have $\mathcal{D} = \{1, \ldots, 768\}$. Given a subset of dimensions $C \subseteq \mathcal{D}$, we write $\mathbf{h}_{C}$ for the subvector of $\mathbf{h}$ which contains only the dimensions present in $C$.

Let $p_{\theta}(\pi \mid \mathbf{h}_{C})$ be a probe—a classifier trained to predict the property $\pi$ from a subvector $\mathbf{h}_{C}$. In intrinsic probing, our goal is to find the size-$k$ subset of neurons which is most informative about the property of interest. This may be written as the following combinatorial optimization problem (Torroba Hennigen et al., 2020):

$C^{\star} = \operatorname{argmax}_{C \subseteq \mathcal{D},\, |C| = k}\; \sum_{n=1}^{N} \log p_{\theta}\big(\pi^{(n)} \mid \mathbf{h}^{(n)}_{C}\big)$  (1)
To exhaustively solve eq. 1, we would have to train a probe for every one of the exponentially many subsets of size $k$. Thus, exactly solving eq. 1 is infeasible, and we are forced to rely on an approximate solution, e.g., greedily selecting the dimension that maximizes the objective. However, greedy selection alone is not enough to make solving eq. 1 manageable, because we must retrain the probe for every subset considered during the greedy selection procedure, i.e., we would end up training $\mathcal{O}(k \cdot d)$ classifiers. As an example, consider what would happen if one used a greedy selection scheme to find the 50 most informative dimensions for a property on 768-dimensional BERT representations. To select the first dimension, one would need to train 768 probes. To select the second dimension, one would train an additional 767, and so forth. After 50 dimensions, one would have trained 37893 probes. To address this problem, our paper introduces a latent-variable probe, which identifies a single set of parameters $\theta$ that can be used for any combination of neurons under consideration, allowing a greedy selection procedure to work in practice.
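To make the cost concrete, the probe counts above can be computed directly. This is a small illustrative sketch; the function names are ours, not the paper's:

```python
# Counting probe-training runs for intrinsic probing over d neurons.
from math import comb

d = 768  # BERT base hidden size

def exhaustive_probes(d, k):
    """Probes needed to score every size-k subset of d neurons."""
    return comb(d, k)

def greedy_probes_retrained(d, k):
    """Probes needed when the classifier is retrained for every candidate
    set during greedy forward selection: d + (d - 1) + ... + (d - k + 1)."""
    return sum(d - i for i in range(k))

print(exhaustive_probes(d, 2))        # 294528 candidate pairs already at k = 2
print(greedy_probes_retrained(d, 2))  # 1535 training runs (768 + 767)
```

A shared-parameter latent-variable probe replaces all of these per-subset training runs with a single one.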

3 A Latent-Variable Probe

The technical contribution of this work is a novel latent-variable model for intrinsic probing. Our method starts with a generic probabilistic probe $p_{\theta}(\pi \mid \mathbf{h}, C)$ which predicts a linguistic attribute $\pi$ given a subset $C$ of the hidden dimensions; $C$ is then used to subset $\mathbf{h}$ into $\mathbf{h}_{C}$. To avoid training a unique probe for every possible subset $C$, we propose to integrate a prior over subsets into the model and then to marginalize out all possible subsets of neurons:

$p_{\theta}(\pi \mid \mathbf{h}) = \sum_{C \subseteq \mathcal{D}} p_{\theta}(\pi \mid \mathbf{h}, C)\, p(C) = \sum_{C \subseteq \mathcal{D}} p_{\theta}(\pi \mid \mathbf{h}_{C})\, p(C)$  (2)

Due to this marginalization, our likelihood does not depend on any specific subset of neurons $C$. Throughout this paper we opt for a non-informative, uniform prior $p(C)$, but other distributions are also possible.

Our goal is to estimate the parameters $\theta$. We achieve this by maximizing the log-likelihood of the training data with respect to $\theta$. Unfortunately, directly computing this involves a sum over all possible subsets of $\mathcal{D}$—a sum with an exponential number of summands. Thus, we resort to a variational approximation. Let $q_{\phi}(C)$ be a distribution over subsets, parameterized by parameters $\phi$; we will use $q_{\phi}$ to approximate the true posterior distribution. Then, the log-likelihood is lower-bounded as follows:

$\log p_{\theta}(\pi \mid \mathbf{h}) \;\geq\; \mathbb{E}_{C \sim q_{\phi}}\big[\log p_{\theta}(\pi \mid \mathbf{h}_{C})\big] + \mathbb{E}_{C \sim q_{\phi}}\big[\log p(C)\big] + \mathrm{H}(q_{\phi})$  (3)

which follows from Jensen’s inequality, where $\mathrm{H}(q_{\phi})$ is the entropy of $q_{\phi}$ (see App. A for the full derivation).

Our likelihood term is general and can take the form of any objective function. This means that we can use this approach to train intrinsic probes with any type of architecture amenable to gradient-based optimization, e.g., neural networks. However, in this paper, we use a linear classifier unless stated otherwise. Further, note that eq. 3 is valid for any choice of $q_{\phi}$. We explore two variational families for $q_{\phi}$, each based on a common sampling technique. The first (herein Poisson) applies Poisson sampling (Hájek, 1964), which subjects each neuron to an independent Bernoulli trial. The second (Conditional Poisson; Aires, 1999) corresponds to conditional Poisson sampling, which can be defined as conditioning a Poisson sample on a fixed sample size.

3.1 Parameter Estimation

As mentioned above, exact computation of the log-likelihood is intractable due to the sum over all possible subsets of $\mathcal{D}$. Thus, we optimize the variational bound presented in eq. 3. We optimize the bound through stochastic gradient descent with respect to the model parameters $\theta$ and the variational parameters $\phi$, a technique known as stochastic variational inference (Hoffman et al., 2013). However, one final trick is necessary, since the variational bound still includes a sum over all subsets in the first term:

$\nabla_{\theta}\, \mathbb{E}_{C \sim q_{\phi}}\big[\log p_{\theta}(\pi \mid \mathbf{h}_{C})\big] \;\approx\; \frac{1}{M} \sum_{m=1}^{M} \nabla_{\theta} \log p_{\theta}\big(\pi \mid \mathbf{h}_{C^{(m)}}\big), \quad C^{(m)} \sim q_{\phi}$  (4)

where we take $M$ Monte Carlo samples to approximate the sum. In the case of the gradient with respect to $\phi$, we also have to apply the REINFORCE trick (Williams, 1992):

$\nabla_{\phi}\, \mathbb{E}_{C \sim q_{\phi}}\big[\log p_{\theta}(\pi \mid \mathbf{h}_{C})\big] \;\approx\; \frac{1}{M} \sum_{m=1}^{M} \log p_{\theta}\big(\pi \mid \mathbf{h}_{C^{(m)}}\big)\, \nabla_{\phi} \log q_{\phi}\big(C^{(m)}\big), \quad C^{(m)} \sim q_{\phi}$  (5)

where we again take $M$ Monte Carlo samples. This procedure leads to an unbiased estimate of the gradient of the variational approximation.
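As a concrete illustration, the Monte Carlo updates above can be sketched for a toy two-class linear probe with an independent-Bernoulli (Poisson-style) variational family. All names here are ours, the prior and entropy terms of eq. 3 are omitted for brevity, and this is a minimal sketch rather than the paper's implementation:

```python
# Toy stochastic variational training: Monte Carlo gradient for the probe
# parameters and a REINFORCE gradient for the variational parameters.
import math, random

random.seed(0)
d = 8                                  # toy representation size (768 for BERT)
W = [[0.0] * d for _ in range(2)]      # probe parameters theta (2 classes)
phi = [0.0] * d                        # variational logits, one per neuron

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_prob(label, h, mask):
    """log p_theta(pi | h_C): dimensions outside C are zeroed out."""
    scores = [sum(W[c][i] * h[i] * mask[i] for i in range(d)) for c in range(2)]
    m = max(scores)
    logz = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[label] - logz

def train_step(label, h, lr=0.1, num_samples=4):
    """One stochastic step on the first term of the bound (cf. eqs. 4 and 5)."""
    for _ in range(num_samples):
        # Sample a subset C via d independent coin flips.
        mask = [1 if random.random() < sigmoid(p) else 0 for p in phi]
        reward = log_prob(label, h, mask)
        # REINFORCE for phi: grad of log q(C) is (mask_i - sigmoid(phi_i)).
        for i in range(d):
            phi[i] += lr * reward * (mask[i] - sigmoid(phi[i])) / num_samples
        # Exact softmax-regression gradient for the probe parameters W.
        probs = [math.exp(log_prob(c, h, mask)) for c in range(2)]
        for c in range(2):
            g = (1.0 if c == label else 0.0) - probs[c]
            for i in range(d):
                W[c][i] += lr * g * h[i] * mask[i] / num_samples
```

Because the probe is trained under randomly sampled masks, the learned parameters remain usable for any subset of dimensions presented at evaluation time.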

3.2 Choice of Variational Family $q_{\phi}$

We consider two choices of variational family $q_{\phi}$, both based on sampling designs (Lohr, 2019). Each defines a parameterized distribution over all subsets of $\mathcal{D}$.

Poisson Sampling.

Poisson sampling is one of the simplest sampling designs. In our setting, each neuron $i$ is given a unique non-negative weight $w_{i}$. This gives us the following parameterized distribution over subsets:

$q_{\phi}(C) = \prod_{i \in C} \frac{w_{i}}{1 + w_{i}} \prod_{i \notin C} \frac{1}{1 + w_{i}}$  (6)

The formulation in eq. 6 shows that taking a sample corresponds to $d$ independent coin flips—one for each neuron—where the probability of heads is $\frac{w_{i}}{1 + w_{i}}$. The entropy of a Poisson sampling design may be computed in $\mathcal{O}(d)$ time:

$\mathrm{H}(q_{\phi}) = -\sum_{i=1}^{d} \big( p_{i} \log p_{i} + (1 - p_{i}) \log (1 - p_{i}) \big)$  (7)

where $p_{i} = \frac{w_{i}}{1 + w_{i}}$. The gradient of eq. 7 may be computed automatically through backpropagation. Poisson sampling automatically modulates the size of the sampled set, and we have the expected size $\mathbb{E}\big[|C|\big] = \sum_{i=1}^{d} p_{i}$.
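The quantities above admit a direct implementation; this is a minimal sketch under the weight parameterization just described (function names are ours):

```python
# Poisson sampling design: independent per-neuron inclusion with
# probability w_i / (1 + w_i); entropy and expected size are O(d) sums.
import math, random

def inclusion_probs(w):
    return [wi / (1.0 + wi) for wi in w]

def sample_subset(w, rng=random):
    """One draw C ~ q_phi via d independent coin flips."""
    return {i for i, p in enumerate(inclusion_probs(w)) if rng.random() < p}

def entropy(w):
    """H(q_phi): a sum of d independent Bernoulli entropies (eq. 7)."""
    h = 0.0
    for p in inclusion_probs(w):
        for q in (p, 1.0 - p):
            if q > 0.0:
                h -= q * math.log(q)
    return h

def expected_size(w):
    """E[|C|]: sum of the inclusion probabilities."""
    return sum(inclusion_probs(w))
```

For example, equal weights $w_i = 1$ give inclusion probability one half per neuron, so the expected subset size is $d/2$ and the entropy is $d \log 2$.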

Conditional Poisson Sampling.

We also consider a variational family that factors as follows:

$q_{\phi}(C) = q_{\phi}(C \mid \#C)\, q(\#C)$  (8)

where $\#C$ denotes the size of the subset $C$. In this paper, we take $q(\#C)$ to be uniform, but a more complex distribution, e.g., a Categorical, could be learned. We define $q_{\phi}(C \mid \#C)$ as a conditional Poisson sampling design. Similarly to Poisson sampling, conditional Poisson sampling starts with a unique positive weight $w_{i}$ associated with every neuron $i$. However, an additional cardinality constraint is introduced. This leads to the following distribution:

$q_{\phi}(C \mid \#C = k) = \frac{\prod_{i \in C} w_{i}}{\sum_{C' \subseteq \mathcal{D},\, |C'| = k} \prod_{i \in C'} w_{i}}$  (9)

A more elaborate dynamic program, which runs in $\mathcal{O}(d \cdot k)$ time, may be used to compute the normalizer of eq. 9 efficiently (Aires, 1999). We may further compute the entropy and its gradient in $\mathcal{O}(d \cdot k)$ time using the expectation semiring (Eisner, 2002; Li & Eisner, 2009). Sampling from $q_{\phi}(C \mid \#C)$ can be done efficiently using quantities computed when running the dynamic program used to compute the normalizer (Kulesza, 2012). In practice, we use the semiring implementations by Rush (2020).
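The normalizer and the sampler can be sketched with the standard dynamic program over elementary symmetric polynomials. This is our illustrative reconstruction of a conditional Poisson sampler, not the paper's code:

```python
# Conditional Poisson sampling: Z[j][m] is the total weight (sum of
# products of w_i) of size-m subsets of the first j neurons.
import random

def subset_normalizers(w, k):
    d = len(w)
    Z = [[0.0] * (k + 1) for _ in range(d + 1)]
    Z[0][0] = 1.0
    for j in range(1, d + 1):
        for m in range(0, k + 1):
            Z[j][m] = Z[j - 1][m]                      # neuron j-1 excluded
            if m > 0:
                Z[j][m] += w[j - 1] * Z[j - 1][m - 1]  # neuron j-1 included
    return Z  # O(d * k) time and space

def prob(C, w, k):
    """q_phi(C | #C = k): product of weights over the normalizer."""
    if len(C) != k:
        return 0.0
    Z = subset_normalizers(w, k)
    num = 1.0
    for i in C:
        num *= w[i]
    return num / Z[len(w)][k]

def sample(w, k, rng=random):
    """A backward pass over the same table yields an exact size-k sample."""
    Z = subset_normalizers(w, k)
    C, m = set(), k
    for j in range(len(w), 0, -1):
        if m > 0 and rng.random() < w[j - 1] * Z[j - 1][m - 1] / Z[j][m]:
            C.add(j - 1)
            m -= 1
    return C
```

With equal weights the design reduces to a uniform distribution over size-$k$ subsets, which is a useful sanity check.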

4 Experimental Setup

Our setup is virtually identical to the morphosyntactic probing setup of Torroba Hennigen et al. (2020). It consists of first automatically mapping treebanks from UD v2.1 (Nivre et al., 2017) to the UniMorph (McCarthy et al., 2018) schema, using publicly available conversion code. Then, we compute multilingual BERT (m-BERT) representations (using the implementation by Wolf et al., 2020) for every sentence in the UD treebanks. After computing the m-BERT representations for the entire sentence, we extract representations for individual words in the sentence and pair them with the UniMorph morphosyntactic annotations. We estimate our probes’ parameters on the UD training set and conduct greedy selection to approximate the objective in eq. 1 on the validation set; finally, we report results on the test set, i.e., we test whether the set of neurons found on the development set generalizes to held-out data. Additionally, we discard property values that occur fewer than 20 times across splits. When feeding $\mathbf{h}_{C}$ as input to our probes, we set any dimensions that are not present in $C$ to zero. We choose the number of Monte Carlo samples $M$ based on small-scale experiments, in which a small number of samples was found to work adequately. We compare the performance of the probes on 29 different language–attribute pairs (listed in App. B).

Since the performance of a probe on a specific subset of dimensions is related to both the subset itself (e.g., whether it is informative or not) and the number of dimensions being evaluated (e.g., if a probe is trained to expect 768 dimensions as input, it might work best when few or no dimensions are filled with zeros), we sample 100 subsets of dimensions at 5 different sizes (10, 50, 100, 250, and 500 dimensions) and compare every model’s performance on each of those subset sizes. Further details about training and hyperparameter settings are provided in App. C.
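The zero-masking step described above can be sketched in one line; the function name is ours:

```python
# Zero out every dimension outside the subset C, keeping the input length
# fixed so the same probe can score any subset of dimensions.
def mask_representation(h, C):
    return [x if i in C else 0.0 for i, x in enumerate(h)]
```

Keeping the vector length fixed is what allows a single set of probe parameters to be evaluated on arbitrary subsets.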

4.1 Baselines

We compare our latent-variable probe against two other recently proposed intrinsic probing methods as baselines.

  • Torroba Hennigen et al. (2020): Our first baseline is a generative probe that models the joint distribution of representations and their properties, where the representation distribution is assumed to be Gaussian. Torroba Hennigen et al. (2020) report that a major limitation of this probe is that if certain dimensions of the representations are not distributed according to a Gaussian distribution, then probe performance will suffer.

  • Dalvi et al. (2019): Our second baseline is a linear classifier, where dimensions not under consideration are zeroed out during evaluation (Dalvi et al., 2019; Durrani et al., 2020). (We note that they do not conduct intrinsic probing via dimension selection: instead, they use the absolute magnitude of the weights as a proxy for dimension importance. In this paper, we adopt the approach of Torroba Hennigen et al. (2020) and use the performance-based objective in eq. 1.) Their approach is a special case of our proposed latent-variable model, in which $q_{\phi}$ is fixed so that on every training iteration the entire set of dimensions $\mathcal{D}$ is sampled.

Additionally, we compare our methods to a naïve approach: a probe that is re-trained for every set of dimensions under consideration, greedily selecting the dimension that maximizes the objective (herein Upper Bound). The Upper Bound yields the tightest estimate of the mutual information; however, as mentioned in § 2, it is infeasible in general since it requires retraining for every different combination of neurons. For comparison, on English number, on an Nvidia RTX 2070 GPU, our Poisson, Gaussian, and Linear experiments take a few minutes or even seconds to run, whereas Upper Bound takes multiple hours. Due to this computational cost, we limit our comparisons with Upper Bound to 6 randomly chosen morphosyntactic attributes (English–Number, Portuguese–Gender and Noun Class, Polish–Tense, Russian–Voice, Arabic–Case, Finnish–Tense), each in a different language.

4.2 Metrics

We compare our proposed method to the baselines above under two metrics: accuracy and mutual information (MI). Mutual information has recently been proposed as an evaluation metric for probes (Pimentel et al., 2020). Here, the MI is a function of a $\Pi$-valued random variable and a real-valued random variable $H_{C}$ over masked representations:

$\mathrm{MI}(\Pi; H_{C}) = \mathrm{H}(\Pi) - \mathrm{H}(\Pi \mid H_{C})$  (10)

where $\mathrm{H}(\Pi)$ is the inherent entropy of the property being probed and is constant with respect to $C$, and $\mathrm{H}(\Pi \mid H_{C})$ is the entropy over the property given the masked representations. Exact computation of the mutual information is intractable; however, we can lower-bound the MI by approximating $\mathrm{H}(\Pi \mid H_{C})$ with our probe’s average negative log-likelihood on held-out data. See Brown et al. (1992) for a derivation.

We normalize the mutual information (NMI) by dividing the MI by the entropy $\mathrm{H}(\Pi)$, which turns it into a percentage and is, arguably, more interpretable. We refer the reader to Gates et al. (2019) for a discussion of the normalization of MI.
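The bound and its normalization can be computed from held-out probe log-likelihoods as follows; this is a minimal sketch where $\mathrm{H}(\Pi)$ is estimated from the empirical label distribution (function names are ours):

```python
# MI lower bound: H(Pi) minus the probe's average negative log-likelihood
# on held-out data; NMI divides this by H(Pi).
import math
from collections import Counter

def label_entropy(labels):
    """Empirical (plug-in) estimate of H(Pi) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mi_lower_bound(labels, heldout_log_likelihoods):
    avg_nll = -sum(heldout_log_likelihoods) / len(heldout_log_likelihoods)
    return label_entropy(labels) - avg_nll

def nmi(labels, heldout_log_likelihoods):
    return mi_lower_bound(labels, heldout_log_likelihoods) / label_entropy(labels)
```

A perfectly confident, perfectly calibrated probe attains NMI 1, while a probe that merely outputs the label marginals attains NMI 0; a badly miscalibrated probe can push the estimate negative.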

We also report accuracy, which is a standard measure for evaluating probes, as it is for classifiers in general. However, accuracy can be a misleading measure, especially on imbalanced datasets, since it considers solely whether predictions are correct.

4.3 What Makes a Good Probe?

Since we report a lower bound on the mutual information (§ 4.2), we deem the best probe to be the one that yields the tightest mutual information estimate or, in other words, the one that achieves the highest mutual information estimate; this is equivalent to achieving the best cross-entropy on held-out data, the standard evaluation metric for language modeling.

However, in the context of intrinsic probing, the topic of primary interest is what the probe reveals about the structure of the representations. For instance, does the probe reveal that the information encoded in the embeddings is focalized or dispersed across many neurons? Several prior works (e.g., Lakretz et al., 2019) focus on the single neuron setting, which is a special, very focal case. To engage with this prior work, we compare probes not only with respect to their performance (MI and accuracy), but also with respect to the size of the subset of dimensions being evaluated, i.e., the size of set .

We acknowledge that there is a disparity between the quantitative evaluation we employ, in which probes are compared based on their MI estimates, and qualitative nature of intrinsic probing, which aims to identify the substructures of a model that encode a property of interest. However, it is non-trivial to evaluate fundamentally qualitative procedures in a large-scale, systematic, and unbiased manner. Therefore, we rely on the quantitative evaluation metrics presented in § 4.2, while also qualitatively inspecting the implications of our probes.

5 Results

In this section, we present the results of our empirical investigation. First, we address our main research question: Does our latent-variable probe presented in §3 outperform previously proposed intrinsic probing methods (§5.1)? Second, we analyze the structure of the most informative m-BERT neurons for the different morphosyntactic attributes we probe for (§5.2). Finally, we investigate whether knowledge about morphosyntax encoded in neural representations is shared across languages (§5.3). In §5.4, we show that our latent-variable probe is flexible enough to support deep neural probes.

5.1 How Do Our Methods Perform?

The main question we ask is how the performance of our models compares to existing intrinsic probing approaches. To investigate this, we compare the performance of the Poisson and Conditional Poisson probes to Linear (Dalvi et al., 2019) and Gaussian (Torroba Hennigen et al., 2020). Refer to § 4.3 for a discussion of the limitations of our method.

                     Number of dimensions
                     10     50     100    250    500
  vs. Gaussian:
    C. Poisson       0.50   0.58   0.70   0.99   1.00
    Poisson          0.21   0.49   0.66   0.98   1.00
  vs. Linear:
    C. Poisson       0.99   1.00   1.00   1.00   0.98
    Poisson          0.95   0.99   1.00   1.00   0.97

Table 1: Proportion of experiments where Conditional Poisson (C. Poisson) and Poisson beat the benchmark models Gaussian and Linear in terms of NMI. For each of the subset sizes, we sampled 100 different subsets of BERT dimensions at random.
Table 2: Mean and standard deviation of NMI for the Poisson, Conditional Poisson, Linear (Dalvi et al., 2019), and Gaussian (Torroba Hennigen et al., 2020) probes for all language–attribute pairs (top), and mean and standard deviation of NMI for the Conditional Poisson, Poisson, and Upper Bound probes for 6 selected language–attribute pairs (bottom). For each subset size considered, we take our averages over 100 randomly sampled subsets of BERT dimensions.

Figure 2: Comparison of the Poisson, Conditional Poisson, Linear (Dalvi et al., 2019) and Gaussian (Torroba Hennigen et al., 2020) probes. We use the greedy selection approach in Eq. 1 to select the most informative dimensions, and average across all language–attribute pairs we probe for.

In general, Conditional Poisson tends to outperform Poisson at lower dimensions; however, Poisson tends to catch up as more dimensions are added. Our results suggest that both variants of our latent-variable model from § 3 are effective and generally outperform the Linear baseline, as shown in Tab. 1. The Gaussian baseline tends to perform similarly to Conditional Poisson when we consider subsets of 10 dimensions, and it outperforms Poisson substantially. However, for larger subsets, both Conditional Poisson and Poisson are preferable. We believe that the robust performance of Gaussian in the low-dimensional regime can be attributed to its ability to model non-linear decision boundaries (Murphy, 2012, Chapter 4).

The trends above are corroborated by a comparison of the mean NMI (Tab. 2, top) achieved by each of these probes for different subset sizes. However, in terms of accuracy (see Tab. 3 in App. D), while both Conditional Poisson and Poisson generally outperform Linear, Gaussian tends to achieve higher accuracy than our methods. Nevertheless, Gaussian’s performance in terms of NMI is not stable and can yield low or even negative mutual information estimates across all subsets of dimensions. Adding a new dimension can never decrease the true mutual information, so the observed decreases occur because the generative model deteriorates upon adding another dimension, which validates Torroba Hennigen et al.’s (2020) claim that some dimensions are not adequately modeled by the Gaussian assumption. While these results suggest that Gaussian may be preferable when comparing probes based on accuracy, its instability under NMI suggests that this edge in accuracy comes at a hefty cost in calibration (Guo et al., 2017): while accuracy only cares about whether predictions are correct, NMI penalizes miscalibrated predictions, since it is proportional to the negative log-likelihood.

Further, we compare the Poisson and Conditional Poisson probes to the Upper Bound baseline. This is expected to be the highest-performing probe, since it is re-trained for every subset under consideration, and indeed this expectation is confirmed by the results in Tab. 2 (bottom). The difference between our probes’ performance and the Upper Bound baseline’s performance can be seen as the cost of sharing parameters across all subsets of dimensions, and an effective intrinsic probe should minimize this cost.

We also conduct a direct comparison of Linear, Gaussian, Poisson and Conditional Poisson when used to identify the most informative subsets of dimensions. The average MI reported by each model across all 29 morphosyntactic language–attribute pairs is presented in Fig. 2. On average, Conditional Poisson offers comparable performance to Gaussian at low dimensionalities for both NMI and accuracy, though the latter tends to yield a slightly higher (and thus a tighter) bound on the mutual information. However, as more dimensions are taken into consideration, our models vastly outperform Gaussian. Poisson and Conditional Poisson perform comparably at high dimensions, but Conditional Poisson performs slightly better for 1–20 dimensions. Poisson outperforms Linear at high dimensions, and Conditional Poisson outperforms Linear for all dimensions considered. These effects are less pronounced for accuracy, which we believe to be due to accuracy’s insensitivity to a probe’s confidence in its prediction.

5.2 Information Distribution

We compare the performance of the Conditional Poisson probe for each attribute across all available languages in order to better understand the relatively high NMI variance across results (see Tab. 2). In Fig. 3 we plot the average NMI for gender and observe that languages with two genders (Arabic and Portuguese) achieve higher performance than languages with three genders (Russian and Polish), an intuitive result given the increased task complexity. Further, we see that the slopes for both Russian and Polish are flatter, especially at lower dimensions. This implies that the information for Russian and Polish is more dispersed, i.e., more dimensions are needed to capture the relevant information.

Figure 3: Comparison of the average NMI for gender dimensions in BERT for each of the available languages. We use the greedy selection approach in eq. 1 to select the most informative dimensions.

5.3 Cross-Lingual Overlap

We compare the most informative m-BERT dimensions recovered by our probe across languages and find that, in many cases, the same set of neurons may express the same morphosyntactic phenomena across languages. For example, we find that Russian, Polish, Portuguese, English, and Arabic all have statistically significant overlap in the top-30 most informative neurons for number (Fig. 1). Similarly, we observe statistically significant overlap for gender (Fig. 5, left). This effect is particularly strong between Russian and Polish, where we additionally find statistically significant overlap between the top-30 neurons for case (Fig. 5, right). These results suggest that BERT may be leveraging data from other languages to develop a cross-lingually entangled notion of morphosyntax (Torroba Hennigen et al., 2020), and that this effect may be particularly strong between typologically similar languages. (In concurrent work, Antverg & Belinkov (2021) find evidence supporting a similar phenomenon.)
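One way to test such overlaps, sketched here under our own assumption of a hypergeometric null (two independently drawn top-$k$ sets of $d$ neurons), is a tail test followed by Holm's step-down correction; this is illustrative, not necessarily the paper's exact test:

```python
# Hypergeometric tail probability of an overlap at least as large as the
# one observed, plus Holm-Bonferroni family-wise error correction.
from math import comb

def overlap_pvalue(d, k, observed):
    """P(|A ∩ B| >= observed) for two uniform size-k subsets of d items."""
    total = comb(d, k)
    return sum(comb(k, m) * comb(d - k, k - m)
               for m in range(observed, k + 1)) / total

def holm_bonferroni(pvals, alpha=0.05):
    """Indices whose null is rejected under Holm's step-down procedure."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    rejected = set()
    for rank, i in enumerate(order):
        if pvals[i] > alpha / (len(pvals) - rank):
            break
        rejected.add(i)
    return rejected
```

For top-30 sets out of 768 neurons, even modest overlaps are highly unlikely under the null, which is why the shared dimensions in Fig. 1 are notable.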

5.4 How Do Deeper Probes Perform?

Figure 4: Comparison of a linear Conditional Poisson probe to non-linear MLP-1 and MLP-2 Conditional Poisson probes for selected language–attribute pairs. For each of the subset sizes shown on the x-axis, we sampled 100 different subsets of BERT dimensions at random.

Multiple papers have promoted the use of linear probes (Tenney et al., 2018; Liu et al., 2019), in part because they are ostensibly less likely to memorize patterns in the data (Zhang & Bowman, 2018; Hewitt & Liang, 2019), though this is subject to debate (Voita & Titov, 2020; Pimentel et al., 2020). Here we verify our claim from § 3 that our probe can be applied to any kind of discriminative probe architecture, since our objective function can be optimized using gradient descent.

We follow the setup of Hewitt & Liang (2019), and test MLP-1 and MLP-2 Conditional Poisson probes alongside a linear Conditional Poisson probe. The MLP-1 and MLP-2 probes are multilayer perceptrons (MLPs) with one and two hidden layers, respectively, and the Rectified Linear Unit (ReLU; Nair & Hinton, 2010) activation function.
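The forward pass of such an MLP-1 probe head can be sketched as follows; dimensions and parameter names are illustrative, not from the paper:

```python
# MLP-1 probe head: one ReLU hidden layer followed by a softmax over
# property values.
import math

def relu(v):
    return [max(0.0, x) for x in v]

def linear(x, W, b):
    """Affine map: W @ x + b with W given as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    z = sum(exps)
    return [e / z for e in exps]

def mlp1_probe(h_C, W1, b1, W2, b2):
    """p(pi | h_C) for a probe with one hidden ReLU layer."""
    return softmax(linear(relu(linear(h_C, W1, b1)), W2, b2))
```

An MLP-2 variant would simply insert a second ReLU layer before the output projection; either way the masked subvector is the only input the probe sees.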

In Fig. 4, we can see that our method not only works well for deeper probes, but also outperforms the linear probe in terms of NMI. However, at higher dimensionalities, the advantage of a deeper probe diminishes. We also find that the difference in performance between MLP-1 and MLP-2 is negligible.

6 Related Work

A growing interest in interpretability has led to a flurry of work in trying to assess exactly what pre-trained representations know about language. To this end, diverse methods have been employed, such as the construction of specific challenge sets that seek to evaluate how well representations model particular phenomena (Linzen et al., 2016; Gulordava et al., 2018; Goldberg, 2019; Goodwin et al., 2020), and visualization methods (Kádár et al., 2017; Rethmeier et al., 2020). Work on probing comprises a major share of this endeavor (Belinkov & Glass, 2019; Belinkov, 2021). This has ranged from focused studies on particular linguistic phenomena (e.g., subject–verb number agreement, Giulianelli et al., 2018) to broad assessments of contextual representations on a wide array of tasks (Şahin et al., 2020; Tenney et al., 2018; Conneau et al., 2018; Liu et al., 2019; Ravichander et al., 2021, inter alia).

Efforts have ranged widely, but most of these focus on extrinsic rather than intrinsic probing. Most work on the latter has focused primarily on ascribing roles to individual neurons through methods such as visualization (Karpathy et al., 2015; Li et al., 2016a) and ablation (Li et al., 2016b). For example, Lakretz et al. (2019) conduct an in-depth study of how long short-term memory networks (LSTMs; Hochreiter & Schmidhuber, 1997) capture subject–verb number agreement, and identify two units largely responsible for this phenomenon.

More recently, there has been a growing interest in extending intrinsic probing to collections of neurons. Bau et al. (2019) utilize unsupervised methods to identify important neurons, and then attempt to control a neural network’s outputs by selectively modifying them. Bau et al. (2020) pursue a similar goal in a computer vision setting, but ascribe meaning to neurons based on how their activations correlate with particular classifications in images, and are able to control these manually with interpretable results. Aiming to answer questions on interpretability in computer vision and natural language inference, Mu & Andreas (2020) develop a method to create compositional explanations of individual neurons and investigate the abstractions encoded in them. Vig et al. (2020) analyze how information related to gender and societal biases is encoded in individual neurons and how it is propagated through the model.

7 Conclusion

In this paper, we introduce a new method for training discriminative intrinsic probes that perform well on any subset of dimensions. To do so, we train a probing classifier with a subset-valued latent variable and demonstrate how the latent subsets can be marginalized out using variational inference. We propose two variational families, based on common sampling designs, to model the posterior over subsets: Poisson sampling and conditional Poisson sampling. We demonstrate that both variants outperform our baselines in terms of mutual information, and that the conditional Poisson variational family generally performs best. Further, we investigate how the information about each attribute is distributed across dimensions for all available languages. Finally, we find empirical evidence of overlap in the specific neurons used to encode morphosyntactic properties across languages.
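The two sampling designs named above can be illustrated directly. The following is a minimal sketch: Poisson sampling includes each dimension independently with its own probability, and conditional Poisson sampling conditions on a fixed subset size, implemented here by naive rejection (exact, differentiable machinery for training, as used in the paper, is not reproduced here).

```python
import random

def poisson_sample(probs, rng):
    """Poisson sampling: include dimension i independently with probability probs[i]."""
    return [i for i, p in enumerate(probs) if rng.random() < p]

def conditional_poisson_sample(probs, k, rng, max_tries=100_000):
    """Conditional Poisson sampling: a Poisson sample conditioned on having size k.

    Rejection sampling is the simplest (if inefficient) way to realize the
    conditioning; exact schemes based on dynamic programming also exist.
    """
    for _ in range(max_tries):
        s = poisson_sample(probs, rng)
        if len(s) == k:
            return s
    raise RuntimeError("rejection sampling did not produce a size-k subset")
```

Conditioning on the size k is what lets the conditional Poisson family place all of its mass on subsets of a fixed dimensionality, matching the fixed-size probing setting.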


  • Ács et al. (2021) Ács, J., Kádár, Á., and Kornai, A. Subword pooling makes a difference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 2284–2295, Online, April 2021. Association for Computational Linguistics. URL
  • Aires (1999) Aires, N. Algorithms to find exact inclusion probabilities for conditional Poisson sampling and Pareto πps sampling designs. Methodology And Computing In Applied Probability, 1(4):457–469, Dec 1999. ISSN 1573-7713. doi: 10.1023/A:1010091628740.
  • Antverg & Belinkov (2021) Antverg, O. and Belinkov, Y. On the pitfalls of analyzing individual neurons in language models. arXiv:2110.07483 [cs.CL], 2021.
  • Bau et al. (2019) Bau, A., Belinkov, Y., Sajjad, H., Durrani, N., Dalvi, F., and Glass, J. Identifying and controlling important neurons in neural machine translation. In International Conference on Learning Representations, 2019. URL
  • Bau et al. (2020) Bau, D., Zhu, J.-Y., Strobelt, H., Lapedriza, A., Zhou, B., and Torralba, A. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, September 2020. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1907375117. URL
  • Belinkov (2021) Belinkov, Y. Probing classifiers: Promises, shortcomings, and alternatives. arXiv preprint arXiv:2102.12452, 2021. URL
  • Belinkov & Glass (2019) Belinkov, Y. and Glass, J. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, March 2019. doi: 10.1162/tacl_a_00254. URL
  • Brown et al. (1992) Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., Lai, J. C., and Mercer, R. L. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40, 1992. URL
  • Conneau et al. (2018) Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2126–2136, Melbourne, Australia, 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1198. URL
  • Dalvi et al. (2019) Dalvi, F., Durrani, N., Sajjad, H., Belinkov, Y., Bau, A., and Glass, J. What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6309–6317, July 2019. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v33i01.33016309.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL
  • Durrani et al. (2020) Durrani, N., Sajjad, H., Dalvi, F., and Belinkov, Y. Analyzing individual neurons in pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4865–4880, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.395. URL
  • Eisner (2002) Eisner, J. Parameter estimation for probabilistic finite-state transducers. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 1–8, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073085. URL
  • Gates et al. (2019) Gates, A. J., Wood, I. B., Hetrick, W. P., and Ahn, Y.-Y. Element-centric clustering comparison unifies overlaps and hierarchy. Scientific Reports, 9(1):8574, June 2019. ISSN 2045-2322. doi: 10.1038/s41598-019-44892-y. URL
  • Giulianelli et al. (2018) Giulianelli, M., Harding, J., Mohnert, F., Hupkes, D., and Zuidema, W. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 240–248, Brussels, Belgium, 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5426. URL
  • Goldberg (2019) Goldberg, Y. Assessing BERT’s syntactic abilities. arXiv:1901.05287 [cs], January 2019.
  • Goodwin et al. (2020) Goodwin, E., Sinha, K., and O’Donnell, T. J. Probing linguistic systematicity. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1958–1969, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.177. URL
  • Gulordava et al. (2018) Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., and Baroni, M. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1195–1205, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1108. URL
  • Guo et al. (2017) Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1321–1330., Aug 2017. URL
  • Hájek (1964) Hájek, J. Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35(4):1491–1523, 1964. ISSN 0003-4851. URL
  • Hall Maudslay et al. (2020) Hall Maudslay, R., Valvoda, J., Pimentel, T., Williams, A., and Cotterell, R. A tale of a probe and a parser. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7389–7395, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.659. URL
  • Hewitt & Liang (2019) Hewitt, J. and Liang, P. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2733–2743, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1275. URL
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL
  • Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14(4):1303–1347, 2013. ISSN 1533-7928. URL
  • Holm (1979) Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979. ISSN 0303-6898. URL
  • Kádár et al. (2017) Kádár, Á., Chrupała, G., and Alishahi, A. Representation of linguistic form and function in recurrent neural networks. Computational Linguistics, 43(4):761–780, December 2017. doi: 10.1162/COLI_a_00300.
  • Karpathy et al. (2015) Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and understanding recurrent networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Workshop Proceedings, November 2015. URL
  • Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, San Diego, CA, May 2015. URL
  • Kulesza (2012) Kulesza, A. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012. ISSN 1935-8245. doi: 10.1561/2200000044. URL
  • Lakretz et al. (2019) Lakretz, Y., Kruszewski, G., Desbordes, T., Hupkes, D., Dehaene, S., and Baroni, M. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 11–20, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1002. URL
  • Li et al. (2016a) Li, J., Chen, X., Hovy, E., and Jurafsky, D. Visualizing and understanding neural models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 681–691, San Diego, California, 2016a. Association for Computational Linguistics. doi: 10.18653/v1/N16-1082. URL
  • Li et al. (2016b) Li, J., Monroe, W., and Jurafsky, D. Understanding neural networks through representation erasure. CoRR, abs/1612.08220, 2016b. URL
  • Li & Eisner (2009) Li, Z. and Eisner, J. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 40–51, Singapore, August 2009. Association for Computational Linguistics. URL
  • Linzen et al. (2016) Linzen, T., Dupoux, E., and Goldberg, Y. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535, 2016. doi: 10.1162/tacl_a_00115. URL
  • Liu et al. (2019) Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., and Smith, N. A. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1073–1094, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1112. URL
  • Lohr (2019) Lohr, S. L. Sampling: Design and Analysis. CRC Press, 2 edition, 2019. URL
  • McCarthy et al. (2018) McCarthy, A. D., Silfverberg, M., Cotterell, R., Hulden, M., and Yarowsky, D. Marrying universal dependencies and universal morphology. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pp. 91–101, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6011. URL
  • Mu & Andreas (2020) Mu, J. and Andreas, J. Compositional explanations of neurons. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 17153–17163. Curran Associates, Inc., 2020. URL
  • Murphy (2012) Murphy, K. P. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge, MA, 2012. ISBN 978-0-262-01802-9. URL
  • Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 807–814, Madison, WI, USA, June 2010. ISBN 978-1-60558-907-7.
  • Nivre et al. (2017) Nivre, J., Agić, Ž., Ahrenberg, L., Antonsen, L., Aranzabe, M. J., Asahara, M., Ateyah, L., Attia, M., Atutxa, A., Augustinus, L., Badmaeva, E., Ballesteros, M., Banerjee, E., Bank, S., Barbu Mititelu, V., Bauer, J., Bengoetxea, K., Bhat, R. A., Bick, E., Bobicev, V., Börstell, C., Bosco, C., Bouma, G., Bowman, S., Burchardt, A., Candito, M., Caron, G., Cebiroğlu Eryiğit, G., Celano, G. G. A., Cetin, S., Chalub, F., Choi, J., Cinková, S., Çöltekin, Ç., Connor, M., Davidson, E., de Marneffe, M.-C., de Paiva, V., Diaz de Ilarraza, A., Dirix, P., Dobrovoljc, K., Dozat, T., Droganova, K., Dwivedi, P., Eli, M., Elkahky, A., Erjavec, T., Farkas, R., Fernandez Alcalde, H., Foster, J., Freitas, C., Gajdošová, K., Galbraith, D., Garcia, M., Gärdenfors, M., Gerdes, K., Ginter, F., Goenaga, I., Gojenola, K., Gökırmak, M., Goldberg, Y., Gómez Guinovart, X., Gonzáles Saavedra, B., Grioni, M., Grūzītis, N., Guillaume, B., Habash, N., Hajič, J., Hajič jr., J., Hà Mỹ, L., Harris, K., Haug, D., Hladká, B., Hlaváčová, J., Hociung, F., Hohle, P., Ion, R., Irimia, E., Jelínek, T., Johannsen, A., Jørgensen, F., Kaşıkara, H., Kanayama, H., Kanerva, J., Kayadelen, T., Kettnerová, V., Kirchner, J., Kotsyba, N., Krek, S., Laippala, V., Lambertino, L., Lando, T., Lee, J., Lê Hồng, P., Lenci, A., Lertpradit, S., Leung, H., Li, C. 
Y., Li, J., Li, K., Ljubešić, N., Loginova, O., Lyashevskaya, O., Lynn, T., Macketanz, V., Makazhanov, A., Mandl, M., Manning, C., Mărănduc, C., Mareček, D., Marheinecke, K., Martínez Alonso, H., Martins, A., Mašek, J., Matsumoto, Y., McDonald, R., Mendonça, G., Miekka, N., Missilä, A., Mititelu, C., Miyao, Y., Montemagni, S., More, A., Moreno Romero, L., Mori, S., Moskalevskyi, B., Muischnek, K., Müürisep, K., Nainwani, P., Nedoluzhko, A., Nešpore-Bērzkalne, G., Nguyễn Thị, L., Nguyễn Thị Minh, H., Nikolaev, V., Nurmi, H., Ojala, S., Osenova, P., Östling, R., Øvrelid, L., Pascual, E., Passarotti, M., Perez, C.-A., Perrier, G., Petrov, S., Piitulainen, J., Pitler, E., Plank, B., Popel, M., Pretkalniņa, L., Prokopidis, P., Puolakainen, T., Pyysalo, S., Rademaker, A., Ramasamy, L., Rama, T., Ravishankar, V., Real, L., Reddy, S., Rehm, G., Rinaldi, L., Rituma, L., Romanenko, M., Rosa, R., Rovati, D., Sagot, B., Saleh, S., Samardžić, T., Sanguinetti, M., Saulīte, B., Schuster, S., Seddah, D., Seeker, W., Seraji, M., Shen, M., Shimada, A., Sichinava, D., Silveira, N., Simi, M., Simionescu, R., Simkó, K., Šimková, M., Simov, K., Smith, A., Stella, A., Straka, M., Strnadová, J., Suhr, A., Sulubacak, U., Szántó, Z., Taji, D., Tanaka, T., Trosterud, T., Trukhina, A., Tsarfaty, R., Tyers, F., Uematsu, S., Urešová, Z., Uria, L., Uszkoreit, H., Vajjala, S., van Niekerk, D., van Noord, G., Varga, V., Villemonte de la Clergerie, E., Vincze, V., Wallin, L., Washington, J. N., Wirén, M., Wong, T.-s., Yu, Z., Žabokrtský, Z., Zeldes, A., Zeman, D., and Zhu, H. Universal dependencies 2.1, 2017. URL LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019. URL
  • Peters et al. (2018) Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL
  • Pimentel et al. (2020) Pimentel, T., Valvoda, J., Hall Maudslay, R., Zmigrod, R., Williams, A., and Cotterell, R. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4609–4622, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.420. URL
  • Poliak et al. (2018) Poliak, A., Haldar, A., Rudinger, R., Hu, J. E., Pavlick, E., White, A. S., and Van Durme, B. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 67–81, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1007. URL
  • Raffel et al. (2020) Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  • Ravichander et al. (2021) Ravichander, A., Belinkov, Y., and Hovy, E. Probing the probing paradigm: Does probing accuracy entail task relevance? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 3363–3377, Online, April 2021. Association for Computational Linguistics. URL
  • Rethmeier et al. (2020) Rethmeier, N., Saxena, V. K., and Augenstein, I. TX-Ray: Quantifying and explaining model-knowledge transfer in (un-)supervised NLP. In Adams, R. P. and Gogate, V. (eds.), Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, pp. 197. AUAI Press, 2020. URL
  • Rogers et al. (2020) Rogers, A., Kovaleva, O., and Rumshisky, A. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866, 2020. doi: 10.1162/tacl_a_00349. URL
  • Rush (2020) Rush, A. Torch-Struct: Deep structured prediction library. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 335–342, Online, July 2020. Association for Computational Linguistics. URL
  • Şahin et al. (2020) Şahin, G. G., Vania, C., Kuznetsov, I., and Gurevych, I. LINSPECTOR: Multilingual probing tasks for word representations. Computational Linguistics, 46(2):335–385, 2020. doi: 10.1162/coli
  • Tang et al. (2020) Tang, G., Sennrich, R., and Nivre, J. Understanding pure character-based neural machine translation: The case of translating Finnish into English. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 4251–4262, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.375.
  • Tenney et al. (2018) Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., McCoy, R. T., Kim, N., Durme, B. V., Bowman, S. R., Das, D., and Pavlick, E. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, September 2018. URL
  • Torroba Hennigen et al. (2020) Torroba Hennigen, L., Williams, A., and Cotterell, R. Intrinsic probing through dimension selection. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 197–216, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.15. URL
  • Vig et al. (2020) Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. Investigating gender bias in language models using causal mediation analysis. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 12388–12401. Curran Associates, Inc., 2020. URL
  • Voita & Titov (2020) Voita, E. and Titov, I. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 183–196, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.14. URL
  • Vulić et al. (2020) Vulić, I., Ponti, E. M., Litschko, R., Glavaš, G., and Korhonen, A. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7222–7240, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.586. URL
  • Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
  • Wolf et al. (2020) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv:1910.03771 [cs], February 2020.
  • Zhang & Bowman (2018) Zhang, K. and Bowman, S. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 359–361, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5448. URL
  • Zou & Hastie (2005) Zou, H. and Hastie, T. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(2):301–320, 2005. ISSN 1369-7412. URL

Appendix A Variational Lower Bound

The derivation of the variational lower bound is shown below:
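The equation block itself was lost in extraction. A standard derivation of such a bound, writing C for the latent subset of dimensions, π for the morphosyntactic value, h for the representation, and q_φ for the variational distribution over subsets (this notation is assumed here), proceeds by Jensen's inequality:

```latex
\log p_{\theta}(\pi \mid \mathbf{h})
  = \log \sum_{C} p(C)\, p_{\theta}(\pi \mid \mathbf{h}, C)
  = \log \mathbb{E}_{q_{\phi}(C)}\!\left[
      \frac{p(C)\, p_{\theta}(\pi \mid \mathbf{h}, C)}{q_{\phi}(C)} \right]
  \geq \mathbb{E}_{q_{\phi}(C)}\!\left[
      \log p_{\theta}(\pi \mid \mathbf{h}, C) + \log p(C) \right]
    + \mathrm{H}\!\left(q_{\phi}\right)
```

The final term is the entropy of q_φ (equivalently, the bound can be written with a KL divergence between q_φ(C) and p(C)); this is the entropy term referred to in App. C.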


Appendix B List of Probed Morphosyntactic Attributes

The 29 language–attribute pairs we probe for in this work are listed below:

  • Arabic: Aspect, Case, Definiteness, Gender, Mood, Number, Voice

  • English: Number, Tense

  • Finnish: Case, Number, Person, Tense, Voice

  • Polish: Animacy, Case, Gender, Number, Tense

  • Portuguese: Gender, Number, Tense

  • Russian: Animacy, Aspect, Case, Gender, Number, Tense, Voice

Appendix C Training and Hyperparameter Tuning

We train our probes for a maximum of epochs using the Adam optimizer (Kingma & Ba, 2015). We add early stopping with a patience of as a regularization technique. Early stopping is conducted by holding out 10% of the training data; our development set is reserved for the greedy selection of subsets of neurons. Our implementation is built with PyTorch (Paszke et al., 2019). To execute a fair comparison with Dalvi et al. (2019), we train all probes other than the Gaussian probe using ElasticNet regularization (Zou & Hastie, 2005), which combines L1 and L2 regularization, each weighted by a tunable coefficient. We follow the experimental set-up proposed by Dalvi et al. (2019) for all probes. In a preliminary experiment, we performed a grid search over these hyperparameters to confirm that the probe is not very sensitive to the tuning of these values (unless they are extreme), which aligns with the claim presented in Dalvi et al. (2019). For the Gaussian probe, we take the MAP estimate, with a weak data-dependent prior (Murphy, 2012, Chapter 4). In addition, we found that a slight improvement in the performance of Poisson and Conditional Poisson was obtained by scaling the entropy term in eq. 3 by a constant factor.
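Concretely, the ElasticNet regularizer described above is a weighted sum of the L1 norm and the squared L2 norm of the probe weights; a minimal sketch (the argument names are ours):

```python
import numpy as np

def elastic_net_penalty(weights, l1_coef, l2_coef):
    """ElasticNet regularizer added to the probe's training loss.

    weights: any array-like of probe parameters
    l1_coef: weight on the L1 term (encourages sparsity)
    l2_coef: weight on the squared-L2 term (shrinks weights smoothly)
    """
    w = np.asarray(weights, dtype=float).ravel()
    return l1_coef * np.abs(w).sum() + l2_coef * (w ** 2).sum()
```

Setting either coefficient to zero recovers pure L2 or pure L1 (lasso-style) regularization as special cases.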

Appendix D Supplementary Results

Tab. 3 compares the accuracy of our two models, Poisson and Conditional Poisson, to the Linear, Gaussian and Upper Bound baselines. The table reflects the trend observed in Tab. 2: Poisson and Conditional Poisson generally outperform the Linear baseline. However, the Gaussian probe achieves higher accuracy, except in the high-dimensionality regime.

Table 3: Mean and standard deviation of accuracy for the Poisson, Conditional Poisson, Linear (Dalvi et al., 2019) and Gaussian (Torroba Hennigen et al., 2020) probes for all language–attribute pairs (above) and for the Conditional Poisson, Poisson and Upper Bound for 6 selected language–attribute pairs (below) for each of the subset sizes. We sampled 100 different subsets of BERT dimensions at random.
Figure 5: The percentage overlap between the top-30 most informative gender (left) and case (right) dimensions in BERT for the probed languages. Statistically significant overlap, after Holm–Bonferroni family-wise error correction (Holm, 1979), with , is marked with an orange square.