1 Introduction
There have been considerable improvements to the quality of pretrained contextualized representations in recent years (e.g., Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2020). These advances have sparked interest in understanding what linguistic information may be lurking within the representations themselves (Poliak et al., 2018; Zhang & Bowman, 2018; Rogers et al., 2020, inter alia). One approach that has been proposed to extract this information is probing, the task of training an external classifier to predict the linguistic property of interest directly from the representations. The hope of probing is that it sheds light on how much linguistic knowledge is present in representations and, perhaps, how that information is structured. Probing has grown into a fruitful area of research, with researchers probing for morphological (Tang et al., 2020; Ács et al., 2021), syntactic (Voita & Titov, 2020; Hall Maudslay et al., 2020; Ács et al., 2021), and semantic (Vulić et al., 2020; Tang et al., 2020) information.

In this paper, we focus on one type of probing known as intrinsic probing (Dalvi et al., 2019; Torroba Hennigen et al., 2020), a subset of which specifically aims to ascertain how information is structured within a representation. This means that we are not solely interested in determining whether a network encodes the tense of a verb, but also in pinpointing exactly which neurons in the network are responsible for encoding the property. Unfortunately, the naïve formulation of intrinsic probing requires one to analyze all possible combinations of neurons, which is intractable even for the smallest representations used in modern-day NLP. For example, analyzing all combinations of 768-dimensional BERT word representations would require us to train $2^{768}$ different probes, one for each combination of neurons, which far exceeds the estimated number of atoms in the observable universe.
To obviate this difficulty, we introduce a novel latent-variable probe for discriminative intrinsic probing. The core idea of this approach is that, instead of training a different probe for each combination of neurons, we introduce a subset-valued latent variable. We approximately marginalize over the latent subsets using variational inference. Training the probe in this manner results in a set of parameters that works well across all possible subsets. We propose two variational families to model the posterior over the latent subset-valued random variables, both based on common sampling designs: Poisson sampling, which selects each neuron based on an independent Bernoulli trial, and conditional Poisson sampling, which first samples a subset size from a uniform distribution and then samples a subset of neurons of that size (Lohr, 2019). Conditional Poisson sampling offers the modeler more control over the distribution over subset sizes; they may pick the parametric distribution themselves.

We compare both variants to the two main intrinsic probing approaches we are aware of in the literature (§ 5). To do so, we train probes for 29 morphosyntactic properties across 6 languages (English, Portuguese, Polish, Russian, Arabic, and Finnish) from the Universal Dependencies (UD; Nivre et al., 2017) treebanks. We show that, in general, both variants of our method yield tighter estimates of the mutual information, though the model based on conditional Poisson sampling yields slightly better performance. This suggests that they are better at quantifying the informational content encoded in mBERT representations (Devlin et al., 2019). We make two typological findings when applying our probe. First, we show that there is a difference in how information is structured depending on the language, with certain language–attribute pairs requiring more dimensions to encode the relevant information. Second, we analyze whether neural representations are able to learn cross-lingual abstractions from multilingual corpora. We find evidence that they are, observing a strong overlap in the most informative dimensions across languages, especially for number (Fig. 1). In an additional experiment, we show that our method supports training deeper probes (§ 5.4), though the advantages of non-linear probes over their linear counterparts are modest.
2 Intrinsic Probing
The success of pretrained contextual representations such as BERT (Devlin et al., 2019) suggests that they may offer a continuous analogue of the discrete structures in language, such as the morphosyntactic attributes number, case, and tense. Intrinsic probing aims to recognize the parts of a network (assuming they exist) which encode such structures. In this paper, we operate exclusively at the level of the neuron—in the case of BERT, a neuron is one component of the 768-dimensional vector the model outputs. However, our approach generalizes easily to other settings, e.g., the layers in a transformer or the filters of a convolutional neural network. Identifying the individual neurons responsible for encoding linguistic features of interest has previously been shown to increase model transparency (Bau et al., 2019). In fact, knowledge about which neurons encode certain properties has also been employed to mitigate potential biases (Vig et al., 2020), for controllable text generation (Bau et al., 2019), and to analyze the linguistic capabilities of language models (Lakretz et al., 2019).

To formally describe our intrinsic probing framework, we first introduce some notation. We define $\mathcal{Z}$ to be the set of values that some property of interest can take, e.g., $\{\textsc{Sg}, \textsc{Pl}\}$ for the morphosyntactic number attribute. Let $\mathcal{D} = \{(z^{(n)}, \mathbf{h}^{(n)})\}_{n=1}^{N}$ be a dataset of label–representation pairs, where each $z^{(n)} \in \mathcal{Z}$ is a linguistic property and each $\mathbf{h}^{(n)} \in \mathbb{R}^{d}$ is a representation. Additionally, let $D$ be the set of all neurons in a representation; in our setup, it is an integer range $D = \{1, \ldots, d\}$. In the case of BERT, we have $d = 768$. Given a subset of dimensions $C \subseteq D$, we write $\mathbf{h}_{C}$ for the subvector of $\mathbf{h}$ which contains only the dimensions present in $C$.
Let $p_{\theta}(z \mid \mathbf{h}_{C})$ be a probe—a classifier trained to predict a property $z$ from a subvector $\mathbf{h}_{C}$. In intrinsic probing, our goal is to find the size-$k$ subset of neurons $C \subseteq D$ which is most informative about the property of interest. This may be written as the following combinatorial optimization problem (Torroba Hennigen et al., 2020):

$$C^{\star} = \operatorname*{argmax}_{C \subseteq D,\; |C| = k} \; \sum_{n=1}^{N} \log p_{\theta}\!\left(z^{(n)} \mid \mathbf{h}^{(n)}_{C}\right) \tag{1}$$
To exhaustively solve eq. 1, we would have to train a probe for every one of the exponentially many subsets of size $k$. Thus, exactly solving eq. 1 is infeasible, and we are forced to rely on an approximate solution, e.g., greedily selecting the dimension that maximizes the objective. However, greedy selection alone is not enough to make solving eq. 1 manageable, because we must retrain the probe for every subset considered during the greedy selection procedure. As an example, consider what would happen if one used a greedy selection scheme to find the 50 most informative dimensions for a property on 768-dimensional BERT representations. To select the first dimension, one would need to train 768 probes. To select the second dimension, one would train an additional 767, and so forth. After 50 dimensions, one would have trained 37,893 probes. To address this problem, our paper introduces a latent-variable probe, which identifies a single set of probe parameters that can be used for any combination of neurons under consideration, allowing a greedy selection procedure to work in practice.
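The greedy loop enabled by a shared probe can be sketched as follows. Here `score` is a stand-in for evaluating a single already-trained probe on a candidate subset (e.g., its held-out log-likelihood); the function and the toy scores are illustrative, not the authors' implementation.

```python
def greedy_select(score, num_dims, k):
    """Greedily grow a subset of k dimensions.

    `score(C)` evaluates one shared probe on subset C (e.g., held-out
    log-likelihood with dimensions outside C zeroed); no retraining occurs.
    """
    selected = []
    for _ in range(k):
        remaining = [d for d in range(num_dims) if d not in selected]
        # Pick the dimension whose addition maximizes the objective (eq. 1).
        best = max(remaining, key=lambda d: score(selected + [d]))
        selected.append(best)
    return selected

# Toy score: dimensions 3, 1, and 4 are the informative ones.
informativeness = {3: 3.0, 1: 2.0, 4: 1.0}
toy_score = lambda C: sum(informativeness.get(d, 0.0) for d in C)
print(greedy_select(toy_score, num_dims=6, k=3))  # [3, 1, 4]
```

With a shared probe, nothing is trained inside the inner `max`; under naïve retraining, every `score` call would instead cost one full training run.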
3 A Latent-Variable Probe
The technical contribution of this work is a novel latent-variable model for intrinsic probing. Our method starts with a generic probabilistic probe $p_{\theta}(z \mid \mathbf{h}_{C})$, which predicts a linguistic attribute $z$ given a subset $C$ of the hidden dimensions; $C$ is used to subset $\mathbf{h}$ into $\mathbf{h}_{C}$. To avoid training a unique probe for every possible subset $C$, we propose to integrate a prior over subsets into the model and then to marginalize out all possible subsets of neurons:
$$p_{\theta}(z \mid \mathbf{h}) = \sum_{C \subseteq D} p_{\theta}(z, C \mid \mathbf{h}) = \sum_{C \subseteq D} p_{\theta}(z \mid \mathbf{h}_{C})\, p(C) \tag{2}$$
Due to this marginalization, our likelihood does not depend on any specific subset of neurons $C$. Throughout this paper we opt for a non-informative, uniform prior $p(C)$, but other distributions are also possible.
Our goal is to estimate the model parameters $\theta$. We achieve this by maximizing the log-likelihood of the training data with respect to $\theta$. Unfortunately, computing this directly involves a sum over all possible subsets of $D$—a sum with an exponential number of summands. Thus, we resort to a variational approximation. Let $q_{\phi}(C)$ be a distribution over subsets, parameterized by parameters $\phi$; we will use $q_{\phi}$ to approximate the true posterior distribution. Then, the log-likelihood is lower-bounded as follows:

$$\sum_{n=1}^{N} \log p_{\theta}\!\left(z^{(n)} \mid \mathbf{h}^{(n)}\right) \;\geq\; \sum_{n=1}^{N} \mathbb{E}_{C \sim q_{\phi}}\!\left[\log\!\left(p_{\theta}\!\left(z^{(n)} \mid \mathbf{h}^{(n)}_{C}\right) p(C)\right)\right] + N\, \mathrm{H}(q_{\phi}) \tag{3}$$

which follows from Jensen's inequality, where $\mathrm{H}(q_{\phi})$ is the entropy of $q_{\phi}$; see App. A for the full derivation.
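The bound can be verified numerically on a problem small enough to enumerate every subset. The per-subset likelihood values and the variational distribution below are arbitrary stand-ins, not outputs of an actual probe.

```python
import itertools
import math

d = 3
subsets = [frozenset(s) for r in range(d + 1)
           for s in itertools.combinations(range(d), r)]
prior = 1.0 / len(subsets)                  # uniform p(C) over all 2^d subsets

# Arbitrary per-subset likelihoods p(z | h_C) for one fixed (z, h) pair.
lik = {C: 0.1 + 0.8 * len(C) / d for C in subsets}

# Exact log-likelihood: log of the marginal in eq. 2.
log_lik = math.log(sum(lik[C] * prior for C in subsets))

# Any variational distribution q(C); this one is biased toward large subsets.
w = {C: len(C) + 1.0 for C in subsets}
Z = sum(w.values())
q = {C: w[C] / Z for C in subsets}

# ELBO for one datapoint: E_q[log p(z | h_C) + log p(C)] + H(q).
elbo = (sum(q[C] * (math.log(lik[C]) + math.log(prior)) for C in subsets)
        + sum(-q[C] * math.log(q[C]) for C in subsets))

assert elbo <= log_lik    # Jensen's inequality holds for any q
```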
Our likelihood term is general and can take the form of any objective function. This means that we can use this approach to train intrinsic probes with any type of architecture amenable to gradient-based optimization, e.g., neural networks. However, in this paper, we use a linear classifier unless stated otherwise. Further, note that eq. 3 is valid for any choice of $q_{\phi}$. We explore two variational families for $q_{\phi}$, each based on a common sampling technique. The first (herein Poisson) applies Poisson sampling (Hájek, 1964), which subjects each neuron to an independent Bernoulli trial. The second (herein Conditional Poisson; Aires, 1999) corresponds to conditional Poisson sampling, which can be defined as conditioning a Poisson sample on a fixed sample size.

3.1 Parameter Estimation
As mentioned above, exact computation of the log-likelihood is intractable due to the sum over all possible subsets of $D$. Thus, we optimize the variational bound presented in eq. 3. We optimize the bound through stochastic gradient descent with respect to the model parameters $\theta$ and the variational parameters $\phi$, a technique known as stochastic variational inference (Hoffman et al., 2013). However, one final trick is necessary, since the variational bound still includes a sum over all subsets in the first term:

$$\mathbb{E}_{C \sim q_{\phi}}\!\left[\log p_{\theta}\!\left(z \mid \mathbf{h}_{C}\right)\right] \approx \frac{1}{M} \sum_{m=1}^{M} \log p_{\theta}\!\left(z \mid \mathbf{h}_{C^{(m)}}\right), \qquad C^{(m)} \sim q_{\phi} \tag{4}$$
where we take $M$ Monte Carlo samples to approximate the sum. In the case of the gradient with respect to $\phi$, we also have to apply the REINFORCE trick (Williams, 1992):

$$\nabla_{\phi}\, \mathbb{E}_{C \sim q_{\phi}}\!\left[\log p_{\theta}\!\left(z \mid \mathbf{h}_{C}\right)\right] \approx \frac{1}{M} \sum_{m=1}^{M} \log p_{\theta}\!\left(z \mid \mathbf{h}_{C^{(m)}}\right) \nabla_{\phi} \log q_{\phi}\!\left(C^{(m)}\right), \qquad C^{(m)} \sim q_{\phi} \tag{5}$$

where we again take $M$ Monte Carlo samples. This procedure leads to an unbiased estimate of the gradient of the variational approximation.
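A minimal sketch of the estimator in eq. 5, assuming a variational family of independent sigmoid-parameterized Bernoulli inclusions (the Poisson family of § 3.2); the logits and the stand-in for $\log p_{\theta}(z \mid \mathbf{h}_{C})$ are illustrative. For small $d$, the Monte Carlo estimate can be checked against the exact gradient obtained by enumeration.

```python
import itertools
import math
import random

random.seed(0)
d, M = 3, 200_000
phi = [0.3, -0.5, 1.0]                      # variational logits (illustrative)
pi = [1 / (1 + math.exp(-p)) for p in phi]  # per-neuron inclusion probabilities

# Stand-in for log p(z | h_C): any bounded function of the subset works here.
def f(C):
    return sum(i + 1 for i in C) / d

# Exact gradient of E_q[f(C)] w.r.t. phi[0], by enumerating all 2^d subsets.
exact = 0.0
for r in range(d + 1):
    for C in itertools.combinations(range(d), r):
        qC = math.prod(pi[i] if i in C else 1 - pi[i] for i in range(d))
        # For independent sigmoid-Bernoullis: d log q / d phi[0] = 1{0 in C} - pi[0].
        exact += f(C) * qC * ((0 in C) - pi[0])

# REINFORCE (eq. 5): average f(C) * d log q(C) / d phi[0] over M sampled subsets.
est = 0.0
for _ in range(M):
    C = [i for i in range(d) if random.random() < pi[i]]
    est += f(C) * ((0 in C) - pi[0])
est /= M

assert abs(est - exact) < 0.05   # the estimator is unbiased, so it concentrates
```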
3.2 Choice of Variational Family
We consider two choices of variational family $q_{\phi}$, both based on sampling designs (Lohr, 2019). Each defines a parameterized distribution over all subsets of $D$.
Poisson Sampling.
Poisson sampling is one of the simplest sampling designs. In our setting, each neuron $i$ is given a unique non-negative weight $w_{i}$. This gives us the following parameterized distribution over subsets:

$$q_{\phi}(C) = \frac{\prod_{i \in C} w_{i}}{\prod_{i=1}^{d} \left(1 + w_{i}\right)} \tag{6}$$
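Because the distribution in eq. 6 factorizes over neurons, sampling from it, computing its entropy, and computing the expected subset size are all $\mathcal{O}(d)$ operations; a minimal sketch, with illustrative weights rather than values from our experiments:

```python
import math
import random

def poisson_sample(w, rng):
    """Draw C ~ q(C): one independent coin flip per neuron, with
    probability of heads w_i / (1 + w_i)."""
    return {i for i, wi in enumerate(w) if rng.random() < wi / (1 + wi)}

def poisson_entropy(w):
    """H(q) in O(d): the sum of the per-neuron Bernoulli entropies."""
    H = 0.0
    for wi in w:
        p = wi / (1 + wi)
        for x in (p, 1 - p):
            if x > 0.0:
                H -= x * math.log(x)
    return H

w = [0.5, 1.0, 3.0]                              # illustrative neuron weights
C = poisson_sample(w, random.Random(0))          # a random subset of {0, 1, 2}
expected_size = sum(wi / (1 + wi) for wi in w)   # E[|C|], here about 1.58
# Uniform weights w_i = 1 give d fair coins, so H = d * log 2.
assert abs(poisson_entropy([1.0, 1.0]) - 2 * math.log(2)) < 1e-9
```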
The formulation in eq. 6 shows that taking a sample corresponds to $d$ independent coin flips—one for each neuron—where the probability of heads is $\pi_{i} = \frac{w_{i}}{1 + w_{i}}$. The entropy of a Poisson sampling design may be computed in $\mathcal{O}(d)$ time:

$$\mathrm{H}(q_{\phi}) = -\sum_{i=1}^{d} \left(\pi_{i} \log \pi_{i} + \left(1 - \pi_{i}\right) \log\left(1 - \pi_{i}\right)\right) \tag{7}$$

The gradient of eq. 7 may be computed automatically through backpropagation. Poisson sampling automatically modulates the size of the sampled set, whose expected value is $\mathbb{E}\left[|C|\right] = \sum_{i=1}^{d} \pi_{i}$.

Conditional Poisson Sampling.
We also consider a variational family $q_{\phi}$ that factors as follows:
$$q_{\phi}(C) = q\!\left(|C|\right)\; q_{\phi}\!\left(C \mid |C|\right) \tag{8}$$
In this paper, we take $q(|C|)$ to be uniform, but a more complex distribution, e.g., a Categorical, could be learned. We define $q_{\phi}(C \mid |C|)$ as a conditional Poisson sampling design. Similarly to Poisson sampling, conditional Poisson sampling starts with a unique positive weight $w_{i}$ associated with every neuron $i$. However, an additional cardinality constraint is introduced. This leads to the following distribution:

$$q_{\phi}\!\left(C \mid |C| = k\right) = \frac{\prod_{i \in C} w_{i}}{\sum_{C' \subseteq D,\, |C'| = k} \prod_{i \in C'} w_{i}} \tag{9}$$
The normalizer of eq. 9 may be computed efficiently with a dynamic program that runs in $\mathcal{O}(k \cdot d)$ time (Aires, 1999). We may further compute the entropy $\mathrm{H}(q_{\phi})$ and its gradient efficiently using the expectation semiring (Eisner, 2002; Li & Eisner, 2009); in practice, we use the semiring implementations by Rush (2020). Sampling from $q_{\phi}$ can be done efficiently using quantities computed while running the dynamic program (Kulesza, 2012).
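The normalizer of a conditional Poisson design is an elementary symmetric polynomial of the weights, which the standard $\mathcal{O}(k \cdot d)$ recurrence computes. The sketch below (with illustrative weights, and not the semiring-based implementation used in our experiments) checks the recurrence against brute-force enumeration:

```python
import itertools
import math

def esp_table(w, k):
    """Elementary symmetric polynomials e_j(w_1..w_i) for all j <= k, via
    the O(d*k) recurrence e[i][j] = e[i-1][j] + w_i * e[i-1][j-1]."""
    d = len(w)
    e = [[0.0] * (k + 1) for _ in range(d + 1)]
    for i in range(d + 1):
        e[i][0] = 1.0
    for i in range(1, d + 1):
        for j in range(1, k + 1):
            e[i][j] = e[i - 1][j] + w[i - 1] * e[i - 1][j - 1]
    return e

def cond_poisson_prob(C, w, k, e):
    """q(C | |C| = k) = prod_{i in C} w_i / e_k(w), as in eq. 9."""
    assert len(C) == k
    return math.prod(w[i] for i in C) / e[len(w)][k]

w, k = [0.5, 1.0, 2.0, 4.0], 2
e = esp_table(w, k)
# Brute-force check of the normalizer: e_k(w) = sum over |C| = k of prod w_i.
brute = sum(math.prod(w[i] for i in C)
            for C in itertools.combinations(range(len(w)), k))
assert abs(e[len(w)][k] - brute) < 1e-12
# All size-k subsets receive a properly normalized probability.
total = sum(cond_poisson_prob(C, w, k, e)
            for C in itertools.combinations(range(len(w)), k))
assert abs(total - 1.0) < 1e-12
```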
4 Experimental Setup
Our setup is virtually identical to the morphosyntactic probing setup of Torroba Hennigen et al. (2020). This consists of first automatically mapping treebanks from UD v2.1 (Nivre et al., 2017) to the UniMorph schema (McCarthy et al., 2018), using the code available at https://github.com/unimorph/ud-compatibility. Then, we compute multilingual BERT (mBERT) representations (using the implementation by Wolf et al., 2020) for every sentence in the UD treebanks. After computing the mBERT representations for the entire sentence, we extract representations for individual words in the sentence and pair them with the UniMorph morphosyntactic annotations. We estimate our probes' parameters on the UD training set and conduct greedy selection to approximate the objective in eq. 1 on the validation set; finally, we report results on the test set, i.e., we test whether the set of neurons found on the development set generalizes to held-out data. Additionally, we discard property values that occur fewer than 20 times across splits. When feeding $\mathbf{h}_{C}$ as input to our probes, we set any dimensions not present in $C$ to zero. We select the number of Monte Carlo samples $M$ based on small-scale experiments, in which we found our choice to work adequately. We compare the performance of the probes on 29 different language–attribute pairs (listed in App. B).
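The zero-masking scheme can be sketched in a few lines; the vector and subset below are illustrative:

```python
def mask_to_subset(h, C):
    """Keep only the dimensions in C; zero out the rest, so the probe
    always receives a full-length input vector."""
    return [hi if i in C else 0.0 for i, hi in enumerate(h)]

h = [0.2, -1.3, 0.7, 0.9]
assert mask_to_subset(h, {1, 3}) == [0.0, -1.3, 0.0, 0.9]
```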
Since the performance of a probe on a specific subset of dimensions is related both to the subset itself (e.g., whether it is informative or not) and to the number of dimensions being evaluated (e.g., if a probe is trained to expect 768 dimensions as input, it might work best when few or no dimensions are zeroed out), we sample 100 subsets of dimensions at 5 different sizes (10, 50, 100, 250, and 500 dimensions) and compare every model's performance at each of those subset sizes. Further details about training and hyperparameter settings are provided in App. C.

4.1 Baselines
We compare our latent-variable probe against two other recently proposed intrinsic probing methods as baselines.

Torroba Hennigen et al. (2020): Our first baseline is a generative probe that models the joint distribution of representations and their properties, $p_{\theta}(\mathbf{h}_{C}, z)$, where the distribution over representations is assumed to be Gaussian. Torroba Hennigen et al. (2020) report that a major limitation of this probe is that if certain dimensions of the representations are not distributed according to a Gaussian distribution, then probe performance suffers.

Dalvi et al. (2019): Our second baseline is a linear classifier, where dimensions not under consideration are zeroed out during evaluation (Dalvi et al., 2019; Durrani et al., 2020). (We note that they do not conduct intrinsic probing via dimension selection: instead, they use the absolute magnitude of the weights as a proxy for dimension importance. In this paper, we adopt the approach of Torroba Hennigen et al. (2020) and use the performance-based objective in eq. 1.) Their approach is a special case of our proposed latent-variable model, where $q(C)$ is fixed to place all of its mass on $C = D$, so that on every training iteration the entire set of dimensions is sampled.
Additionally, we compare our methods to a naïve approach: a probe that is retrained for every subset of dimensions considered during greedy selection (herein Upper Bound). The Upper Bound yields the tightest estimate of the mutual information; however, as mentioned in § 2, this is infeasible in general since it requires retraining for every different combination of neurons. For comparison: for English number, on an Nvidia RTX 2070 GPU, our Poisson, Gaussian, and Linear experiments take seconds to minutes to run, whereas Upper Bound takes multiple hours. Due to this computational cost, we limit our comparisons with Upper Bound to 6 randomly chosen morphosyntactic attributes (English–Number, Portuguese–Gender and Noun Class, Polish–Tense, Russian–Voice, Arabic–Case, Finnish–Tense), each in a different language.
4.2 Metrics
We compare our proposed method to the baselines above under two metrics: accuracy and mutual information (MI). Mutual information has recently been proposed as an evaluation metric for probes (Pimentel et al., 2020). Here, the MI is a function between a $\mathcal{Z}$-valued random variable $Z$ and an $\mathbb{R}^{|C|}$-valued random variable $\mathbf{H}_{C}$ over masked representations:

$$\mathrm{I}\!\left(Z; \mathbf{H}_{C}\right) = \mathrm{H}(Z) - \mathrm{H}\!\left(Z \mid \mathbf{H}_{C}\right) \tag{10}$$

where $\mathrm{H}(Z)$ is the inherent entropy of the property being probed and is constant with respect to $C$, and $\mathrm{H}(Z \mid \mathbf{H}_{C})$ is the entropy over the property given the representations. Exact computation of the mutual information is intractable; however, we can lower-bound the MI by approximating $\mathrm{H}(Z \mid \mathbf{H}_{C})$ with our probe's average negative log-likelihood on held-out data. See Brown et al. (1992) for a derivation.
We normalize the mutual information (NMI) by dividing the MI by the entropy $\mathrm{H}(Z)$, which turns it into a percentage and is, arguably, more interpretable. We refer the reader to Gates et al. (2019) for a discussion of the normalization of MI.
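The NMI lower bound can be computed directly from a probe's held-out negative log-likelihoods; a minimal sketch, where the label set and losses are illustrative:

```python
import math
from collections import Counter

def nmi_lower_bound(labels, neg_log_likelihoods):
    """Estimate NMI = (H(Z) - H(Z | H_C)) / H(Z), where H(Z | H_C) is
    replaced by the probe's average held-out negative log-likelihood.
    Since the cross-entropy upper-bounds H(Z | H_C), the returned value
    lower-bounds the true NMI."""
    n = len(labels)
    counts = Counter(labels)
    h_z = -sum(c / n * math.log(c / n) for c in counts.values())
    h_z_given_h = sum(neg_log_likelihoods) / n   # cross-entropy estimate
    return (h_z - h_z_given_h) / h_z

# A perfectly confident, correct probe (NLL -> 0) recovers NMI = 1.
labels = ["SG", "PL", "SG", "PL"]
assert abs(nmi_lower_bound(labels, [0.0] * 4) - 1.0) < 1e-9
# A probe no better than the label marginal gives NMI = 0.
assert abs(nmi_lower_bound(labels, [math.log(2)] * 4)) < 1e-9
```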
We also report accuracy, which is a standard measure for evaluating probes, as it is for evaluating classifiers in general. However, accuracy can be a misleading measure, especially on imbalanced datasets, since it considers solely whether predictions are correct.
4.3 What Makes a Good Probe?
Since we report a lower bound on the mutual information (§ 4.2), we deem the best probe to be the one that yields the tightest, i.e., highest, mutual information estimate; this is equivalent to achieving the best cross-entropy on held-out data, which is the standard evaluation metric for language modeling.
However, in the context of intrinsic probing, the topic of primary interest is what the probe reveals about the structure of the representations. For instance, does the probe reveal that the information encoded in the embeddings is focalized or dispersed across many neurons? Several prior works (e.g., Lakretz et al., 2019) focus on the single-neuron setting, which is a special, highly focal case. To engage with this prior work, we compare probes not only with respect to their performance (MI and accuracy), but also with respect to the number of dimensions being evaluated, i.e., the size of the set $C$.
We acknowledge that there is a disparity between the quantitative evaluation we employ, in which probes are compared based on their MI estimates, and the qualitative nature of intrinsic probing, which aims to identify the substructures of a model that encode a property of interest. However, it is non-trivial to evaluate fundamentally qualitative procedures in a large-scale, systematic, and unbiased manner. Therefore, we rely on the quantitative evaluation metrics presented in § 4.2, while also qualitatively inspecting the implications of our probes.
5 Results
In this section, we present the results of our empirical investigation. First, we address our main research question: does our latent-variable probe presented in § 3 outperform previously proposed intrinsic probing methods (§ 5.1)? Second, we analyze the structure of the most informative mBERT neurons for the different morphosyntactic attributes we probe for (§ 5.2). Finally, we investigate whether knowledge about morphosyntax encoded in neural representations is shared across languages (§ 5.3). In § 5.4, we show that our latent-variable probe is flexible enough to support deep neural probes.
5.1 How Do Our Methods Perform?
The main question we ask is how the performance of our models compares to existing intrinsic probing approaches. To investigate this research question, we compare the performance of the Poisson and Conditional Poisson probes to Linear (Dalvi et al., 2019) and Gaussian (Torroba Hennigen et al., 2020). Refer to § 4.3 for a discussion of the limitations of our method.
Number of dimensions   10     50     100    250    500

Gaussian
  C. Poisson           0.50   0.58   0.70   0.99   1.00
  Poisson              0.21   0.49   0.66   0.98   1.00
Linear
  C. Poisson           0.99   1.00   1.00   1.00   0.98
  Poisson              0.95   0.99   1.00   1.00   0.97
[Tab. 2]
Mean and standard deviation of NMI for the Poisson, Conditional Poisson, Linear (Dalvi et al., 2019), and Gaussian (Torroba Hennigen et al., 2020) probes for all language–attribute pairs (top), and mean and standard deviation of NMI for the Conditional Poisson, Poisson, and Upper Bound probes for 6 selected language–attribute pairs (bottom). For each subset size considered, we take our averages over 100 randomly sampled subsets of BERT dimensions.

In general, Conditional Poisson tends to outperform Poisson at lower dimensionalities; however, Poisson tends to catch up as more dimensions are added. Our results suggest that both variants of our latent-variable model from § 3 are effective and generally outperform the Linear baseline, as shown in Tab. 1. The Gaussian baseline tends to perform similarly to Conditional Poisson when we consider subsets of 10 dimensions, and it outperforms Poisson substantially. However, for larger subsets, both Conditional Poisson and Poisson are preferable. We believe that the robust performance of Gaussian in the low-dimensional regime can be attributed to its ability to model non-linear decision boundaries (Murphy, 2012, Chapter 4).
The trends above are corroborated by a comparison of the mean NMI (Tab. 2, top) achieved by each of these probes for different subset sizes. However, in terms of accuracy (see Tab. 3 in App. D), while both Conditional Poisson and Poisson generally outperform Linear, Gaussian tends to achieve higher accuracy than our methods. Notwithstanding, Gaussian's performance in terms of NMI is not stable and can yield low or even negative mutual information estimates across all subsets of dimensions. Since adding a new dimension can never decrease the true mutual information, the observed decreases occur because the generative model deteriorates upon adding another dimension, which validates Torroba Hennigen et al.'s (2020) claim that some dimensions are not adequately modeled by the Gaussian assumption. While these results suggest that Gaussian may be preferable when comparing probes based on accuracy, its instability under NMI suggests that this edge in accuracy comes at a hefty cost in calibration (Guo et al., 2017): whereas accuracy only considers whether predictions are correct, NMI penalizes miscalibrated predictions, since it is proportional to the negative log-likelihood.
Further, we compare the Poisson and Conditional Poisson probes to the Upper Bound baseline. This is expected to be the highest-performing probe, since it is retrained for every subset under consideration, and indeed this is confirmed by the results in Tab. 2 (bottom). The difference between our probes' performance and Upper Bound's can be seen as the cost of sharing parameters across all subsets of dimensions; an effective intrinsic probe should minimize this cost.
We also conduct a direct comparison of Linear, Gaussian, Poisson, and Conditional Poisson when used to identify the most informative subsets of dimensions. The average MI reported by each model across all 29 morphosyntactic language–attribute pairs is presented in Fig. 2. On average, Conditional Poisson offers performance comparable to Gaussian at low dimensionalities for both NMI and accuracy, though the latter tends to yield a slightly higher (and thus tighter) bound on the mutual information. However, as more dimensions are taken into consideration, our models vastly outperform Gaussian. Poisson and Conditional Poisson perform comparably at high dimensionalities, but Conditional Poisson performs slightly better for 1–20 dimensions. Poisson outperforms Linear at high dimensionalities, and Conditional Poisson outperforms Linear for all dimensionalities considered. These effects are less pronounced for accuracy, which we believe to be due to accuracy's insensitivity to a probe's confidence in its predictions.
5.2 Information Distribution
We compare the performance of the Conditional Poisson probe for each attribute across all available languages, in order to better understand the relatively high variance in NMI across results (see Tab. 2). In Fig. 3 we plot the average NMI for gender and observe that languages with two genders (Arabic and Portuguese) achieve higher performance than languages with three genders (Russian and Polish), an intuitive result given the increased task complexity. Further, we see that the slopes for both Russian and Polish are flatter, especially at lower dimensionalities. This implies that the information for Russian and Polish is more dispersed, and more dimensions are needed to capture the typological information.

5.3 Cross-Lingual Overlap
We compare the most informative mBERT dimensions recovered by our probe across languages and find that, in many cases, the same set of neurons may express the same morphosyntactic phenomena across languages. For example, we find that Russian, Polish, Portuguese, English, and Arabic all have statistically significant overlap in the top-30 most informative neurons for number (Fig. 1). Similarly, we observe statistically significant overlap for gender (Fig. 5, left). This effect is particularly strong between Russian and Polish, where we additionally find statistically significant overlap between the top-30 neurons for case (Fig. 5, right). These results might indicate that BERT leverages data from other languages to develop a cross-lingually entangled notion of morphosyntax (Torroba Hennigen et al., 2020), an effect that may be particularly strong between typologically similar languages. In concurrent work, Antverg & Belinkov (2021) find evidence supporting a similar phenomenon.
5.4 How Do Deeper Probes Perform?
Multiple papers have promoted the use of linear probes (Tenney et al., 2018; Liu et al., 2019), in part because they are ostensibly less likely to memorize patterns in the data (Zhang & Bowman, 2018; Hewitt & Liang, 2019), though this is subject to debate (Voita & Titov, 2020; Pimentel et al., 2020). Here we verify our claim from § 3 that our probe can be applied to any kind of discriminative probe architecture, since our objective function can be optimized using gradient descent.
We follow the setup of Hewitt & Liang (2019) and test MLP-1 and MLP-2 Conditional Poisson probes alongside a linear Conditional Poisson probe. The MLP-1 and MLP-2 probes are multilayer perceptrons (MLPs) with one and two hidden layers, respectively, and a rectified linear unit activation function (ReLU; Nair & Hinton, 2010).

In Fig. 4, we can see that our method not only works well for deeper probes, but that these also outperform the linear probe in terms of NMI. However, at higher dimensionalities, the advantage of a deeper probe diminishes. We also find that the difference in performance between MLP-1 and MLP-2 is negligible.
6 Related Work
A growing interest in interpretability has led to a flurry of work seeking to assess exactly what pretrained representations know about language. To this end, diverse methods have been employed, such as the construction of challenge sets that evaluate how well representations model particular phenomena (Linzen et al., 2016; Gulordava et al., 2018; Goldberg, 2019; Goodwin et al., 2020), and visualization methods (Kádár et al., 2017; Rethmeier et al., 2020). Work on probing comprises a major share of this endeavor (Belinkov & Glass, 2019; Belinkov, 2021). This has ranged from focused studies on particular linguistic phenomena (e.g., subject–verb number agreement; Giulianelli et al., 2018) to broad assessments of contextual representations on a wide array of tasks (Şahin et al., 2020; Tenney et al., 2018; Conneau et al., 2018; Liu et al., 2019; Ravichander et al., 2021, inter alia).
Efforts have ranged widely, but most of these focus on extrinsic rather than intrinsic probing. Most work on the latter has focused primarily on ascribing roles to individual neurons through methods such as visualization (Karpathy et al., 2015; Li et al., 2016a) and ablation (Li et al., 2016b). For example, Lakretz et al. (2019) conduct an in-depth study of how long short-term memory networks (LSTMs; Hochreiter & Schmidhuber, 1997) capture subject–verb number agreement, and identify two units largely responsible for this phenomenon.

More recently, there has been a growing interest in extending intrinsic probing to collections of neurons. Bau et al. (2019) utilize unsupervised methods to identify important neurons, and then attempt to control a neural network's outputs by selectively modifying them. Bau et al. (2020) pursue a similar goal in a computer vision setting, but ascribe meaning to neurons based on how their activations correlate with particular classifications in images, and are able to modify these manually with interpretable results. Aiming to answer questions on interpretability in computer vision and natural language inference, Mu & Andreas (2020) develop a method to create compositional explanations of individual neurons and investigate the abstractions encoded in them. Vig et al. (2020) analyze how information related to gender and societal biases is encoded in individual neurons and how it is propagated through different model components.

7 Conclusion
In this paper, we introduce a new method for training discriminative intrinsic probes that perform well across any subset of dimensions. To do so, we train a probing classifier with a subset-valued latent variable and demonstrate how the latent subsets can be marginalized out using variational inference. We propose two variational families, based on common sampling designs, to model the posterior over subsets: Poisson sampling and conditional Poisson sampling. We demonstrate that both variants outperform our baselines in terms of mutual information, and that the conditional Poisson variational family generally gives the best performance. Further, we investigate how information is distributed for each attribute across the available languages. Finally, we find empirical evidence of overlap in the specific neurons used to encode morphosyntactic properties across languages.
References
 Ács et al. (2021) Ács, J., Kádár, Á., and Kornai, A. Subword pooling makes a difference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 2284–2295, Online, April 2021. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2021.eacl-main.194.
 Aires (1999) Aires, N. Algorithms to find exact inclusion probabilities for conditional Poisson sampling and Pareto πps sampling designs. Methodology and Computing in Applied Probability, 1(4):457–469, Dec 1999. ISSN 1573-7713. doi: 10.1023/A:1010091628740. URL https://EconPapers.repec.org/RePEc:spr:metcap:v:1:y:1999:i:4:d:10.1023_a:1010091628740.
 Antverg & Belinkov (2021) Antverg, O. and Belinkov, Y. On the pitfalls of analyzing individual neurons in language models. arXiv:2110.07483 [cs.CL], 2021.
 Bau et al. (2019) Bau, A., Belinkov, Y., Sajjad, H., Durrani, N., Dalvi, F., and Glass, J. Identifying and controlling important neurons in neural machine translation. In International Conference on Learning Representations, 2019. URL https://arxiv.org/abs/1811.01157.
 Bau et al. (2020) Bau, D., Zhu, J.-Y., Strobelt, H., Lapedriza, A., Zhou, B., and Torralba, A. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, September 2020. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1907375117. URL https://www.pnas.org/content/117/48/30071.
 Belinkov (2021) Belinkov, Y. Probing classifiers: Promises, shortcomings, and alternatives. arXiv preprint arXiv:2102.12452, 2021. URL https://arxiv.org/abs/2102.12452.
 Belinkov & Glass (2019) Belinkov, Y. and Glass, J. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, March 2019. doi: 10.1162/tacl_a_00254. URL https://doi.org/10.1162/tacl_a_00254.
 Brown et al. (1992) Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., Lai, J. C., and Mercer, R. L. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40, 1992. URL https://www.aclweb.org/anthology/J92-1002.pdf.
 Conneau et al. (2018) Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2126–2136, Melbourne, Australia, 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1198. URL https://aclanthology.org/P18-1198.

 Dalvi et al. (2019) Dalvi, F., Durrani, N., Sajjad, H., Belinkov, Y., Bau, A., and Glass, J. What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6309–6317, July 2019. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v33i01.33016309. URL https://doi.org/10.1609/aaai.v33i01.33016309.
 Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
 Durrani et al. (2020) Durrani, N., Sajjad, H., Dalvi, F., and Belinkov, Y. Analyzing individual neurons in pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4865–4880, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.395. URL https://www.aclweb.org/anthology/2020.emnlp-main.395.
 Eisner (2002) Eisner, J. Parameter estimation for probabilistic finite-state transducers. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 1–8, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073085. URL https://www.aclweb.org/anthology/P02-1001.
 Gates et al. (2019) Gates, A. J., Wood, I. B., Hetrick, W. P., and Ahn, Y.-Y. Element-centric clustering comparison unifies overlaps and hierarchy. Scientific Reports, 9(1):8574, June 2019. ISSN 2045-2322. doi: 10.1038/s41598-019-44892-y. URL https://doi.org/10.1038/s41598-019-44892-y.
 Giulianelli et al. (2018) Giulianelli, M., Harding, J., Mohnert, F., Hupkes, D., and Zuidema, W. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 240–248, Brussels, Belgium, 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5426. URL https://aclanthology.org/W18-5426.
 Goldberg (2019) Goldberg, Y. Assessing BERT’s syntactic abilities. arXiv:1901.05287 [cs], January 2019.
 Goodwin et al. (2020) Goodwin, E., Sinha, K., and O’Donnell, T. J. Probing linguistic systematicity. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1958–1969, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.177. URL https://aclanthology.org/2020.acl-main.177.
 Gulordava et al. (2018) Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., and Baroni, M. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1195–1205, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1108. URL https://aclanthology.org/N18-1108.
 Guo et al. (2017) Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1321–1330. JMLR.org, Aug 2017. URL http://proceedings.mlr.press/v70/guo17a/guo17a.pdf.
 Hájek (1964) Hájek, J. Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35(4):1491–1523, 1964. ISSN 0003-4851. URL https://www.jstor.org/stable/2238287.
 Hall Maudslay et al. (2020) Hall Maudslay, R., Valvoda, J., Pimentel, T., Williams, A., and Cotterell, R. A tale of a probe and a parser. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7389–7395, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.659. URL https://www.aclweb.org/anthology/2020.acl-main.659.
 Hewitt & Liang (2019) Hewitt, J. and Liang, P. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2733–2743, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1275. URL https://nlp.stanford.edu/pubs/hewitt2019control.pdf.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.
 Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14(4):1303–1347, 2013. ISSN 1533-7928. URL https://www.jmlr.org/papers/volume14/hoffman13a/hoffman13a.pdf.
 Holm (1979) Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979. ISSN 0303-6898. URL http://www.jstor.org/stable/4615733.

Kádár et al. (2017)
Kádár, Á., Chrupała, G., and Alishahi, A.
Representation of linguistic form and function in recurrent neural networks.
Computational Linguistics, 43(4):761–780, December 2017. doi: 10.1162/COLI_a_00300. URL https://aclanthology.org/J174003.  Karpathy et al. (2015) Karpathy, A., Johnson, J., and FeiFei, L. Visualizing and understanding recurrent networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 24, 2016, Workshop Proceedings, November 2015. URL https://arxiv.org/pdf/1506.02078.
 Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, San Diego, CA, May 2015. URL https://arxiv.org/abs/1412.6980.
 Kulesza (2012) Kulesza, A. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012. ISSN 1935-8245. doi: 10.1561/2200000044. URL http://dx.doi.org/10.1561/2200000044.
 Lakretz et al. (2019) Lakretz, Y., Kruszewski, G., Desbordes, T., Hupkes, D., Dehaene, S., and Baroni, M. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 11–20, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1002. URL http://aclweb.org/anthology/N19-1002.
 Li et al. (2016a) Li, J., Chen, X., Hovy, E., and Jurafsky, D. Visualizing and understanding neural models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 681–691, San Diego, California, 2016a. Association for Computational Linguistics. doi: 10.18653/v1/N16-1082. URL https://aclanthology.org/N16-1082.
 Li et al. (2016b) Li, J., Monroe, W., and Jurafsky, D. Understanding neural networks through representation erasure. CoRR, abs/1612.08220, 2016b. URL http://arxiv.org/abs/1612.08220.
 Li & Eisner (2009) Li, Z. and Eisner, J. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 40–51, Singapore, August 2009. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D09-1005.
 Linzen et al. (2016) Linzen, T., Dupoux, E., and Goldberg, Y. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535, 2016. doi: 10.1162/tacl_a_00115. URL https://aclanthology.org/Q16-1037.
 Liu et al. (2019) Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., and Smith, N. A. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1073–1094, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1112. URL https://aclanthology.org/N19-1112.
 Lohr (2019) Lohr, S. L. Sampling: Design and Analysis. CRC Press, 2 edition, 2019. URL https://www.routledge.com/Sampling-Design-and-Analysis/Lohr/p/book/9780367273415.
 McCarthy et al. (2018) McCarthy, A. D., Silfverberg, M., Cotterell, R., Hulden, M., and Yarowsky, D. Marrying universal dependencies and universal morphology. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pp. 91–101, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6011. URL https://aclanthology.org/W18-6011.
 Mu & Andreas (2020) Mu, J. and Andreas, J. Compositional explanations of neurons. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 17153–17163. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/c74956ffb38ba48ed6ce977af6727275-Paper.pdf.
 Murphy (2012) Murphy, K. P. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge, MA, 2012. ISBN 9780262018029. URL https://mitpress.mit.edu/books/machinelearning1.

Nair & Hinton (2010)
Nair, V. and Hinton, G. E.
Rectified linear units improve restricted Boltzmann machines.
In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 807–814, Madison, WI, USA, June 2010. ISBN 9781605589077. URL https://dl.acm.org/doi/10.5555/3104322.3104425.  Nivre et al. (2017) Nivre, J., Agić, Ž., Ahrenberg, L., Antonsen, L., Aranzabe, M. J., Asahara, M., Ateyah, L., Attia, M., Atutxa, A., Augustinus, L., Badmaeva, E., Ballesteros, M., Banerjee, E., Bank, S., Barbu Mititelu, V., Bauer, J., Bengoetxea, K., Bhat, R. A., Bick, E., Bobicev, V., Börstell, C., Bosco, C., Bouma, G., Bowman, S., Burchardt, A., Candito, M., Caron, G., Cebiroğlu Eryiğit, G., Celano, G. G. A., Cetin, S., Chalub, F., Choi, J., Cinková, S., Çöltekin, Ç., Connor, M., Davidson, E., de Marneffe, M.C., de Paiva, V., Diaz de Ilarraza, A., Dirix, P., Dobrovoljc, K., Dozat, T., Droganova, K., Dwivedi, P., Eli, M., Elkahky, A., Erjavec, T., Farkas, R., Fernandez Alcalde, H., Foster, J., Freitas, C., Gajdošová, K., Galbraith, D., Garcia, M., Gärdenfors, M., Gerdes, K., Ginter, F., Goenaga, I., Gojenola, K., Gökırmak, M., Goldberg, Y., Gómez Guinovart, X., Gonzáles Saavedra, B., Grioni, M., Grūzītis, N., Guillaume, B., Habash, N., Hajič, J., Hajič jr., J., Hà Mỹ, L., Harris, K., Haug, D., Hladká, B., Hlaváčová, J., Hociung, F., Hohle, P., Ion, R., Irimia, E., Jelínek, T., Johannsen, A., Jørgensen, F., Kaşıkara, H., Kanayama, H., Kanerva, J., Kayadelen, T., Kettnerová, V., Kirchner, J., Kotsyba, N., Krek, S., Laippala, V., Lambertino, L., Lando, T., Lee, J., Lê Hồng, P., Lenci, A., Lertpradit, S., Leung, H., Li, C. 
Y., Li, J., Li, K., Ljubešić, N., Loginova, O., Lyashevskaya, O., Lynn, T., Macketanz, V., Makazhanov, A., Mandl, M., Manning, C., Mărănduc, C., Mareček, D., Marheinecke, K., Martínez Alonso, H., Martins, A., Mašek, J., Matsumoto, Y., McDonald, R., Mendonça, G., Miekka, N., Missilä, A., Mititelu, C., Miyao, Y., Montemagni, S., More, A., Moreno Romero, L., Mori, S., Moskalevskyi, B., Muischnek, K., Müürisep, K., Nainwani, P., Nedoluzhko, A., NešporeBērzkalne, G., Nguyễn Thị, L., Nguyễn Thị Minh, H., Nikolaev, V., Nurmi, H., Ojala, S., Osenova, P., Östling, R., Øvrelid, L., Pascual, E., Passarotti, M., Perez, C.A., Perrier, G., Petrov, S., Piitulainen, J., Pitler, E., Plank, B., Popel, M., Pretkalniņa, L., Prokopidis, P., Puolakainen, T., Pyysalo, S., Rademaker, A., Ramasamy, L., Rama, T., Ravishankar, V., Real, L., Reddy, S., Rehm, G., Rinaldi, L., Rituma, L., Romanenko, M., Rosa, R., Rovati, D., Sagot, B., Saleh, S., Samardžić, T., Sanguinetti, M., Saulīte, B., Schuster, S., Seddah, D., Seeker, W., Seraji, M., Shen, M., Shimada, A., Sichinava, D., Silveira, N., Simi, M., Simionescu, R., Simkó, K., Šimková, M., Simov, K., Smith, A., Stella, A., Straka, M., Strnadová, J., Suhr, A., Sulubacak, U., Szántó, Z., Taji, D., Tanaka, T., Trosterud, T., Trukhina, A., Tsarfaty, R., Tyers, F., Uematsu, S., Urešová, Z., Uria, L., Uszkoreit, H., Vajjala, S., van Niekerk, D., van Noord, G., Varga, V., Villemonte de la Clergerie, E., Vincze, V., Wallin, L., Washington, J. N., Wirén, M., Wong, T.s., Yu, Z., Žabokrtský, Z., Zeldes, A., Zeman, D., and Zhu, H. Universal dependencies 2.1, 2017. URL http://hdl.handle.net/11234/12515. LINDAT/CLARIAHCZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
 Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
 Peters et al. (2018) Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL https://aclanthology.org/N18-1202.
 Pimentel et al. (2020) Pimentel, T., Valvoda, J., Hall Maudslay, R., Zmigrod, R., Williams, A., and Cotterell, R. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4609–4622, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.420. URL https://aclanthology.org/2020.acl-main.420.
 Poliak et al. (2018) Poliak, A., Haldar, A., Rudinger, R., Hu, J. E., Pavlick, E., White, A. S., and Van Durme, B. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 67–81, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1007. URL https://aclanthology.org/D18-1007.

Raffel et al. (2020)
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou,
Y., Li, W., and Liu, P. J.
Exploring the limits of transfer learning with a unified texttotext transformer.
Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20074.html.  Ravichander et al. (2021) Ravichander, A., Belinkov, Y., and Hovy, E. Probing the probing paradigm: Does probing accuracy entail task relevance? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 3363–3377, Online, April 2021. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2021.eaclmain.295.
 Rethmeier et al. (2020) Rethmeier, N., Saxena, V. K., and Augenstein, I. TX-Ray: Quantifying and explaining model-knowledge transfer in (un-)supervised NLP. In Adams, R. P. and Gogate, V. (eds.), Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, pp. 197. AUAI Press, 2020. URL http://dblp.uni-trier.de/db/conf/uai/uai2020.html#RethmeierSA20.
 Rogers et al. (2020) Rogers, A., Kovaleva, O., and Rumshisky, A. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866, 2020. doi: 10.1162/tacl_a_00349. URL https://www.aclweb.org/anthology/2020.tacl-1.54.
 Rush (2020) Rush, A. Torch-Struct: Deep structured prediction library. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 335–342, Online, July 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.acl-demos.38.

Şahin et al. (2020)
Şahin, G. G., Vania, C., Kuznetsov, I., and Gurevych, I.
LINSPECTOR: Multilingual probing tasks for word
representations.
Computational Linguistics, 46(2):335–385,
2020.
doi: 10.1162/coli
_a
_00376. URL https://doi.org/10.1162/coli_a_00376. 
Tang et al. (2020)
Tang, G., Sennrich, R., and Nivre, J.
Understanding pure characterbased neural machine translation: The case of translating Finnish into English.
In Proceedings of the 28th International Conference on Computational Linguistics, pp. 4251–4262, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.colingmain.375. URL https://www.aclweb.org/anthology/2020.colingmain.375.  Tenney et al. (2018) Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., McCoy, R. T., Kim, N., Durme, B. V., Bowman, S. R., Das, D., and Pavlick, E. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, September 2018. URL https://openreview.net/forum?id=SJzSgnRcKX.
 Torroba Hennigen et al. (2020) Torroba Hennigen, L., Williams, A., and Cotterell, R. Intrinsic probing through dimension selection. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 197–216, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.15. URL https://www.aclweb.org/anthology/2020.emnlp-main.15.
 Vig et al. (2020) Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. Investigating gender bias in language models using causal mediation analysis. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 12388–12401. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf.
 Voita & Titov (2020) Voita, E. and Titov, I. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 183–196, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.14. URL https://www.aclweb.org/anthology/2020.emnlp-main.14.
 Vulić et al. (2020) Vulić, I., Ponti, E. M., Litschko, R., Glavaš, G., and Korhonen, A. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7222–7240, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.586. URL https://www.aclweb.org/anthology/2020.emnlp-main.586.

 Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992. URL https://link.springer.com/article/10.1007/BF00992696.
 Wolf et al. (2020) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv:1910.03771 [cs], February 2020.
 Zhang & Bowman (2018) Zhang, K. and Bowman, S. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 359–361, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5448. URL https://www.aclweb.org/anthology/W18-5448.
 Zou & Hastie (2005) Zou, H. and Hastie, T. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(2):301–320, 2005. ISSN 1369-7412. URL https://www.jstor.org/stable/3647580.
Appendix A Variational Lower Bound
The derivation of the variational lower bound is shown below:
\begin{align}
\log p(\pi \mid \mathbf{h}) &= \log \sum_{C \subseteq \mathcal{D}} p(\pi, C \mid \mathbf{h}) = \log \sum_{C \subseteq \mathcal{D}} q(C)\, \frac{p(\pi, C \mid \mathbf{h})}{q(C)} && (11) \\
&\geq \sum_{C \subseteq \mathcal{D}} q(C) \log p(\pi \mid \mathbf{h}, C) + \sum_{C \subseteq \mathcal{D}} q(C) \log p(C) + \mathrm{H}(q) && (12)
\end{align}
where $\mathcal{D}$ is the set of all neurons, the inequality follows from Jensen's inequality, and $\mathrm{H}(q)$ denotes the entropy of the variational distribution $q(C)$.
Appendix B List of Probed Morphosyntactic Attributes
The 29 language–attribute pairs we probe for in this work are listed below:

Arabic: Aspect, Case, Definiteness, Gender, Mood, Number, Voice

English: Number, Tense

Finnish: Case, Number, Person, Tense, Voice

Polish: Animacy, Case, Gender, Number, Tense

Portuguese: Gender, Number, Tense

Russian: Animacy, Aspect, Case, Gender, Number, Tense, Voice
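For reference, the 29 pairs above can be collected into a small mapping (a convenience sketch, not part of the paper's released code):

```python
# Language -> probed morphosyntactic attributes, as listed above.
PROBED_ATTRIBUTES = {
    "Arabic": ["Aspect", "Case", "Definiteness", "Gender", "Mood", "Number", "Voice"],
    "English": ["Number", "Tense"],
    "Finnish": ["Case", "Number", "Person", "Tense", "Voice"],
    "Polish": ["Animacy", "Case", "Gender", "Number", "Tense"],
    "Portuguese": ["Gender", "Number", "Tense"],
    "Russian": ["Animacy", "Aspect", "Case", "Gender", "Number", "Tense", "Voice"],
}

# Sanity check: 29 language–attribute pairs in total.
n_pairs = sum(len(attrs) for attrs in PROBED_ATTRIBUTES.values())
```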
Appendix C Training and Hyperparameter Tuning
We train our probes for a maximum of epochs using the Adam optimizer (Kingma & Ba, 2015). We add early stopping with a patience of as a regularization technique. Early stopping is conducted by holding out 10% of the training data; our development set is reserved for the greedy selection of subsets of neurons. Our implementation is built with PyTorch (Paszke et al., 2019). To enable a fair comparison with Dalvi et al. (2019), we train all probes other than the Gaussian probe with Elastic Net regularization (Zou & Hastie, 2005), which combines L1 and L2 regularization, weighted by tunable regularization coefficients λ1 and λ2, respectively. We follow the experimental setup proposed by Dalvi et al. (2019), where we set for all probes. In a preliminary experiment, we performed a grid search over these hyperparameters and confirmed that the probe is not very sensitive to their values (unless they are extreme), which aligns with the claim in Dalvi et al. (2019). For the Gaussian probe, we take the MAP estimate with a weak data-dependent prior (Murphy, 2012, Chapter 4). In addition, we found that a slight performance improvement for Poisson and Conditional Poisson was obtained by scaling the entropy term in eq. 3 by a factor of .
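The Elastic Net penalty combines the two regularizers into one term added to the probe's training loss; a minimal sketch (with placeholder coefficients, since the paper's exact settings are not reproduced here):

```python
import numpy as np

def elastic_net_penalty(weights, lam1=1e-5, lam2=1e-5):
    """Elastic Net regularization (Zou & Hastie, 2005): a weighted sum of an
    L1 penalty (encouraging sparsity) and an L2 penalty (shrinking weights).
    lam1 and lam2 are placeholder coefficients, not the paper's values."""
    w = np.asarray(weights, dtype=float)
    return lam1 * np.abs(w).sum() + lam2 * (w ** 2).sum()
```

During training, this penalty would be added to the probe's cross-entropy loss before each optimizer step.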
Appendix D Supplementary Results
Tab. 3 compares the accuracy of our two models, Poisson and Conditional Poisson, to the Linear, Gaussian, and Upper Bound baselines. The table reflects the trend observed in Tab. 2: Poisson and Conditional Poisson generally outperform the Linear baseline. However, Gaussian achieves higher accuracy, except in the high-dimension regime.
[Table 3: accuracy of the Cond. Poisson, Poisson, Linear, and Gaussian probes and the Upper Bound baseline; numeric values not recovered.]