A Tale of a Probe and a Parser

Measuring what linguistic information is encoded in neural models of language has become popular in NLP. Researchers approach this enterprise by training "probes": supervised models designed to extract linguistic structure from another model's output. One such probe is the structural probe (Hewitt and Manning, 2019), designed to quantify the extent to which syntactic information is encoded in contextualised word representations. The structural probe has a novel design, unattested in the parsing literature, the precise benefit of which is not immediately obvious. To explore whether syntactic probes would do better to make use of existing techniques, we compare the structural probe to a more traditional parser with an identical lightweight parameterisation. The parser outperforms the structural probe on UUAS in seven of nine analysed languages, often by a substantial amount (e.g. by 11.1 points in English). Under a second, less common metric, however, the trend is reversed: the structural probe outperforms the parser. This raises the question: which metric should we prefer?



1 Introduction

Recently, unsupervised sentence encoders such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have become popular within NLP. These pre-trained models boast impressive performance when used in many language-related tasks, but this gain has come at the cost of interpretability. A natural question to ask, then, is whether these models encode the traditional linguistic structures one might expect, such as part-of-speech tags or syntactic trees. To this end, researchers have invested in the design of diagnostic tools commonly referred to as probes (Alain and Bengio, 2017; Conneau et al., 2018; Hupkes et al., 2018; Poliak et al., 2018; Marvin and Linzen, 2018; Niven and Kao, 2019). Probes are supervised models designed to extract a target linguistic structure from the output representation learned by another model.

Based on the authors’ reading of the probing literature, there is little consensus on where to draw the line between probes and models for performing a target task (e.g. a part-of-speech tagger versus a probe for identifying part-of-speech). The main distinction appears to be one of researcher intent: probes are, in essence, a visualisation method (Hupkes et al., 2018). Their goal is not to best the state-of-the-art, but rather to indicate whether certain information is readily available in a model: probes should not “dig” for information, they should just expose what is already present. Indeed, a sufficiently expressive probe with enough training data could learn any task (Hewitt and Liang, 2019), but this tells us nothing about a representation, so it is beside the point. For this reason, probes are made “simple” (Liu et al., 2019), which usually means they are minimally parameterised. [Footnote 1: An information-theoretic take on probe complexity is the subject of concurrent work; see Pimentel et al. (2020).]

Syntactic probes, then, are designed to measure the extent to which a target model encodes syntax. A popular example is the structural probe (Hewitt and Manning, 2019), used to compare the syntax that is decodable from different contextualised word embeddings. Rather than adopting methodology from the parsing literature, this probe utilises a novel approach for syntax extraction. However, the precise motivation for this novel approach is not immediately clear, since it has nothing to do with model complexity, and appears orthogonal to the goal of a probe. Probes are designed to help researchers understand what information exists in a model, and unfamiliar ways of measuring this information may obscure whether we are actually gaining an insight about the representation we wish to measure, or the tool of measurement itself.

Using the structural probe as a case study, we explore whether there is merit in designing models specifically for the purpose of probing: whether we should distinguish between the fundamental design of probes and models for performing an equivalent task, as opposed to just comparing their simplicity. We pit the structural probe against a simple parser that has the exact same lightweight parameterisation, but instead employs a standard loss function for parsing. Experimenting on multilingual BERT (Devlin et al., 2019), we find that in seven of nine typologically diverse languages studied (Arabic, Basque, Czech, English, Finnish, Japanese, Korean, Tamil, and Turkish), the parser boosts UUAS dramatically (by, for example, 11.1 points in English).

In addition to using UUAS, Hewitt and Manning (2019) also introduce a new metric: the correlation of pairwise distance predictions with the gold standard. We find that the structural probe outperforms the more traditional parser substantially in terms of this new metric, but it is unclear why this metric matters more than UUAS. In our discussion, we contend that, unless a convincing argument to the contrary is provided, traditional metrics are preferable. Justifying metric choice is of central importance for probing, lest we muddy the waters with a preponderance of ill-understood metrics.

2 Syntactic Probing Using Distance

Here we introduce syntactic distance, which we will later train a probe to approximate.

Figure 1: Example of an undirected dependency tree for the sentence “My displeasure in everything displeases me”, with edges My–displeasure, in–everything, displeasure–everything, displeasure–displeases, and displeases–me. We observe that the syntactic distance between displeases and everything is 2 (the red path, which runs through displeasure).

Syntactic Distance

The syntactic distance between two words in a sentence is, informally, the number of steps between them in an undirected parse tree. Let $\mathbf{w} = w_1 \cdots w_n$ be a sentence of length $n$. A parse tree $\mathbf{t}$ belonging to the sentence $\mathbf{w}$ is an undirected spanning tree of $n$ vertices, each representing a word in the sentence $\mathbf{w}$. The syntactic distance between two words $w_i$ and $w_j$, denoted $\Delta_{\mathbf{t}}(w_i, w_j)$, is defined as the length of the shortest path from $w_i$ to $w_j$ in the tree $\mathbf{t}$, where each edge has weight $1$. Note that $\Delta_{\mathbf{t}}(\cdot, \cdot)$ is a distance in the technical sense of the word: it is non-negative, symmetric, and satisfies the triangle inequality.

Tree Extraction

Converting from syntactic distance to a syntactic tree representation (or vice versa) is trivial and deterministic:

Proposition 1.

There is a bijection between syntactic distance and undirected spanning trees.


Proof. Suppose we have the syntactic distances $\Delta_{\mathbf{t}}(w_i, w_j)$ for an unknown, undirected spanning tree $\mathbf{t}$. We may uniquely recover that tree by constructing a graph with an edge between $w_i$ and $w_j$ iff $\Delta_{\mathbf{t}}(w_i, w_j) = 1$. (This analysis also holds if we have access to only the ordering of the distances between all pairs of words, rather than the perfect distance calculations; in that case, the minimum spanning tree could be computed with, e.g., Prim’s algorithm.) On the other hand, if we have an undirected spanning tree $\mathbf{t}$ and wish to recover the syntactic distances, we only need to compute the shortest path between each pair of words, with e.g. the Floyd–Warshall algorithm, to yield $\Delta_{\mathbf{t}}(\cdot, \cdot)$ uniquely. ∎
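Both directions of Prop. 1 can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the edge list is our reading of the Figure 1 tree (0-indexed), which is an assumption.

```python
from itertools import product

def tree_to_distances(n, edges):
    """All-pairs path lengths in an unweighted tree, via Floyd-Warshall."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for i, j in edges:
        dist[i][j] = dist[j][i] = 1
    # product() varies the last index fastest, so k is the outermost loop.
    for k, i, j in product(range(n), repeat=3):
        if dist[i][k] + dist[k][j] < dist[i][j]:
            dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def distances_to_tree(dist):
    """Recover the tree: an edge exists iff the syntactic distance is 1."""
    n = len(dist)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] == 1}

# Assumed reading of Figure 1: My(0) displeasure(1) in(2) everything(3)
# displeases(4) me(5).
edges = {(0, 1), (2, 3), (1, 3), (1, 4), (4, 5)}
dist = tree_to_distances(6, edges)
recovered = distances_to_tree(dist)
```

Here `dist[4][3]` is the distance between displeases and everything, and `recovered` contains the same edges that were passed in, which is the round trip the proposition describes.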

3 Probe, Meet Parser

In this section, we introduce a popular syntactic probe and a more traditional parser.

3.1 The Structural Probe

Hewitt and Manning (2019) introduce a novel method for approximating the syntactic distance between any two words in a sentence. They christen their method the structural probe, since it is intended to uncover latent syntactic structure in contextual embeddings. [Footnote 2: In actual fact, the structural probe consists of two probes, one used to estimate the syntactic distance between words (which recovers an undirected graph) and another to calculate their depth in the tree (which is used to recover ordering). In this work, we focus exclusively on the former.]

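To do this, they define a parameterised distance function whose parameters are to be learned from data. For a word $w_i$, let $\mathbf{h}_i \in \mathbb{R}^d$ denote its contextual embedding, where $d$ is the dimensionality of the embeddings from the model we wish to probe, such as BERT. Hewitt and Manning (2019) define the parameterised distance function

$$d_B(\mathbf{h}_i, \mathbf{h}_j) = \sqrt{(\mathbf{h}_i - \mathbf{h}_j)^\top B^\top B \, (\mathbf{h}_i - \mathbf{h}_j)} \tag{1}$$

where $B \in \mathbb{R}^{r \times d}$ is to be learned from data, and $r$ is a user-defined hyperparameter. The matrix $B^\top B$ is positive semi-definite and has rank at most $r$. [Footnote 3: To see this, let $\mathbf{v} \in \mathbb{R}^d$ be a vector. Then, we have that $\mathbf{v}^\top B^\top B \mathbf{v} = (B\mathbf{v})^\top (B\mathbf{v}) = \lVert B\mathbf{v} \rVert_2^2 \geq 0$.]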
The goal of the loss function, then, is to find $B$ such that the distance function $d_B(\cdot, \cdot)$ best approximates $\Delta_{\mathbf{t}}(\cdot, \cdot)$. If we organise our training data into pairs, each consisting of a gold tree $\mathbf{t}$ and the sentence $\mathbf{w}$ it spans, we can then define the local loss function as

$$\ell(\mathbf{t}, \mathbf{w} \mid B) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left| \Delta_{\mathbf{t}}(w_i, w_j) - d_B(\mathbf{h}_i, \mathbf{h}_j) \right| \tag{2}$$

which is then averaged over the entire training set $\mathcal{D}$ to create the following global loss function

$$\mathcal{L}(B) = \sum_{(\mathbf{t}, \mathbf{w}) \in \mathcal{D}} \frac{1}{|\mathbf{w}|^2} \, \ell(\mathbf{t}, \mathbf{w} \mid B) \tag{3}$$

Dividing the contribution of each local loss by the square of the length of its sentence (the $|\mathbf{w}|^2$ factor in the denominator) ensures that each sentence makes an equal contribution to the overall loss, to avoid a bias towards the effect of longer sentences. This global loss can be minimised computationally using stochastic gradient descent. [Footnote 4: Hewitt and Manning found that replacing $d_B(\cdot, \cdot)$ in eq. 2 with $d_B(\cdot, \cdot)^2$ yielded better empirical results, so we do the same. For a discussion of this, refer to App. A.1 in Hewitt and Manning (2019). Coenen et al. (2019) later offer a theoretical motivation, based on embedding trees in Euclidean space.]
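As a rough illustration of the structural probe's objective, the sketch below computes the probe distance as the Euclidean norm of the $B$-transformed difference of two embeddings, and a loss that sums absolute errors over unordered word pairs, normalised by squared sentence length. The function names, the toy matrix, and the toy embeddings are all invented for illustration; the sketch uses the unsquared distance, whereas footnote 4 notes that the squared variant is used in practice, and no gradient step is shown.

```python
import math

def probe_distance(B, h_i, h_j):
    """Euclidean norm of B (h_i - h_j), for an r x d matrix B given as rows."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    projected = [sum(b_k * x for b_k, x in zip(row, diff)) for row in B]
    return math.sqrt(sum(p * p for p in projected))

def local_loss(B, embeddings, gold_dist):
    """Sum over word pairs i < j of |gold distance - predicted distance|."""
    n = len(embeddings)
    return sum(
        abs(gold_dist[i][j] - probe_distance(B, embeddings[i], embeddings[j]))
        for i in range(n)
        for j in range(i + 1, n)
    )

def global_loss(B, dataset):
    """Sum of local losses, each divided by its squared sentence length."""
    return sum(local_loss(B, h, gold) / len(h) ** 2 for h, gold in dataset)

# Toy data (invented): 3 words on a chain 0 - 1 - 2, with d = 2 and r = 1.
B = [[1.0, 0.0]]                                    # keeps only the first coordinate
embeddings = [[0.0, 5.0], [1.0, -2.0], [2.0, 7.0]]
gold = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
loss = global_loss(B, [(embeddings, gold)])
```

In this toy case the first embedding coordinate already encodes position along the chain, so a rank-1 $B$ reproduces every gold distance and the loss is zero; the second coordinate is ignored entirely.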

3.2 A Structured Perceptron Parser

Given that probe simplicity seemingly refers to parameterisation rather than the design of the loss function, we infer that swapping the loss function should not be understood as increasing model complexity. With that in mind, here we describe an alternative to the structural probe which learns parameters for the same function $d_B$: a structured perceptron dependency parser, originally introduced in McDonald et al. (2005).

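This parser’s loss function works not by predicting every pairwise distance, but instead by predicting the tree based on the current estimation of the distances between each pair of words, then comparing the total weight of that tree to the total weight of the gold tree (based on the current distance predictions). The local perceptron loss is defined as

$$\ell(\mathbf{t}, \mathbf{w} \mid B) = \sum_{(i,j) \in \mathbf{t}} d_B(\mathbf{h}_i, \mathbf{h}_j) - \min_{\mathbf{t}' \in \mathcal{T}(\mathbf{w})} \sum_{(i',j') \in \mathbf{t}'} d_B(\mathbf{h}_{i'}, \mathbf{h}_{j'}) \tag{4}$$

where $\mathcal{T}(\mathbf{w})$ denotes the set of undirected spanning trees over $\mathbf{w}$. When the predicted minimum spanning tree $\mathbf{t}'$ perfectly matches the gold tree $\mathbf{t}$, each edge will cancel and this loss will equal zero. Otherwise, it will be positive (barring ties between spanning trees), since the sum of the predicted distances for the edges in the gold tree will exceed the sum in the minimum spanning tree. Local loss is then trivially summed into a global loss as

$$\mathcal{L}(B) = \sum_{(\mathbf{t}, \mathbf{w}) \in \mathcal{D}} \ell(\mathbf{t}, \mathbf{w} \mid B) \tag{5}$$

This quantity can also be minimised with a (sub)gradient-based method.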

Though both the structural probe and the structured perceptron parser may seem equivalent under Prop. 1, there is a subtle but important difference. To minimise the loss in eq. 2, the structural probe needs to encode (in $d_B$) the rank-ordering of the distances between each pair of words within a sentence. This is not necessarily the case for the structured perceptron. It could minimise the loss in eq. 4 by just encoding each pair of words as “near” or “far”, and Prim’s algorithm will do the rest. [Footnote 5: One reviewer argued that, by injecting the tree constraint into the model in this manner, we lose the ability to answer the question of whether a probe discovered trees organically. While we believe this is valid, we do not see why the same criticism cannot be levelled against the structural probe; after all, it is trained on the same trees, just processed into pairwise distance matrices. The trees have been obfuscated, to be sure, but they remain in the data.]
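The perceptron loss described in this section can be sketched as follows: find the minimum spanning tree under the current predicted distances, then subtract its weight from the weight of the gold tree under those same distances. This is a minimal sketch with hypothetical function names and an invented distance matrix; the simple quadratic-scan variant of Prim's algorithm is chosen for brevity and is not necessarily what the authors implemented.

```python
def prim_mst(dist):
    """Edges of a minimum spanning tree of the complete graph over dist."""
    n = len(dist)
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        # Pick the cheapest edge leaving the current partial tree.
        i, j = min(
            ((u, v) for u in in_tree for v in range(n) if v not in in_tree),
            key=lambda edge: dist[edge[0]][edge[1]],
        )
        in_tree.add(j)
        edges.add((min(i, j), max(i, j)))
    return edges

def tree_weight(dist, edges):
    """Total predicted distance along the edges of a tree."""
    return sum(dist[i][j] for i, j in edges)

def perceptron_loss(dist, gold_edges):
    """Gold-tree weight minus minimum-spanning-tree weight (never negative)."""
    return tree_weight(dist, gold_edges) - tree_weight(dist, prim_mst(dist))

# Invented predicted distances for a 4-word sentence.
dist = [
    [0.0, 1.0, 4.0, 3.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 1.5],
    [3.0, 5.0, 1.5, 0.0],
]
gold_match = {(0, 1), (1, 2), (2, 3)}  # coincides with the predicted tree
gold_other = {(0, 1), (1, 2), (0, 3)}  # differs from it in one edge
```

With `gold_match` the two sums cancel and the loss is zero; with `gold_other` the loss is the extra weight of edge (0, 3) over edge (2, 3). Note that only the relative cheapness of edges matters here, which is the “near” versus “far” point made above.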

4 Experimental Setup

4.1 Processing Results

Embeddings and Data

We experimented on the contextual embeddings in the final hidden layer of the pre-trained multilingual release of BERT (Devlin et al., 2019), and trained the models on the Universal Dependencies (Nivre et al., 2016) treebanks (v2.4). This allowed our analysis to be multilingual. More specifically, we consider eight typologically diverse languages (Arabic, Basque, Czech, Finnish, Japanese, Korean, Tamil, and Turkish), plus English.

Decoding the Predicted Trees

Having trained a model to find a $d_B$ that approximates $\Delta_{\mathbf{t}}$, it is trivial to decode test sentences into trees (see Prop. 1). For an unseen sentence $\mathbf{w}$, we compute the $|\mathbf{w}| \times |\mathbf{w}|$ pairwise distance matrix $D$:

$$D_{uv} = d_B(\mathbf{h}_u, \mathbf{h}_v) \tag{6}$$

We can then compute the predicted tree $\mathbf{t}'$ from $D$ using Prim’s algorithm, which returns the minimum spanning tree from the predicted distances.

4.2 Experiments

Figure 2: Results for the metrics in Hewitt and Manning (2019): different metrics, opposite trends. Panels: (a) UUAS results; (b) DSpr results.

To compare the performance of the models, we used both metrics from Hewitt and Manning (2019), plus a new variant of the second.


The undirected unlabeled attachment score (UUAS) is a standard metric in the parsing literature, which reports the percentage of correctly identified edges in the predicted tree.


The second metric is the Spearman rank-order correlation between the predicted distances, which are output from $d_B$, and the gold-standard distances (computable from the gold tree using the Floyd–Warshall algorithm). Hewitt and Manning term this metric distance Spearman (DSpr). While UUAS measures whether the model captures edges in the tree, DSpr considers pairwise distances between all vertices in the tree, including those which are not connected in a single step.


As a final experiment, we run DSpr again, but first pass each pairwise distance matrix $D$ through Prim’s algorithm (to recover the predicted tree) and then through Floyd–Warshall (to recover a new distance matrix, with distances calculated based on the predicted tree). This post-processing would convert a “near”–“far” matrix encoding to a precise rank-order one. This should positively affect the results, in particular for the parser, since that is trained to predict trees which result from the pairwise distance matrix, not the pairwise distance matrix itself.
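To make the two base metrics concrete, the sketch below computes UUAS as the percentage of gold edges found in the predicted tree, and a simplified DSpr as the Spearman correlation over all unordered word pairs of a single sentence. The helper names and toy data are invented, and the simplification is an assumption on our part: the way Hewitt and Manning (2019) aggregate correlations across words and sentence lengths is not reproduced here.

```python
import math

def undirected(edges):
    """Normalise edges so that (i, j) and (j, i) compare equal."""
    return {tuple(sorted(edge)) for edge in edges}

def uuas(pred_edges, gold_edges):
    """Percentage of gold edges that also appear in the predicted tree."""
    gold = undirected(gold_edges)
    return 100.0 * len(undirected(pred_edges) & gold) / len(gold)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in order[start:end + 1]:
            result[k] = (start + end) / 2 + 1
        start = end + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def dspr(pred_dist, gold_dist):
    """Spearman correlation over all unordered word pairs of one sentence."""
    n = len(gold_dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return spearman([pred_dist[i][j] for i, j in pairs],
                    [gold_dist[i][j] for i, j in pairs])

# Toy 4-word sentence with gold chain 0 - 1 - 2 - 3 (values invented).
gold_edges = {(0, 1), (1, 2), (2, 3)}
pred_edges = {(1, 0), (1, 2), (1, 3)}
gold_dist = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
pred_dist = [[2.0 * x for x in row] for row in gold_dist]

score = uuas(pred_edges, gold_edges)      # two of three gold edges recovered
correlation = dspr(pred_dist, gold_dist)  # same ranking, so close to 1
```

The toy prediction doubles every gold distance, which leaves the ranking untouched: a rank-based metric such as DSpr is indifferent to any monotone rescaling of the predicted distances, whereas UUAS depends only on which edges end up in the decoded tree.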

5 Results

This section presents results for the structural probe and structured perceptron parser.

UUAS Results

Figure 2 presents UUAS results for both models. The parser is the highest performing model on seven of the nine languages. In many of these the difference is substantial: in English, for instance, the parser outperforms the structural probe by 11.1 UUAS points. [Footnote 6: We used the UD treebanks rather than the Penn Treebank (Marcus et al., 1993), and experimented on the final hidden layer of multilingual BERT using a subset of 12,000 sentences from the larger treebanks. This renders our numbers incomparable to those found in Hewitt and Manning (2019).]

DSpr Results

The DSpr results (Figure 2) show the opposite trend: the structural probe outperforms the parser on all languages. The parser performs particularly poorly on Japanese and Arabic, which is surprising, given that these had the second and third largest sets of training data for BERT, respectively (refer to Table 1 in the appendices). We speculate that this may be because, in the treebanks used, Japanese and Arabic have a longer average sentence length than other languages.

Figure 3: Post-processed DSpr results: DSpr following the application of Prim’s then Floyd–Warshall to $D$.

Post-processed DSpr Results

Following the post-processing step, the difference in DSpr (shown in Figure 3) is far less stark than previously suggested: the mean difference between the two across all nine languages is just 0.0006 (in favour of the parser). Notice in particular the improvement for both Arabic and Japanese: where previously (in the vanilla DSpr) the structured perceptron vastly underperformed, the post-processing step closes the gap almost entirely. Though Prop. 1 implies that we do not need to consider the full pairwise output of $d_B$ to account for global properties of the tree, this is not totally borne out in our empirical findings, since we do not see the same trend in DSpr as we do in UUAS. If we recover the gold tree, we will have a perfect correlation with the true syntactic distance; but we do not always recover the gold tree (the UUAS is less than 100%), and therefore the errors the parser makes are pronounced.

6 Discussion: Probe v. Parser

Although we agree that probes should be somehow more constrained in their complexity than models designed to perform well on tasks, we see no reason why being a “probe” should necessitate fundamentally different design choices. It seems clear from our results that how one designs a probe has a notable effect on the conclusions one might draw about a representation. Our parser was trained to recover trees (so it is more attuned to UUAS), whilst the structural probe was trained to recover pairwise distances (so it is more attuned to DSpr). Viewed this way, our results are not surprising in the least.

The fundamental question for probe designers, then, is which metric best captures a linguistic structure believed to be a property of a given representation—in this case, syntactic dependency. We suggest that probing research should focus more explicitly on this question—on the development and justification of probing metrics. Once a metric is established and well motivated, a lightweight probe can be developed to determine whether that structure is present in a model.

If proposing a new metric, however, the burden of proof lies with the researcher to articulate and demonstrate why it is worthwhile. Moreover, this process of exploring which details a new metric is sensitive to (and comparing with existing metrics) ought not be conflated with an analysis of a particular model (e.g. BERT): it should be clear whether the results enable us to draw conclusions about a model, or about a means of analysing one.

For syntactic probing, there is certainly no a priori reason why one should prefer DSpr to UUAS. If anything, we tentatively recommend UUAS, pending further investigation. The post-processed DSpr results show no clear difference between the models, whereas UUAS exhibits a clear trend in favour of the parser, suggesting that it may be easier to recover pairwise distances from a good estimate of the tree than vice versa. UUAS also has the advantage that it is well described in the literature (and, in turn, well understood by the research community).

According to UUAS, existing methods were able to identify more syntax in BERT than the structural probe. In this context, though, we use these results not to give kudos to BERT, but to argue that the perceptron-based parser is a better tool for syntactic probing. Excluding differences in parameterisation, the line between what constitutes a probe or a model designed for a particular task is awfully thin, and when it comes to syntactic probing, a powerful probe seems to look a lot like a traditional parser.

7 Conclusion

We advocate for the position that, beyond some notion of model complexity, there should be no inherent difference between the design of a probe and a model designed for a corresponding task. We analysed the structural probe (Hewitt and Manning, 2019), and showed that a simple parser with an identical lightweight parameterisation was able to identify more syntax in BERT in seven of nine compared languages under UUAS. However, the structural probe outperformed the parser on a novel metric proposed in Hewitt and Manning (2019), bringing to attention a broader question: how should one choose metrics for probing? In our discussion, we argued that if one is to propose a new metric, they should clearly justify its usage.


We thank John Hewitt for engaging wholeheartedly with our work and sharing many helpful insights.


References

  • G. Alain and Y. Bengio (2017) Understanding intermediate layers using linear classifier probes. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings.
  • A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018) What you can cram into a single $&!#* vector: probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2126–2136.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • R. W. Floyd (1962) Algorithm 97: Shortest path. Communications of the ACM 5 (6), pp. 345.
  • J. Hewitt and P. Liang (2019) Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2733–2743.
  • J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4129–4138.
  • D. Hupkes, S. Veldhoen, and W. Zuidema (2018) Visualisation and ‘diagnostic classifiers’ reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research 61, pp. 907–926.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.
  • N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019) Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1073–1094.
  • M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19 (2), pp. 313–330.
  • R. Marvin and T. Linzen (2018) Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1192–1202.
  • T. Niven and H. Kao (2019) Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4658–4664.
  • J. Nivre, M. de Marneffe, F. Ginter, Y. Goldberg, J. Hajič, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman (2016) Universal dependencies v1: a multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, pp. 1659–1666.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
  • A. Poliak, A. Haldar, R. Rudinger, J. E. Hu, E. Pavlick, A. S. White, and B. Van Durme (2018) Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 67–81.
  • R. C. Prim (1957) Shortest connection networks and some generalizations. The Bell Systems Technical Journal 36 (6), pp. 1389–1401.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
  • S. Warshall (1962) A theorem on boolean matrices. Journal of the ACM 9 (1), pp. 11–12.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Appendix A Training Details

For all models (separately for each language), we considered three hyperparameters: the rank $r$ (full rank when $r = 768$, since this is the dimensionality of the embeddings), the learning rate, and the dropout (Srivastava et al., 2014). To optimise these, we performed a random search, selecting values as judged by loss on the development set. When training, we used a batch size of 64 sentences, and employed early stopping after five steps based on loss reduction. As the optimiser, we used Adam (Kingma and Ba, 2015).

For each language, we used the largest available Universal Dependencies 2.4 treebank. One-word sentences and sentences of over 50 words were discarded, and the larger treebanks were pruned to 12,000 sentences (in an 8:1:1 data split).

We used the BERT implementation made by Wolf et al. (2019). Since BERT accepts WordPiece units (Wu et al., 2016) rather than words, where necessary we averaged the output to get word-level embeddings. This is clearly a naïve composition method; improving it would likely strengthen the results for both the probe and the parser.

Appendix B Multilingual Details

Multilingual BERT has 12 layers, 768 hidden states, and a total of 110M parameters. It was trained on the complete Wikipedia dumps for the 104 languages with the largest Wikipedias. Table 1 reports the size of the Wikipedias for the languages considered in this paper. [Footnote 7: According to https://en.wikipedia.org/wiki/List_of_Wikipedias, sampled 24/10/19.] Further details of the training can be found on Google Research’s GitHub. [Footnote 8: https://github.com/google-research/bert/blob/master/multilingual.md]

Language Articles
Arabic 1,016,152
Basque 342,426
Czech 439,467
English 5,986,229
Finnish 473,729
Japanese 1,178,594
Korean 476,068
Tamil 125,031
Turkish 336,380
Table 1: The number of articles in the Wikipedias of the languages considered.