Asking without Telling: Exploring Latent Ontologies in Contextual Representations

by Julian Michael, et al.

The success of pretrained contextual encoders, such as ELMo and BERT, has brought a great deal of interest in what these models learn: do they, without explicit supervision, learn to encode meaningful notions of linguistic structure? If so, how is this structure encoded? To investigate this, we introduce latent subclass learning (LSL): a modification to existing classifier-based probing methods that induces a latent categorization (or ontology) of the probe's inputs. Without access to fine-grained gold labels, LSL extracts emergent structure from input representations in an interpretable and quantifiable form. In experiments, we find strong evidence of familiar categories, such as a notion of personhood in ELMo, as well as novel ontological distinctions, such as a preference for fine-grained semantic roles on core arguments. Our results provide unique new evidence of emergent structure in pretrained encoders, including departures from existing annotations which are inaccessible to earlier methods.





1 Introduction

The success of self-supervised pretrained models in NLP (Devlin et al., 2019; Peters et al., 2018a; Radford et al., 2019; Lan et al., 2020) has stimulated interest in how these models work and, motivated by their strong performance on many tasks (Wang et al., 2018, 2019), in what they learn about language. Recent work on model analysis (Belinkov and Glass, 2019) indicates that they may learn a lot about linguistic structure, including part of speech (Belinkov et al., 2017a), syntax (Blevins et al., 2018; Marvin and Linzen, 2018), word sense (Peters et al., 2018a; Reif et al., 2019), and more (Tenney et al., 2019; Liu et al., 2019).

Figure 1: LSL overview. A probing classifier over contextual embeddings produces multi-class latent logits, which are marginalized into a single logit trained on binary classification. In this example, “Pierre Vinken” is identified as a named entity and assigned to latent class 2, which aligns well with the PERSON label. We treat the classes as clusters representing a latent ontology that describes the underlying representation space. Figure 2 visualizes latent logits in more detail.

Many of these results are based on predictive methods, such as probing, which measure how well a linguistic variable can be predicted from intermediate representations. However, the ability of supervised probes to fit weak features makes it difficult to find unbiased answers about how those representations are structured (Saphra and Lopez, 2019; Voita et al., 2019). Descriptive methods like clustering and visualization explore this structure directly, but provide limited control and often regress to dominant categories such as lexical features (Singh et al., 2019) or word sense (Reif et al., 2019). This leaves open many questions: how are linguistic features like entity types, syntactic dependencies, or semantic roles represented by an encoder like ELMo (Peters et al., 2018a) or BERT (Devlin et al., 2019)? To what extent do familiar categories like PropBank roles or Universal Dependencies appear naturally? Do these unsupervised encoders learn their own categorization of language?

To tackle these questions, we propose a systematic method for extracting latent ontologies, or discrete categorizations of a representation space, which we call latent subclass learning; see Figure 1 for an overview. In LSL, we use a binary classification task (such as detecting entity mentions or syntactic dependency arcs) as weak supervision to induce a set of latent clusters relevant to that task (i.e., entity or dependency types). As with predictive methods, the choice of targets allows us to explore different phenomena, and the induced clusters can be quantified and measured against gold annotations. But also, as with descriptive methods, our clusters can be inspected and qualified directly, and observations have high specificity: agreement with external (e.g., gold) categories can be taken as strong evidence that those categories are salient in the representation space.

We describe the LSL classifier in Section 3, and apply it to the edge probing paradigm (Tenney et al., 2019) in Section 4. In Section 5 we evaluate LSL on multiple encoders, including ELMo and BERT. We find that LSL induces stable and consistent ontologies, which include both striking rediscoveries of gold categories—for example, ELMo discovers personhood of named entities and BERT similarly has a notion of dates—as well as novel ontological distinctions—such as fine-grained semantic roles for core arguments—which are not easily observed by fully supervised probes. Overall, we find unique new evidence of emergent latent structure in our encoders, while also revealing new properties of their representations which are inaccessible to earlier methods.

Figure 2: Latent logit vectors from BERT (left) and ELMo (right) for a sample from the Named Entity labeling development set, visualized in the Embedding Projector (Smilkov et al., 2016) using UMAP (McInnes et al., 2018). Points are colored according to their gold label, and induced clusters are outlined in red. ELMo has a clear notion of personhood (PERSON), while BERT groups people with geopolitical entities (GPE) and nationalities (NORP). On the other hand, BERT strongly identifies dates (DATE) and organizations (ORG), and both models group numeric/quantitative entities together. Both models separate small CARDINAL numbers (roughly, seven or less) and group them with ORDINALs, separate from larger CARDINALs. The outlined areas in the bottom-right of the ELMo visualization include 2 and 4 induced clusters, respectively.

2 Background

Predictive analysis

A common form of model analysis is predictive: assessing how well a linguistic variable can be predicted from a model, whether in intrinsic behavioral tests (Goldberg, 2019; Marvin and Linzen, 2018) or extrinsic probing tasks.

Probing involves training lightweight classifiers over features produced by a pretrained model, and assessing the model’s knowledge by the probe’s performance. Probing has been used for low-level properties such as word order and sentence length (Adi et al., 2017; Conneau et al., 2018), as well as phenomena at the level of syntax (Hewitt and Manning, 2019), semantics (Tenney et al., 2019; Liu et al., 2019; Clark et al., 2019), and discourse structure (Chen et al., 2019). Error analysis on probes has been used to argue that BERT may simulate sequential decision making across layers (Tenney et al., 2019), or that it encodes its own, soft notion of syntactic distance (Reif et al., 2019).

Predictive methods such as probing are flexible: any task with data can be assessed. However, they only track predictability of pre-defined categories, limiting their descriptive power. In addition, a powerful enough probe, given enough data, may be insensitive to differences between encoders, making it difficult to interpret results based on accuracy (Saphra and Lopez, 2019; Zhang and Bowman, 2018). So, many probing experiments appeal to the ease of extraction of a linguistic variable (Pimentel et al., 2020). Existing work has measured this by controlling for the capacity of the probe, either by making relative claims between layers and encoders (Belinkov et al., 2017b; Blevins et al., 2018; Tenney et al., 2019; Liu et al., 2019) or using explicit measures to estimate and trade off probe capacity with accuracy (Hewitt and Liang, 2019; Voita and Titov, 2020). An alternative is to control the amount of supervision, whether by restricting training set size (Zhang and Bowman, 2018), comparing learning curves (Talmor et al., 2019), or using description length with online coding (Voita and Titov, 2020).

We extend this further by removing the distinction between gold categories in the training data and reducing the supervision to binary classification, as explained in Section 3. This extreme measure makes our test high specificity, in the sense that positive results—i.e., when comprehensible categories are recovered by our probe—are much stronger, since a category must be essentially invented without direct supervision.

Descriptive analysis

In contrast to predictive methods, which assess an encoder against particular data, descriptive methods analyze models on their own terms, and include clustering, visualization (Reif et al., 2019), and correlation analysis techniques (Voita et al., 2019; Saphra and Lopez, 2019; Abnar et al., 2019; Chrupała and Alishahi, 2019). Descriptive methods produce high-specificity tests of what structure is present in the model, and facilitate discovery of new patterns that weren’t hypothesized prior to testing. However, they lack the flexibility of predictive methods. Clustering results tend to be dominated by principal components of the embedding space, which correspond to only some salient aspects of linguistic knowledge, such as lexical features (Singh et al., 2019) and word sense (Reif et al., 2019). Alternatively, more targeted latent variable analysis techniques generally have a restricted inventory of inputs, such as layer mixing weights (Peters et al., 2018b) or transformer attention distributions (Clark et al., 2019). As a result of these issues, it is more difficult to discover the underlying structure corresponding to rich, layered ontologies. Our approach retains the advantages of descriptive methods, while admitting more control as the choice of binary classification targets can guide the LSL model to discover structure relevant to a particular linguistic task.

Linguistic ontologies

Questions of what encoders learn about language require well-defined linguistic ontologies, or meaningful categorizations of inputs, to evaluate against. Most analysis work uses formalisms from the classical NLP pipeline, such as part-of-speech and syntax from the Penn Treebank (Marcus et al., 1993) or Universal Dependencies (Nivre et al., 2015), semantic roles from PropBank (Palmer et al., 2005) or Dowty (1991)’s Proto-Roles (Reisinger et al., 2015), and named entities, which have a variety of available ontologies (Pradhan et al., 2007; Ling and Weld, 2012; Choi et al., 2018). Work on ontology-free, or open, representations suggests that the linguistic structure captured by traditional ontologies may be encoded in a variety of possible ways (Banko et al., 2007; He et al., 2015; Michael et al., 2018) while being annotatable at large scale (Fitzgerald et al., 2018). This raises the question: when looking for linguistic knowledge in pretrained encoders, what exactly should we expect to find? Predictive methods are useful for fitting an encoder to an existing ontology; but do our encoders latently hold their own ontologies as well? If so, what do they look like? That is the question we investigate in this work.

3 Approach

We propose a way to extract latent linguistic ontologies from pretrained encoders and systematically compare them to existing gold ontologies. We use a classifier based on latent subclass learning (Section 3.1), which is applicable in any binary classification setting. (Footnote 1: A similar classifier was concurrently developed and presented for use in model distillation by Müller et al. (2020).) We propose several quantitative metrics to evaluate the induced ontologies (Section 3.2), providing a starting point for qualitative analysis (Section 5) and future research.

3.1 Latent Subclass Learning

Consider a logistic regression classifier over inputs $x \in \mathbb{R}^d$. It outputs probabilities according to the following formula:

$$P(y \mid x) = \sigma(w^\top x),$$

where $w \in \mathbb{R}^d$ is a learned parameter. Instead, we propose the latent subclass learning classifier:

$$P_{\mathrm{LSL}}(y \mid x) = \sigma\left(\log \sum_{i=1}^{N} e^{W_i x}\right),$$

where $W \in \mathbb{R}^{N \times d}$ is a parameter matrix, and $N$ is a hyperparameter corresponding to the number of latent classes.

This corresponds to $(N+1)$-way multiclass logistic regression with a fixed 0 baseline for a null class, but trained on binary classification by marginalizing over the non-null classes (Figure 1). The vector $Wx \in \mathbb{R}^N$ may then be treated as a set of latent logits for a random variable $C(x) \in \{1, \dots, N\}$ defined by the softmax distribution. Taking the hard maximum of $Wx$ assigns a latent class $\hat{C}(x)$ to each input, which may be viewed as a weakly supervised clustering, learned on the basis of external supervision but not explicitly optimized to match prior gold categories.
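
The equivalence described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the latent logits are the matrix-vector product of a parameter matrix `W` with the input `x`, and that the binary probability is the total mass an (N+1)-way softmax assigns to the non-null classes when the null logit is fixed at 0. The function names and the toy numbers are illustrative only.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def latent_logits(W, x):
    """Latent logits W x for an N x d matrix W given as a list of rows."""
    return [sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in W]

def lsl_probability(W, x):
    """Binary probability: sigmoid of the log-sum-exp of the latent logits."""
    return sigmoid(logsumexp(latent_logits(W, x)))

def marginal_non_null(W, x):
    """Same quantity via an (N+1)-way softmax with the null logit fixed at 0."""
    z = [0.0] + latent_logits(W, x)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    return sum(exps[1:]) / sum(exps)

def latent_class(W, x):
    """Hard assignment: 0-based index of the largest latent logit."""
    z = latent_logits(W, x)
    return max(range(len(z)), key=z.__getitem__)

# Toy example with made-up numbers: N = 3 latent classes, d = 2 features.
W = [[1.0, -0.5], [0.2, 0.3], [-1.0, 2.0]]
x = [0.5, 1.5]
p = lsl_probability(W, x)
```

Both routes give the same probability because the sigmoid of a log-sum-exp equals the summed exponentials divided by one plus that sum, which is exactly the non-null mass of the softmax.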

For the binary classification loss $\mathcal{L}_{\mathrm{bin}}$, we use the cross-entropy loss on the classifier's output probability $P_{\mathrm{LSL}}(y \mid x)$. However, this does not necessarily encourage a diverse, coherent set of clusters; an LSL classifier may simply choose to collapse all examples into a single category, producing an uninteresting ontology. To mitigate this, we propose two clustering regularizers.

Adjusted batch-level negative entropy

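We wish for the model to induce a diverse ontology. One way to express this is that the expectation of the latent class variable $C(x)$ has high entropy, i.e., we wish to maximize

$$H\left(\mathbb{E}_x\, C(x)\right),$$

where $\mathbb{E}_x\, C(x)$ denotes the average of the per-input class distributions. In practice, we use the expectation over a batch. The maximum value this can take is the entropy of the uniform distribution over $N$ items, or $\log N$. Therefore, we wish to minimize the adjusted batch-level negative entropy loss:

$$\mathcal{L}_{\mathrm{be}} = \log N - H\left(\mathbb{E}_x\, C(x)\right),$$

which takes values in $[0, \log N]$.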

Instance-level entropy

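In addition to using all latent classes in the expected case, we also wish for the model to assign a single coherent class label to each input example. This can be done by minimizing the instance-level entropy loss over the latent class distribution $C(x)$ of each input:

$$\mathcal{L}_{\mathrm{ie}} = \mathbb{E}_x\, H\left(C(x)\right).$$

This also takes values in $[0, \log N]$, where $N$ is the number of latent classes, and we compute the expectation over a batch.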


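We optimize the regularized LSL loss

$$\mathcal{L}_{\mathrm{LSL}} = \mathcal{L}_{\mathrm{bin}} + \alpha \mathcal{L}_{\mathrm{be}} + \beta \mathcal{L}_{\mathrm{ie}},$$

where $\mathcal{L}_{\mathrm{bin}}$ is the binary cross-entropy loss, $\mathcal{L}_{\mathrm{be}}$ and $\mathcal{L}_{\mathrm{ie}}$ are the batch-level and instance-level entropy losses, and $\alpha$ and $\beta$ are hyperparameters, via gradient descent. Together, the regularizers encourage a balanced solution where the model uses many clusters yet gives each input a distinct assignment.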
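
To make the interplay of the two regularizers concrete, here is a small sketch that computes both on a batch of latent logits. It assumes each input's class distribution is the softmax of its latent logits and that the batch-level term is log N minus the entropy of the batch-averaged distribution, as described above. The names `alpha` and `beta` for the two coefficients, and the toy batches, are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Softmax over a list of logits, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(dist):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def batch_entropy_loss(batch_logits):
    """log N minus the entropy of the batch-averaged class distribution."""
    dists = [softmax(z) for z in batch_logits]
    n_classes = len(dists[0])
    mean = [sum(d[i] for d in dists) / len(dists) for i in range(n_classes)]
    return math.log(n_classes) - entropy(mean)

def instance_entropy_loss(batch_logits):
    """Average entropy of each input's own class distribution."""
    dists = [softmax(z) for z in batch_logits]
    return sum(entropy(d) for d in dists) / len(dists)

def regularized_loss(binary_loss, batch_logits, alpha, beta):
    """Binary loss plus the two weighted clustering regularizers."""
    return (binary_loss
            + alpha * batch_entropy_loss(batch_logits)
            + beta * instance_entropy_loss(batch_logits))

# Three toy batches over N = 3 latent classes (made-up logits).
collapsed = [[20.0, 0.0, 0.0]] * 3          # every input in the same class
balanced = [[20.0, 0.0, 0.0],
            [0.0, 20.0, 0.0],
            [0.0, 0.0, 20.0]]               # confident and evenly spread
undecided = [[0.0, 0.0, 0.0]] * 3           # every input maximally unsure
```

The collapsed batch is penalized only by the batch-level term and the undecided batch only by the instance-level term, while the balanced batch drives both close to zero. That is the balanced solution the combined loss is meant to favor.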

| Model | NE P / R / F1 | NE Acc. | NE Div. | NE Unc. | UD P / R / F1 | UD Acc. | UD Div. | UD Unc. |
|---|---|---|---|---|---|---|---|---|
| Gold | 1.0 / 1.0 / 1.0 | 1.0 | 9.71 | 1.00 | 1.0 / 1.0 / 1.0 | 1.0 | 22.91 | 1.00 |
| Multi | .86 / .88 / .87 | .94 | 8.58 | 1.88 | .86 / .83 / .84 | .93 | 21.94 | 1.77 |
| LSL | .28 / .80 / .41 | .96 | 2.85 | 1.45 | .10 / .60 / .18 | .94 | 3.50 | 2.07 |
| LSL +be | .20 / .43 / .27 | .96 | 4.78 | 31.23 | .18 / .13 / .15 | .94 | 29.83 | 12.33 |
| LSL +ie | .13 / 1.0 / .23 | .93 | 1.00 | 1.00 | .09 / .79 / .15 | .94 | 2.00 | 1.01 |
| LSL +be +ie | .43 / .54 / .48 | .88 | 7.00 | 1.10 | .18 / .27 / .22 | .86 | 14.96 | 1.35 |
| Single | .13 / 1.0 / .23 | - | 1.00 | 1.00 | .06 / 1.0 / .11 | - | 1.00 | 1.00 |

Table 1: Model selection results over BERT-large on Named Entities (NE) and Universal Dependencies (UD). Columns report B3 precision / recall / F1, binary classification accuracy (Acc.), diversity (Div.), and uncertainty (Unc.). Multi is the standard multi-class model trained directly on gold labels, and Single is the degenerate single-cluster baseline. The +be and +ie rows add the batch-level and instance-level entropy regularizers. Taken together, the two regularizers yield a good tradeoff between diversity and uncertainty, though at some expense to binary classification accuracy.

3.2 Metrics

For the following metrics, we consider only points in the gold positive class.


B-cubed

We compare induced ontologies to gold using the standard B-cubed (or B3) clustering metrics (Bagga and Baldwin, 1998). For each input point, this calculates the precision and recall of its predicted cluster against its gold cluster. These values are averaged over all points for aggregate scoring. B3 is argued to have favorable properties (Amigó et al., 2009) and allows for label-wise scoring by restricting to points with specific gold labels.
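
As an illustration of the per-point scoring just described, the sketch below computes B3 precision, recall, and F1 for two parallel label lists. It assumes F1 is the harmonic mean of the averaged precision and recall, which matches the arithmetic of the reported table rows. The function name and example labels are hypothetical.

```python
from collections import Counter

def b_cubed(predicted, gold):
    """B3 precision, recall, and F1 for parallel lists of cluster labels.

    For each point, precision is the share of its predicted cluster that
    has the same gold label, and recall is the share of its gold cluster
    that lands in the same predicted cluster. Both are averaged over points.
    """
    n = len(predicted)
    pred_sizes = Counter(predicted)
    gold_sizes = Counter(gold)
    pair_sizes = Counter(zip(predicted, gold))
    precision = sum(pair_sizes[(p, g)] / pred_sizes[p]
                    for p, g in zip(predicted, gold)) / n
    recall = sum(pair_sizes[(p, g)] / gold_sizes[g]
                 for p, g in zip(predicted, gold)) / n
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

# A single induced cluster covering two equally sized gold classes.
single_cluster = b_cubed([0, 0, 0, 0], ["A", "A", "B", "B"])

# A clustering that matches gold up to renaming of the cluster ids.
renamed = b_cubed([1, 1, 0, 0], ["A", "A", "B", "B"])
```

The single-cluster case shows the pattern seen in the degenerate baseline: recall is perfect, while precision falls to the share of each point's own gold class.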

Normalized PMI

Pointwise mutual information (PMI) is commonly used as an association measure reflecting how likely two items (such as tokens in a corpus) are to occur together relative to chance (Church and Hanks, 1990). Normalized PMI (nPMI; Bouma, 2009) is a way of factoring out the effect of item frequency on PMI. Formally, the nPMI of two items $x$ and $y$ is

$$\mathrm{nPMI}(x; y) = \frac{\log \frac{P(x, y)}{P(x)\,P(y)}}{-\log P(x, y)},$$

taking the limit value of -1 when they never occur together, 1 when they only occur together, and 0 when they occur independently. We use nPMI to analyze the co-occurrence of gold labels in predicted clusters: high nPMI pairs are preferentially grouped together by the induced ontology, whereas low nPMI pairs are preferentially distinguished.
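
The three limit cases can be verified with a short sketch of the standard nPMI definition (PMI divided by the negative log of the joint probability). It works from raw counts, and how co-occurrence events are counted for gold labels within predicted clusters is left open here. The function signature is an assumption for illustration.

```python
import math

def npmi(joint_count, count_x, count_y, total):
    """Normalized PMI of two items from co-occurrence counts.

    joint_count: number of events where x and y occur together
    count_x, count_y: number of events where each item occurs
    total: total number of events
    """
    if joint_count == 0:
        return -1.0  # limit value when the items never co-occur
    p_xy = joint_count / total
    if p_xy == 1.0:
        return 1.0   # degenerate case: both items occur in every event
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)

independent = npmi(25, 50, 50, 100)    # P(x, y) = P(x) P(y)
always_together = npmi(25, 25, 25, 100)
never_together = npmi(0, 50, 50, 100)
```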


Diversity

We desire fine-grained ontologies with many meaningful classes. Number of attested classes may not be a good measure of this, since it could include classes with very few members and no broad meaning. So we propose diversity:

$$\exp\left(H\left(\mathbb{E}_x\, \hat{C}(x)\right)\right),$$

where $\hat{C}(x)$ is the predicted (hard-maximum) latent class of input $x$. This increases as the clustering becomes more fine-grained and evenly distributed, with a maximum of $N$, the number of latent classes, when $\hat{C}$ is uniform. More generally, exponentiated entropy is sometimes referred to as the perplexity of a distribution, and corresponds (softly) to the number of classes required for a uniform distribution of the same entropy. In that sense, it may be regarded as the effective number of classes in an ontology. We use the predicted class rather than its distribution because we care about the diversity of the model’s clustering, and not just uncertainty in the model.


Uncertainty

In order for our learned classes to be meaningful, we desire distinct and coherent clusters. To measure this, we propose uncertainty:

$$\mathbb{E}_x \exp\left(H\left(C(x)\right)\right),$$

where $C(x)$ is the model's distribution over latent classes for input $x$. This is also related to perplexity, but unlike diversity, it takes the expectation over the input after calculating the perplexity of the distribution. This reflects how many classes, on average, the model is confused between when provided with an input. Low values correspond to coherent clusters, with a minimum of 1 when every latent class is assigned with full confidence. As with diversity, we take the expectation over the evaluation set.
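
The difference between the two perplexity-style metrics is easiest to see side by side. The sketch below assumes diversity is the exponentiated entropy of the empirical distribution of hard class assignments, and uncertainty is the average exponentiated entropy of each input's soft class distribution, following the descriptions above. Function names and inputs are illustrative.

```python
import math
from collections import Counter

def entropy(dist):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def diversity(predicted_classes):
    """Perplexity of the empirical distribution of hard class assignments."""
    counts = Counter(predicted_classes)
    n = len(predicted_classes)
    return math.exp(entropy([c / n for c in counts.values()]))

def uncertainty(class_distributions):
    """Mean perplexity of each input's own distribution over classes."""
    perplexities = [math.exp(entropy(d)) for d in class_distributions]
    return sum(perplexities) / len(perplexities)

# Four inputs spread evenly over four classes: effective class count is 4.
even_spread = diversity([0, 1, 2, 3])

# All inputs in one class: effective class count is 1.
collapsed = diversity([5, 5, 5])

# Fully confident predictions give the minimum uncertainty of 1.
confident = uncertainty([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

# An input split evenly between two classes is "confused between" 2 classes.
split = uncertainty([[0.5, 0.5, 0.0]])
```

A model can therefore score high on diversity and low on uncertainty at once, which is the combination the regularized probes aim for.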

| Task | BERT-lex P / R / F1 | BERT-lex Div. | ELMo P / R / F1 | ELMo Div. | BERT-large P / R / F1 | BERT-large Div. | Gold Div. |
|---|---|---|---|---|---|---|---|
| Dependencies | .06 / .86 / .11 | 1.33 | .23 / .42 / .29 | 11.11 | .14 / .33 / .19 | 11.22 | 22.91 |
| Named Entities | .19 / .39 / .26 | 4.33 | .40 / .66 / .50 | 5.07 | .47 / .53 / .50 | 7.50 | 9.71 |
| Nonterminals | .22 / .80 / .34 | 1.47 | .36 / .25 / .30 | 10.16 | .35 / .34 / .35 | 7.80 | 7.15 |
| Semantic Roles | .19 / .39 / .26 | 2.81 | .40 / .17 / .24 | 22.35 | .37 / .17 / .24 | 18.70 | 8.73 |

Table 2: Results by task for three pretrained encoding methods. All probing models were trained with the LSL loss and the cluster regularization coefficients, and were chosen by the best-of-5 consistency criterion detailed in Section 4.4. Uncertainty for all models was close to 1 and is omitted for space.

4 Experimental Setup

We adopt a similar setup to Tenney et al. (2019) and Liu et al. (2019), training probing models over several contextualizing encoders on a variety of linguistic tasks. While our interest is in linguistic structure, our model can be used in any binary classification setting, and our analysis methods apply in any case that finer-grained labels are present to compare against.

4.1 Tasks

We cast several structure labeling tasks from Tenney et al. (2019) as binary classification by adding negative examples, bringing the positive to negative ratio to 1:1 where possible.

Named entity labeling requires labeling noun phrases with entity types, such as person, location, date, or time. We randomly sample non-entity noun phrases as negatives.

Nonterminal labeling requires labeling phrase structure constituents with syntactic types, such as noun phrases and verb phrases. We randomly sample non-constituent spans as negatives.

Syntactic dependency labeling requires labeling token pairs with their syntactic relationship, such as a subject, direct object, or modifier. We randomly sample non-attached token pairs as negatives.

Semantic role labeling requires labeling predicates (usually verbs) and their arguments (usually syntactic constituents) with labels that abstract over syntactic relationships in favor of more semantic notions such as agent, patient, modifier roles involving e.g. time and place, or predicate-specific roles. We draw the closest non-attached predicate-argument pairs as negatives.

We use the English Web Treebank portion of Universal Dependencies 2.2 (Silveira et al., 2014) for syntactic dependencies, and the English portion of Ontonotes 5.0 (Weischedel et al., 2013) for all other tasks.

4.2 Encoders

We run experiments on the following encoders:

ELMo encodes input tokens with 2-layer LSTMs (Hochreiter and Schmidhuber, 1997) run forward and backward over the text, trained with a language modeling objective (Peters et al., 2018a). We use the publicly available model trained on the One Billion Word Benchmark (Chelba et al., 2014).

BERT uses a deep Transformer stack (Vaswani et al., 2017) trained on masked language modeling and next sentence prediction tasks (Devlin et al., 2019). We use the 24-layer BERT-large instance (uncased_L-24_H-1024_A-16) trained on about 2.3B tokens from English Wikipedia and BooksCorpus (Zhu et al., 2015).

BERT-lex is a lexical baseline, encoding inputs with BERT-large’s context-independent wordpiece embedding layer.

4.3 Probing Model

We use the model architecture of Tenney et al. (2019), which classifies arbitrary spans or pairs of spans by leveraging pretrained encoders in the following way: 1) construct token representations by pooling across encoder layers with a learned scalar mix (Peters et al., 2018a), 2) construct span representations from these token representations using self-attentive pooling (Lee et al., 2017), and 3) concatenate those span representations and feed the result into a multi-layer perceptron to produce input features for the classification layer. This architecture allows for a unified model for all probing tasks and simplifies our experiments. For the classification layer, we use the LSL classifier (Section 3).
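
The first two pooling steps above can be sketched in a few lines. This is a simplified illustration, not the architecture's actual code. It assumes the scalar mix is a softmax-weighted sum of layer vectors scaled by a factor `gamma`, and it uses a single scoring vector `attn_vector` as a stand-in for the learned scoring network in self-attentive pooling. The multi-layer perceptron of step 3 is omitted, and all names are hypothetical.

```python
import math

def softmax(values):
    """Softmax over a list of scores, shifted by the max for stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layers, mix_logits, gamma=1.0):
    """Step 1: pool one token's per-layer vectors with softmax weights."""
    weights = softmax(mix_logits)
    dim = len(layers[0])
    return [gamma * sum(w * layer[k] for w, layer in zip(weights, layers))
            for k in range(dim)]

def attentive_pool(tokens, attn_vector):
    """Step 2: pool a span's token vectors with attention weights.

    Each token is scored by its dot product with attn_vector, and the
    scores are normalized with a softmax over the tokens in the span.
    """
    scores = [sum(a * t for a, t in zip(attn_vector, tok)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[k] for w, tok in zip(weights, tokens))
            for k in range(dim)]

def concat_spans(span_a, span_b):
    """Step 3 (first half): concatenate two span representations."""
    return span_a + span_b

# With all-zero mixing logits, the scalar mix is a plain average of layers.
mixed = scalar_mix([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])

# With an all-zero scoring vector, attentive pooling is a plain mean.
pooled = attentive_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])

pair = concat_spans(mixed, pooled)
```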

4.4 Model selection

We run initial studies to determine hidden layer sizes and regularization coefficients. For all LSL probes, we use the same number $N$ of latent classes. (Footnote 4: Preliminary experiments found similar results for larger $N$, with similar diversity in the full setting.)

Probe capacity

Hewitt and Liang (2019) suggest that results with expressive probes may reflect the probe’s learning capacity rather than structure encoded in the inputs. To mitigate this, we follow their advice and use a single hidden layer with the smallest dimensionality that does not sacrifice performance. For each task, we train binary logistic regression probes with a range of hidden sizes and select the smallest yielding at least 97% of the best model’s performance. Details are in Appendix A.
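
The selection rule for the hidden size can be written as a short sketch: keep every size whose score reaches 97% of the best score, then take the smallest. The scores below are made-up numbers for illustration and do not come from the paper's appendix.

```python
def select_hidden_size(scores, threshold=0.97):
    """Smallest hidden size scoring at least `threshold` times the best.

    scores: mapping from hidden size to a performance score where
    higher is better (for example, development-set accuracy).
    """
    best = max(scores.values())
    eligible = [size for size, score in scores.items()
                if score >= threshold * best]
    return min(eligible)

# Hypothetical scores for four candidate hidden sizes.
example_scores = {16: 0.80, 64: 0.90, 256: 0.93, 1024: 0.94}
chosen = select_hidden_size(example_scores)
```

In this toy case the largest probe is best, but the 256-unit probe is within 3% of it and is therefore preferred.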

Mitigating variance

To mitigate variance across random restarts, we use a consistency-based model selection criterion: train 5 separate models, compute their pairwise B3 F1 scores, and choose the model with the highest average F1 score.
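
A minimal sketch of this consistency criterion follows. It assumes each run is summarized by its hard cluster assignments on a shared evaluation set, and that agreement between two runs is the B3 F1 of one clustering scored against the other (which is symmetric, since precision in one direction is recall in the other). The helper names and the three toy runs are illustrative, and the same code applies unchanged to five runs.

```python
from collections import Counter
from itertools import combinations

def b_cubed_f1(a, b):
    """B3 F1 between two clusterings given as parallel label lists."""
    n = len(a)
    a_sizes, b_sizes = Counter(a), Counter(b)
    pair_sizes = Counter(zip(a, b))
    precision = sum(pair_sizes[(x, y)] / a_sizes[x] for x, y in zip(a, b)) / n
    recall = sum(pair_sizes[(x, y)] / b_sizes[y] for x, y in zip(a, b)) / n
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0

def select_most_consistent(runs):
    """Index of the run with the highest mean pairwise B3 F1 to the others."""
    totals = [0.0] * len(runs)
    for i, j in combinations(range(len(runs)), 2):
        f1 = b_cubed_f1(runs[i], runs[j])
        totals[i] += f1
        totals[j] += f1
    # Every run is compared with the same number of others, so ranking by
    # the summed F1 is the same as ranking by the average.
    return max(range(len(runs)), key=totals.__getitem__)

# Runs 0 and 1 agree up to renaming of cluster ids; run 2 is an outlier.
toy_runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]
selected = select_most_consistent(toy_runs)
```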

Regularization coefficients

We run preliminary experiments using BERT-large on Universal Dependencies and Named Entity Labeling with ablations on our clustering regularizers. For each ablation, we choose the hyperparameter setting which yields the best F1 against gold.


Results, shown in Table 1, validate our intuitions about the clustering regularizers. The batch-level entropy loss drives up both diversity and uncertainty, while the instance-level entropy loss drives them down. In combination, however, they produce the right balance, with uncertainty close to 1 while retaining diversity.

Notably, the Named Entity labeling model has lower diversity without the instance-level loss than with it. Intuitively, this may happen because the batch-level entropy can be increased by driving up instance-level entropy, without changing the entropy of the expected distribution of predictions. So by keeping the uncertainty down on each input, the instance-level entropy loss helps the batch-level entropy loss promote diversity in the induced ontology.

Based on these results, we fix the coefficients on the batch-level and instance-level entropy losses for the main experiments.

5 Results and Analysis

We train and evaluate our final probing model on all combinations of task and encoder described in Section 4. Aggregate results are shown in Table 2. (Footnote 5: Results for more tasks and encoders are in Appendix B.) Taking all metrics into account, contextualized encodings produce richer ontologies that agree more with gold than the lexical baseline does. In fact, BERT-lex has normalized PMI scores very close to zero across the board, encoding virtually no information about gold categories. For this reason, we omit it from the rest of the analysis.

| Gold Label | BERT F1 | ELMo F1 |
|---|---|---|
| DATE | .70 | .38 |
| PERCENT | .60 | .28 |
| ORG | .54 | .35 |
| PERSON | .48 | .81 |
| EVENT | .03 | .02 |
| LAW | .02 | .01 |
| LANGUAGE | .01 | .01 |

Table 3: Label-wise B3 F1 scores for Named Entities, sorted by decreasing BERT-large F1. Induced ontologies capture some labels surprisingly well, but are indifferent to more specialized categories which may require more world knowledge to distinguish.

It may be surprising that our induced ontologies have any relationship at all to gold classes, since the only extra supervision is in binary classification that collapses them together. Indeed, many tasks addressed here have multiple human-written ontologies, as discussed in Section 2. In our case, we let the model choose its own ontology. The resulting matches and mismatches with human-labeled ontologies provide a new lens with which to analyze both pretrained encoders and linguistic ontologies.

Named entities

As shown in Table 3, neither BERT nor ELMo is sensitive to categories that are related to specialized world knowledge, such as languages, laws, and events. However, they are in tune with other types: ELMo discovers a clear PERSON category, whereas BERT has distinguished DATEs. Visualization of the clusters (Figure 2) corroborates this, and further shows that the models have a sense of scalar values and measurement; indeed, instead of the gold distinction between ORDINAL and CARDINAL numbers, both models distinguish between small and large (roughly, seven or greater) numbers. See Appendix C for detailed nPMI scores.

| Gold Label | P / R / F1 |
|---|---|
| ARGM-MOD | .62 / .41 / .49 |
| ARG0 | .52 / .17 / .26 |
| ARG1 | .50 / .09 / .15 |
| ARGM-NEG | .36 / .60 / .45 |
| ARG2 | .28 / .13 / .18 |

Table 4: Top semantic role labels by BERT-large B3 precision. Core arguments ARG0–2 are most preferentially split, with high precision but low recall.


Nonterminals

Patterns in nPMI (Figure 3(a)) suggest basic syntactic notions: complete clauses (S, TOP, SINV) form a group, as do phrase types which take subjects (SBAR, VP, PP), and wh-phrases (WHADVP, WHPP, WHNP).


Syntactic dependencies

Patterns in nPMI (Figure 3(b)) indicate several salient groups: verb arguments (nsubj, obj, obl, xcomp), left-heads (det, nmod:poss, compound, amod, case), right-heads (acl, acl:relcl, nmod), and punct. (Footnote 6: nmod is often the object in a prepositional phrase modifying a noun.)

Semantic roles

Patterns in nPMI (Figure 3(c)) roughly match intuition: primary core arguments (ARG0, ARG1) are distinguished, as well as modals (ARGM-MOD) and negation (ARGM-NEG), while trailing arguments (ARG2–5) and modifiers (ARGM-TMP, LOC, etc.) form a large group. On one hand, this reflects surface patterns: primary core arguments tend to be close to the verb, with ARG0 on the left and ARG1 on the right; trailing arguments and modifiers tend to be prepositional phrases or subordinate clauses; and modals and negation are identified by lexical and positional cues. On the other hand, this also reflects error patterns in state-of-the-art systems, where label errors can sometimes be traced to ontological choices in PropBank, which distinguish between arguments and adjuncts that have very similar meaning (He et al., 2017; Kingsbury et al., 2002).

While the number of induced classes roughly matches gold for most tasks, induced ontologies for semantic roles are considerably more diverse (Table 2). Among high-precision labels (Table 4), core arguments ARG0–2 are split apart most by the model. This follows intuition for PropBank core argument labels, which have predicate-specific meanings. Other approaches based on Frame Semantics (Baker et al., 1998; Fillmore and others, 2006), Proto-Roles (Dowty, 1991; Reisinger et al., 2015), or Levin classes (Levin, 1993; Schuler, 2005) have more explicit fine-grained roles. Comparison with these frameworks and investigation of learned clusters could be informative for future work on ontology design or unsupervised learning.

(a) Nonterminals.
(b) Universal dependencies.
(c) Semantic roles.
Figure 3: Pairwise gold label nPMIs on selected categories for ontologies induced from BERT-large on selected tasks. Blue is positive nPMI, representing that gold labels are preferentially grouped together; Red is negative nPMI, representing that gold labels are preferentially separated. Counts are summed over all 5 runs to better reflect the underlying representations, though variance was low and our observed trends hold across all runs.

6 Discussion

Our exploration of latent ontologies has yielded some surprising results: ELMo knows people, BERT knows dates, and both sense scalar and measurable values, while distinguishing between small and large numbers. Both models preferentially split core semantic roles into many fine-grained categories, and seem to encode broad notions of syntactic and semantic structure. These findings contrast with those from fully-supervised probes, which produce strong agreement with existing annotations (Tenney et al., 2019) but can also report false positives by fitting to weak patterns in large feature spaces (Zhang and Bowman, 2018; Voita and Titov, 2020). Instead, agreement of latent categories with known concepts can be taken as strong evidence that these concepts are present as important, salient features in the representation space.

This issue is particularly important when looking for deep, inherent understanding of linguistic structure, which by nature must generalize. For supervised systems, generalization is often measured by out-of-distribution objectives like out-of-domain performance (Ganin et al., 2016), transferability (Wang et al., 2018), or robustness to adversarial inputs (Jia and Liang, 2017). Recent work also advocates for counterfactual learning and evaluation (Qin et al., 2019; Kaushik et al., 2020) to mitigate confounds, or contrastive evaluation sets (Gardner et al., 2020) to rigorously test local decision boundaries. Overall, these techniques target discrepancies between salient features in a model and causal relationships in a task. In this work, we extract such features directly and investigate them by comparing induced and gold ontologies. This identifies some very strong cases of transferability from the binary detection task to detection tasks over gold subcategories, such as ELMo’s people and BERT’s dates (Table 3). Future work may investigate cross-task ontology matching to identify further cases of transferable features, or perhaps the emergence of categories signifying pipelined reasoning (Tenney et al., 2019), surface patterns, or new, perhaps unexpected distinctions which can appear when going beyond existing schemas (Michael et al., 2018).

Our results point to a general paradigm of probing with latent variables, for which LSL is just one potential technique. We have only scratched the surface of what may emerge with such methods: while our probing test is high specificity, it is low power; plenty of extant latent structure may still be missed. LSL probing may produce different ontologies due to many factors, such as tokenization (Singh et al., 2019), encoder architecture (Peters et al., 2018b), probe architecture (Hewitt and Manning, 2019), data distribution (Gururangan et al., 2018), pretraining task (Liu et al., 2019; Wang et al., 2019), or pretraining checkpoint. Any of these factors may be at work in the differences we observe between ELMo and BERT: for example, BERT’s tokenization method may not as readily induce personhood features due to splitting of rare words (like names) in byte-pair encoding. Furthermore, concurrent work (Chi et al., 2020) has already found qualitative evidence of syntactic dependency types emergent in the special case of multilingual structural probes (Hewitt and Manning, 2019). With LSL, we provide a method that can be adapted to a variety of probing settings to both quantify and qualify this kind of structure.

7 Conclusion

We introduced a new classifier and model analysis method based on latent subclass learning: by factoring a binary classifier through a forced choice of latent subclasses, latent ontologies can be coaxed out of input features. Using this approach, we found that encoders such as BERT and ELMo hold stable, consistent latent ontologies on a variety of linguistic tasks. In these ontologies, we found clear connections to existing categories, such as personhood of named entities. We also found evidence of ontological distinctions beyond traditional gold categories, such as distinguishing large and small numbers, or preferring fine-grained semantic roles for core arguments. With latent subclass learning, we have shown a general technique to uncover some of these features discretely, providing a starting point for descriptive analysis of our models’ latent ontologies. Future work may include investigating how LSL results vary with probe architecture, developing intrinsic quality measures on latent ontologies, or applying the technique to discover new patterns in settings where gold annotations are not present.
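As a rough illustration of what "factoring a binary classifier through a forced choice of latent subclasses" can look like, the sketch below assumes a softmax over N latent subclass logits plus one null class, with the binary positive probability taken as the total mass on the latent subclasses. The function names, the fixed null logit, and the plain negative log-likelihood are assumptions made for illustration; any additional loss terms used to encourage a forced choice are not shown.

```python
import math


def lsl_probabilities(subclass_logits, null_logit=0.0):
    """Softmax over N latent subclasses plus one null class (assumed setup).

    Returns (p_positive, subclass_probs), where p_positive is the total
    probability mass assigned to the latent subclasses.
    """
    logits = list(subclass_logits) + [null_logit]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    subclass_probs = probs[:-1]  # last entry is the null class
    return sum(subclass_probs), subclass_probs


def binary_nll(p_positive, label):
    """Negative log-likelihood of the binary label (1 = positive, 0 = negative)."""
    p = p_positive if label == 1 else 1.0 - p_positive
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)


def induced_subclass(subclass_probs):
    """Index of the most probable latent subclass (first index wins ties)."""
    return max(range(len(subclass_probs)), key=subclass_probs.__getitem__)


if __name__ == "__main__":
    # Hypothetical logits for a single input with three latent subclasses.
    p_pos, sub_probs = lsl_probabilities([2.0, 0.0, -1.0])
    print(p_pos, induced_subclass(sub_probs), binary_nll(p_pos, 1))
```

Under these assumptions, only the binary label supervises training, while the argmax over subclass probabilities gives each positive input a discrete latent category that can then be compared against gold labels.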


Acknowledgments

We would like to thank Tim Dozat, Kenton Lee, Emily Pitler, Kellie Webster, other members of the Google AI Language team, and Sewon Min, who all provided valuable feedback on this paper. We also thank Rafael Müller, Simon Kornblith, and Geoffrey Hinton for discussion on the LSL classifier.


References

  • S. Abnar, L. Beinborn, R. Choenni, and W. Zuidema (2019) Blackbox meets blackbox: representational similarity and stability analysis of neural language models and brains. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 191–203. External Links: Document Cited by: §2.
  • Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, and Y. Goldberg (2017) Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12 (4), pp. 461–486. Cited by: §3.2.
  • A. Bagga and B. Baldwin (1998) Entity-based cross-document coreferencing using the vector space model. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, Montreal, Quebec, Canada, pp. 79–85. External Links: Link, Document Cited by: §3.2.
  • C. F. Baker, C. J. Fillmore, and J. B. Lowe (1998) The Berkeley FrameNet project. In Proceedings of the 17th international conference on Computational linguistics-Volume 1, pp. 86–90. Cited by: §5.
  • M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni (2007) Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI’07, San Francisco, CA, USA, pp. 2670–2676. External Links: Link Cited by: §2.
  • Y. Belinkov, N. Durrani, F. Dalvi, H. Sajjad, and J. Glass (2017a) What do neural machine translation models learn about morphology? In Proceedings of EMNLP, Cited by: §1.
  • Y. Belinkov and J. Glass (2019) Analysis methods in neural language processing: a survey. Transactions of the Association for Computational Linguistics (TACL) 7, pp. 49–72. External Links: Document Cited by: §1.
  • Y. Belinkov, L. Màrquez, H. Sajjad, N. Durrani, F. Dalvi, and J. Glass (2017b) Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of IJCNLP, Cited by: §2.
  • T. Blevins, O. Levy, and L. Zettlemoyer (2018) Deep RNNs encode soft hierarchical syntax. In Proceedings of ACL, Cited by: §1, §2.
  • G. Bouma (2009) Normalized (pointwise) mutual information in collocation extraction. In GSCL, Cited by: §3.2.
  • C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson (2014) One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of Interspeech, Cited by: §4.2.
  • M. Chen, Z. Chu, and K. Gimpel (2019) Evaluation benchmarks and learning criteria for discourse-aware sentence representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 649–662. External Links: Link, Document Cited by: §2.
  • E. A. Chi, J. Hewitt, and C. D. Manning (2020) Finding universal grammatical relations in multilingual bert. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Cited by: §6.
  • E. Choi, O. Levy, Y. Choi, and L. Zettlemoyer (2018) Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 87–96. External Links: Link, Document Cited by: §2.
  • G. Chrupała and A. Alishahi (2019) Correlating neural and symbolic representations of language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2952–2962. External Links: Link, Document Cited by: §2.
  • K. W. Church and P. Hanks (1990) Word association norms, mutual information, and lexicography. Computational Linguistics 16 (1), pp. 22–29. External Links: Link Cited by: §3.2.
  • K. Clark, U. Khandelwal, O. Levy, and C. D. Manning (2019) What does bert look at? an analysis of bert’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Cited by: §2, §2.
  • A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018) What you can cram into a single $&#* vector: probing sentence embeddings for linguistic properties. In Proceedings of ACL, Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §1, §4.2.
  • D. Dowty (1991) Thematic proto-roles and argument selection. Language 67, pp. 547–619. Cited by: §2, §5.
  • C. J. Fillmore et al. (2006) Frame semantics. Cognitive linguistics: Basic readings 34, pp. 373–400. Cited by: §5.
  • N. Fitzgerald, J. Michael, L. He, and L. S. Zettlemoyer (2018) Large-scale QA-SRL parsing. In ACL, Cited by: §2.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §6.
  • M. Gardner, Y. Artzi, V. Basmova, J. Berant, B. Bogin, S. Chen, P. Dasigi, D. Dua, Y. Elazar, A. Gottumukkala, N. Gupta, H. Hajishirzi, G. Ilharco, D. Khashabi, K. Lin, J. Liu, N. F. Liu, P. Mulcaire, Q. Ning, S. Singh, N. A. Smith, S. Subramanian, R. Tsarfaty, E. Wallace, A. Zhang, and B. Zhou (2020) Evaluating nlp models via contrast sets. External Links: 2004.02709 Cited by: §6.
  • Y. Goldberg (2019) Assessing bert’s syntactic abilities. arXiv preprint arXiv:1901.05287. Cited by: §2.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of NAACL, Vol. 2, pp. 107–112. Cited by: §6.
  • L. He, K. Lee, M. Lewis, and L. Zettlemoyer (2017) Deep semantic role labeling: what works and what’s next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 473–483. External Links: Link, Document Cited by: §5.
  • L. He, M. Lewis, and L. S. Zettlemoyer (2015) Question-answer driven semantic role labeling: using natural language to annotate natural language. In EMNLP, Cited by: §2.
  • J. Hewitt and P. Liang (2019) Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2733–2743. External Links: Link, Document Cited by: §2, §4.4.
  • J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4129–4138. External Links: Link, Document Cited by: §2, §6.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.2.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2021–2031. External Links: Link, Document Cited by: §6.
  • D. Kaushik, E. Hovy, and Z. Lipton (2020) Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations, External Links: Link Cited by: §6.
  • P. Kingsbury, M. Palmer, and M. Marcus (2002) Adding semantic annotation to the penn treebank. In Proceedings of the Human Language Technology Conference, Cited by: §5.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: a lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, External Links: Link Cited by: Appendix B, §1.
  • K. Lee, L. He, M. Lewis, and L. Zettlemoyer (2017) End-to-end neural coreference resolution. In Proceedings of EMNLP, Cited by: §4.3.
  • B. Levin (1993) English verb classes and alternations: a preliminary investigation. University of Chicago Press. External Links: ISBN 9780226475332, LCCN lc92042504, Link Cited by: §5.
  • X. Ling and D. S. Weld (2012) Fine-grained entity recognition. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI’12, pp. 94–100. External Links: Link Cited by: §2.
  • N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019) Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1073–1094. External Links: Link, Document Cited by: §1, §2, §4, §6.
  • X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4487–4496. External Links: Link, Document Cited by: §2.
  • M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini (1993) Building a large annotated corpus of english: the penn treebank. Computational linguistics 19 (2), pp. 313–330. Cited by: §2.
  • R. Marvin and T. Linzen (2018) Targeted syntactic evaluation of language models. In Proceedings of EMNLP, Cited by: §1, §2.
  • L. McInnes, J. Healy, and J. Melville (2018) Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. Cited by: Figure 2.
  • J. Michael, G. Stanovsky, L. He, I. Dagan, and L. S. Zettlemoyer (2018) Crowdsourcing question-answer meaning representations. In NAACL-HLT, Cited by: §2, §6.
  • R. Müller, S. Kornblith, and G. Hinton (2020) Subclass distillation. External Links: 2002.03936 Cited by: footnote 1.
  • J. Nivre, Ž. Agić, M. J. Aranzabe, M. Asahara, A. Atutxa, M. Ballesteros, J. Bauer, K. Bengoetxea, R. A. Bhat, C. Bosco, S. Bowman, G. G. A. Celano, M. Connor, M. de Marneffe, A. Diaz de Ilarraza, K. Dobrovoljc, T. Dozat, T. Erjavec, R. Farkas, J. Foster, D. Galbraith, F. Ginter, I. Goenaga, K. Gojenola, Y. Goldberg, B. Gonzales, B. Guillaume, J. Hajič, D. Haug, R. Ion, E. Irimia, A. Johannsen, H. Kanayama, J. Kanerva, S. Krek, V. Laippala, A. Lenci, N. Ljubešić, T. Lynn, C. Manning, C. Mărănduc, D. Mareček, H. Martínez Alonso, J. Mašek, Y. Matsumoto, R. McDonald, A. Missilä, V. Mititelu, Y. Miyao, S. Montemagni, S. Mori, H. Nurmi, P. Osenova, L. Øvrelid, E. Pascual, M. Passarotti, C. Perez, S. Petrov, J. Piitulainen, B. Plank, M. Popel, P. Prokopidis, S. Pyysalo, L. Ramasamy, R. Rosa, S. Saleh, S. Schuster, W. Seeker, M. Seraji, N. Silveira, M. Simi, R. Simionescu, K. Simkó, K. Simov, A. Smith, J. Štěpánek, A. Suhr, Z. Szántó, T. Tanaka, R. Tsarfaty, S. Uematsu, L. Uria, V. Varga, V. Vincze, Z. Žabokrtský, D. Zeman, and H. Zhu (2015) Universal dependencies 1.2. Note: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University External Links: Link Cited by: Appendix B, §2.
  • M. Palmer, D. Gildea, and P. Kingsbury (2005) The proposition bank: an annotated corpus of semantic roles. Computational linguistics 31 (1), pp. 71–106. Cited by: §2.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018a) Deep contextualized word representations. In Proceedings of NAACL, Cited by: §1, §1, §4.2, §4.3.
  • M. Peters, M. Neumann, L. Zettlemoyer, and W. Yih (2018b) Dissecting contextual word embeddings: architecture and representation. In Proceedings of EMNLP, Cited by: §2, §6.
  • T. Pimentel, J. Valvoda, R. H. Maudslay, R. Zmigrod, A. Williams, and R. Cotterell (2020) Information-theoretic probing for linguistic structure. In Proceedings of the 58th annual meeting of the Association for Computational Linguistics, Cited by: §2.
  • S. S. Pradhan, E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel (2007) Ontonotes: a unified relational semantic representation. International Journal of Semantic Computing 1 (04), pp. 405–419. Cited by: Appendix B, §2.
  • L. Qin, A. Bosselut, A. Holtzman, C. Bhagavatula, E. Clark, and Y. Choi (2019) Counterfactual story reasoning and generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5043–5053. External Links: Link, Document Cited by: §6.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §1.
  • E. Reif, A. Yuan, M. Wattenberg, F. B. Viegas, A. Coenen, A. Pearce, and B. Kim (2019) Visualizing and measuring the geometry of bert. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8592–8600. External Links: Link Cited by: §1, §1, §2, §2.
  • D. Reisinger, R. Rudinger, F. Ferraro, C. Harman, K. Rawlins, and B. Van Durme (2015) Semantic proto-roles. Transactions of the Association of Computational Linguistics. Cited by: §2, §5.
  • N. Saphra and A. Lopez (2019) Understanding learning dynamics of language models with SVCCA. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3257–3267. External Links: Link, Document Cited by: §1, §2, §2.
  • K. K. Schuler (2005) VerbNet: a broad-coverage, comprehensive verb lexicon. Ph.D. Thesis, University of Pennsylvania. Cited by: §5.
  • N. Silveira, T. Dozat, M. de Marneffe, S. Bowman, M. Connor, J. Bauer, and C. D. Manning (2014) A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, Cited by: §4.1.
  • J. Singh, B. McCann, R. Socher, and C. Xiong (2019) BERT is not an interlingua and the bias of tokenization. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), Hong Kong, China, pp. 47–55. External Links: Link, Document Cited by: §1, §2, §6.
  • D. Smilkov, N. Thorat, C. Nicholson, E. Reif, F. B. Viégas, and M. Wattenberg (2016) Embedding projector: interactive visualization and interpretation of embeddings. In NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems, Cited by: Figure 2.
  • A. Talmor, Y. Elazar, Y. Goldberg, and J. Berant (2019) OLMpics – on what language model pre-training captures. External Links: 1912.13283 Cited by: §2.
  • I. Tenney, D. Das, and E. Pavlick (2019) BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4593–4601. External Links: Link, Document Cited by: §2, §6.
  • I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. V. Durme, S. Bowman, D. Das, and E. Pavlick (2019) What do you learn from context? probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, Cited by: §1, §1, §2, §2, §4.1, §4.3, §4, §6.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of NIPS, Cited by: §4.2.
  • E. Voita, R. Sennrich, and I. Titov (2019) The bottom-up evolution of representations in the transformer: a study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4387–4397. Cited by: §1, §2.
  • E. Voita and I. Titov (2020) Information-theoretic probing with minimum description length. External Links: 2003.12298 Cited by: §2, §6.
  • A. Wang, J. Hula, P. Xia, R. Pappagari, R. T. McCoy, R. Patel, N. Kim, I. Tenney, Y. Huang, K. Yu, S. Jin, B. Chen, B. Van Durme, E. Grave, E. Pavlick, and S. R. Bowman (2019) Can you tell me how to get past sesame street? sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4465–4476. External Links: Link, Document Cited by: §6.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) SuperGLUE: a multi-task benchmark and analysis platform for natural language understanding. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 3261–3275. External Links: Link Cited by: §1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. Cited by: §1, §6.
  • R. Weischedel, M. Palmer, M. Marcus, E. Hovy, S. Pradhan, L. Ramshaw, N. Xue, A. Taylor, J. Kaufman, M. Franchini, et al. (2013) OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA. Cited by: §4.1.
  • K. Zhang and S. Bowman (2018) Language modeling teaches you more than translation does: lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 359–361. Cited by: §2, §6.
  • Y. Zhang, V. Zhong, D. Chen, G. Angeli, and C. D. Manning (2017) Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pp. 35–45. External Links: Link Cited by: Appendix B.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27. Cited by: §4.2.

Appendix A Probe capacity tuning

Results from hidden size tuning experiments are shown in Figure 4.

Figure 4: Performance on hidden size tuning experiments for different tasks. Clockwise from top-left, they are nonterminals, named entities, semantic roles, and syntactic dependencies. coarse (red) is binary accuracy of a binary classifier, fine-binary (blue) is binary accuracy of a full multiclass classifier, and fine-full (green) is the full multiclass accuracy of the multiclass classifier. The black vertical line is the smallest hidden size that passes the 97% performance threshold for coarse.
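The selection rule marked by the black vertical line can be sketched as follows. This is a minimal illustration that assumes the 97% threshold is taken relative to the best coarse accuracy observed across hidden sizes; the function name and the accuracy values are hypothetical and are not taken from Figure 4.

```python
def smallest_passing_hidden_size(accuracy_by_size, threshold=0.97):
    """Return the smallest hidden size whose accuracy reaches
    `threshold` times the best accuracy in the sweep.

    Assumption: the threshold is relative to the best observed accuracy,
    not an absolute accuracy value.
    """
    best = max(accuracy_by_size.values())
    for size in sorted(accuracy_by_size):
        if accuracy_by_size[size] >= threshold * best:
            return size
    return None  # unreachable for non-empty input when threshold <= 1


if __name__ == "__main__":
    # Hypothetical coarse accuracies from a hidden-size sweep.
    sweep = {4: 0.80, 16: 0.93, 64: 0.97, 256: 0.99, 1024: 0.985}
    print(smallest_passing_hidden_size(sweep))
```

Choosing the smallest passing size keeps probe capacity as low as possible while still solving the coarse task, which is the usual motivation for this kind of tuning.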

Appendix B Full Experimental Results

We ran on several additional encoders and tasks. Extra tasks include undirected Universal Dependencies (Nivre et al., 2015), TAC relation classification (Zhang et al., 2017), and coreference on OntoNotes (Pradhan et al., 2007). Extra encoders include BERT-base, mBERT, and ALBERT (Lan et al., 2020). Full results are shown in Tables 5–11.

Appendix C More Analysis Results

We show expanded comparative nPMI plots in Figure 5 and Figure 6. These use co-occurrence counts summed over 5 runs, and exhibit the same overall trends as each run.
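For readers who want to reproduce this kind of plot, the sketch below sums co-occurrence counts over runs and then computes normalized pointwise mutual information (Bouma, 2009) for each label pair. It assumes counts are stored as dictionaries keyed by ordered label pairs; how a co-occurrence is counted in the first place is not specified here, and the function names and example counts are hypothetical.

```python
import math


def sum_counts(runs):
    """Sum per-run co-occurrence counts keyed by (label_x, label_y)."""
    total = {}
    for counts in runs:
        for pair, n in counts.items():
            total[pair] = total.get(pair, 0) + n
    return total


def npmi_table(joint_counts):
    """nPMI(x, y) = log(p(x,y) / (p(x) p(y))) / -log p(x,y) for each pair."""
    total = sum(joint_counts.values())
    px, py = {}, {}
    for (x, y), n in joint_counts.items():
        px[x] = px.get(x, 0) + n
        py[y] = py.get(y, 0) + n
    result = {}
    for (x, y), n in joint_counts.items():
        if n == 0:
            result[(x, y)] = -1.0  # limit value when the pair never co-occurs
            continue
        p_xy = n / total
        if p_xy == 1.0:
            result[(x, y)] = 0.0  # single-cell table: nPMI is 0/0, report 0.0
            continue
        pmi = math.log(p_xy / ((px[x] / total) * (py[y] / total)))
        result[(x, y)] = pmi / -math.log(p_xy)
    return result


if __name__ == "__main__":
    # Hypothetical counts from two runs over two gold labels.
    run_a = {("PERSON", "PERSON"): 2, ("DATE", "DATE"): 3}
    run_b = {("PERSON", "PERSON"): 2, ("DATE", "DATE"): 3}
    print(npmi_table(sum_counts([run_a, run_b])))
```

Summing counts before normalizing, rather than averaging per-run nPMI values, gives runs with more observations proportionally more weight.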

Encoder           P     R     F1    Acc.  Diversity  Uncertainty
Gold              1.00  1.00  1.00  1.00  9.71       1.00
ELMo              0.40  0.66  0.50  0.83  5.07       1.08
BERT-base         0.43  0.57  0.49  0.88  6.09       1.11
BERT-large        0.47  0.53  0.50  0.86  7.50       1.10
mBERT             0.25  0.67  0.37  0.84  3.29       1.06
ALBERT-large      0.38  0.53  0.44  0.89  6.00       1.15
BERT-large (lex)  0.19  0.39  0.26  0.74  4.33       1.13
Table 5: Results by encoder for OntoNotes named entity labeling.

Encoder           P     R     F1    Acc.  Diversity  Uncertainty
Gold              1.00  1.00  1.00  1.00  7.15       1.00
ELMo              0.36  0.25  0.30  0.58  10.16      1.12
BERT-base         0.36  0.41  0.38  0.60  5.76       1.06
BERT-large        0.35  0.34  0.35  0.61  7.80       1.06
mBERT             0.36  0.34  0.35  0.59  7.38       1.06
ALBERT-large      0.38  0.28  0.32  0.59  9.07       1.08
BERT-large (lex)  0.22  0.80  0.34  0.50  1.47       1.26
Table 6: Results by encoder for OntoNotes nonterminal labeling.

Encoder           P     R     F1    Acc.  Diversity  Uncertainty
Gold              1.00  1.00  1.00  1.00  22.91      1.00
ELMo              0.23  0.42  0.29  0.67  11.11      1.22
BERT-base         0.13  0.34  0.19  0.76  9.69       1.23
BERT-large        0.14  0.33  0.19  0.77  11.22      1.23
mBERT             0.27  0.51  0.35  0.73  9.40       1.22
ALBERT-large      0.23  0.41  0.29  0.72  9.84       1.20
BERT-large (lex)  0.06  0.86  0.11  0.50  1.33       1.02
Table 7: Results by encoder for Universal Dependency labeling.

Encoder           P     R     F1    Acc.  Diversity  Uncertainty
Gold              1.00  1.00  1.00  1.00  22.91      1.00
ELMo              0.19  0.23  0.21  0.71  19.12      1.14
BERT-base         0.27  0.24  0.25  0.85  22.79      1.20
BERT-large        0.23  0.23  0.23  0.82  18.51      1.17
mBERT             0.24  0.20  0.21  0.83  20.31      1.19
ALBERT-large      0.30  0.27  0.28  0.81  20.53      1.14
BERT-large (lex)  0.09  0.54  0.16  0.50  3.39       1.00
Table 8: Results by encoder for undirected Universal Dependency labeling.

Encoder           P     R     F1    Acc.  Diversity  Uncertainty
Gold              1.00  1.00  1.00  1.00  8.73       1.00
ELMo              0.40  0.17  0.24  0.76  22.35      1.08
BERT-base         0.39  0.18  0.25  0.86  21.95      1.15
BERT-large        0.37  0.17  0.24  0.88  18.70      1.15
mBERT             0.41  0.21  0.28  0.88  19.05      1.12
ALBERT-large      0.43  0.21  0.28  0.87  19.90      1.12
BERT-large (lex)  0.19  0.39  0.26  0.46  2.81       1.01
Table 9: Results by encoder for OntoNotes semantic role labeling.

Encoder           P     R     F1    Acc.  Diversity  Uncertainty
Gold              1.00  1.00  1.00  1.00  1.00       1.00
ELMo              1.00  0.09  0.16  0.80  14.22      1.18
BERT-base         1.00  0.09  0.16  0.86  14.67      1.24
BERT-large        1.00  0.09  0.17  0.87  15.57      1.27
mBERT             1.00  0.09  0.16  0.83  13.86      1.24
ALBERT-large      1.00  0.09  0.16  0.86  13.56      1.26
BERT-large (lex)  1.00  0.78  0.87  0.78  1.60       1.03
Table 10: Results by encoder for OntoNotes coreference.

Encoder           P     R     F1    Acc.  Diversity  Uncertainty
Gold              1.00  1.00  1.00  1.00  24.78      1.00
ELMo              0.11  0.78  0.20  0.77  2.38       1.05
BERT-base         0.11  0.90  0.20  0.76  1.94       1.05
BERT-large        0.16  0.63  0.25  0.80  3.87       1.11
mBERT             0.15  0.87  0.26  0.76  2.21       1.05
BERT-large (lex)  0.07  0.97  0.13  0.76  1.11       1.02
Table 11: Results by encoder for TAC relation classification.
(a) Pairwise nPMIs for selected named entity classes in ontologies induced on BERT-large (left) and ELMo (right).
(b) Pairwise nPMIs for selected nonterminal classes in ontologies induced on BERT-large (left) and ELMo (right).
Figure 5: Pairwise nPMI charts for named entities and nonterminals.
(a) Pairwise nPMIs for selected universal dependency labels in ontologies induced on BERT-large (left) and ELMo (right).
(b) Pairwise nPMIs for selected semantic roles in ontologies induced on BERT-large (left) and ELMo (right).
Figure 6: Pairwise nPMI charts for syntactic dependencies and semantic roles.