Quantifying the Contextualization of Word Representations with Semantic Class Probing

04/25/2020 ∙ by Mengjie Zhao, et al. ∙ Universität München

Pretrained language models have achieved a new state of the art on many NLP tasks, but there are still many open questions about how and why they work so well. We investigate the contextualization of words in BERT. We quantify the amount of contextualization, i.e., how well words are interpreted in context, by studying the extent to which semantic classes of a word can be inferred from its contextualized embeddings. Quantifying contextualization helps in understanding and utilizing pretrained language models. We show that top layer representations achieve high accuracy inferring semantic classes; that the strongest contextualization effects occur in the lower layers; that local context is mostly sufficient for semantic class inference; and that top layer representations are more task-specific after finetuning while lower layer representations are more transferable. Finetuning uncovers task related features, but pretrained knowledge is still largely preserved.




1 Introduction

Pretrained language models like ELMo (Peters et al., 2018a), BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) are top performers in NLP because they learn contextualized representations, i.e., representations that reflect the interpretation of a word in context as opposed to its general meaning, which is less helpful in solving NLP tasks.

While this qualitative claim, that pretrained language models contextualize words, is uncontroversial, there has been little work on quantifying contextualization, i.e., the extent to which a word can be interpreted in context.

We use BERT (Devlin et al., 2019) as our pretrained language model and quantify contextualization by investigating how well BERT infers semantic classes (s-classes) of a word in context, e.g., the s-class organization for “Apple” in “Apple stock rises” vs. the s-class food in “Apple juice is healthy”. We use s-class inference as a proxy for contextualization since accurate s-class inference reflects a successful contextualization of a word – an effective interpretation of the word in context.

We adopt the methodology of probing (Adi et al., 2016; Shi et al., 2016; Belinkov et al., 2017; Liu et al., 2019; Tenney et al., 2019b; Belinkov and Glass, 2019; Yaghoobzadeh et al., 2019): diagnostic classifiers are applied to pretrained language model embeddings to determine whether they encode desired syntactic or semantic features. By probing for s-classes we can quantify directly where and how contextualization happens in BERT. E.g., we find that the strongest contextual interpretation effects occur in the lower layers and that the top two layers contribute little to contextualization. We also investigate how the amount of context available and finetuning affect contextualization.

We make the following contributions: (i) We investigate how accurately BERT interprets words in context. Our quantitative methodology is complementary to existing qualitative interpretations (Tenney et al., 2019a) and cosine-similarity-based intrinsic evaluations (Ethayarajh, 2019) for investigating the “contextual extent” of word representations. We find that BERT’s performance is high (almost 85%), but that there is still room for improvement. (ii) We quantify how much each additional layer contributes in BERT. We find that the strongest contextual interpretation effects occur in the lower layers. The top two layers seem to be optimized only for the pretraining objective of predicting masked words (Devlin et al., 2019) and only add small increments to contextualization. (iii) We investigate the amount of context BERT can exploit for interpreting a word and find that BERT effectively integrates local context up to five words to the left and to the right (a 10-word context window). (iv) We investigate the dynamics of BERT’s representations in finetuning. We find that finetuning has little effect on lower layers, suggesting that they are more easily transferable across tasks. Higher layers are strongly changed for a nonsemantic finetuning objective, but little for a semantic task like paraphrase classification. Finetuning uncovers task-related features, but the knowledge captured in pretraining is largely preserved. We quantify these effects by s-class inference performance.

GloVe        BERT wordpiece
suits        suits
lawsuit      suited
filed        lawsuit
lawsuits     ##suit
sued         lawsuits
complaint    slacks
jacket       47th
Table 1: Nearest neighbors of “suit” in GloVe and wordpiece embeddings of BERT (bert-base-uncased).

2 Motivation and Methodology

The key benefit of pretrained language models (McCann et al., 2017; Peters et al., 2018a; Radford et al., 2019; Devlin et al., 2019) is that they produce contextualized embeddings that are useful in NLP. The top layer contextualized word representations from pretrained language models are widely utilized; however, the fact that pretrained language models implement a process of contextualization – starting with a completely uncontextualized layer of wordpieces at the bottom – is not well studied. Table 1 gives an example: BERT’s wordpiece embedding of “suit” is not contextualized: it contains several meanings of the word, including “to suit” (“be convenient”), lawsuit and clothes (“slacks”). Thus, there is no difference in this respect between BERT’s wordpiece embeddings and standard uncontextualized word embeddings like GloVe (Pennington et al., 2014). Pretrained language models start out with an uncontextualized representation at the lowest layer, then gradually contextualize it. This is the process we analyze in this paper.

For investigating the contextualization process, one possibility is to use word senses and to tap resources like the WordNet (WN) (Fellbaum, 1998) based word sense disambiguation benchmarks of the Senseval series (Edmonds and Cotton, 2001; Snyder and Palmer, 2004; Raganato et al., 2017). However, the abstraction level in WN sense inventories has been criticized as too fine-grained (Izquierdo et al., 2009), providing limited information to applications requiring higher level abstraction. Various levels of granularity of abstraction have been explored such as WN domains (Magnini and Cavaglià, 2000), supersenses (Ciaramita and Johnson, 2003; Levine et al., 2019) and basic level concepts (Beviá et al., 2007). In this paper, we use semantic classes (s-classes) (Yarowsky, 1992; Resnik, 1993; Kohomban and Lee, 2005; Yaghoobzadeh et al., 2019) as the proxy for the meaning contents of words to study the contextualization capability of BERT. Specifically, we use the Wiki-PSE resource (Yaghoobzadeh et al., 2019); see §3.1 for details.

split   words    combinations   contexts
train   35,399   62,184         2,178,895
dev     8,850    15,437         542,938
test    44,250   77,706         2,722,893
Table 2: Number of words, word-s-class combinations and contexts per split in our probing dataset.

3 Probing Dataset and Task

3.1 Probing dataset

An s-class labeled corpus is needed for s-class probing. We use Wiki-PSE (Yaghoobzadeh et al., 2019); it consists of a set of 34 s-classes, an inventory of word-to-s-class mappings and an English Wikipedia corpus in which words in context are labeled with the 34 s-classes. E.g., contexts of “Apple” that refer to the company are labeled with “organization”. We refer to a word labeled with an s-class as a word-s-class combination, e.g., “@apple@-organization”. (Footnote 1: In Wiki-PSE, s-class-labeled occurrences are enclosed with “@”, e.g., “@apple@”.)

The Wiki-PSE text corpus contains 550 million tokens, 17 million of which are annotated with an s-class. Working on the entire Wiki-PSE with BERT is not feasible, e.g., the word-s-class combination “@france@-location” occurs 98,582 times. Processing all these contexts with BERT consumes significant amounts of energy (Strubell et al., 2019) and time. Hence for each word-s-class combination, we sample 100 contexts (or all contexts if there are fewer than 100) to speed up our experiments. Wiki-PSE provides a balanced train/test split; we use 20% of the training set as our development set (cf. the word counts in Table 2). Table 2 gives statistics of our dataset.
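The subsampling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_contexts`, the fixed seed, and the toy context strings are assumptions introduced here; only the cap of 100 contexts per word-s-class combination comes from the text.

```python
import random


def sample_contexts(contexts_by_combination, max_contexts=100, seed=0):
    """Keep at most `max_contexts` contexts per word-s-class combination.

    `contexts_by_combination` maps a combination such as
    "@france@-location" to the list of its contexts (hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    sampled = {}
    for combination, contexts in contexts_by_combination.items():
        if len(contexts) <= max_contexts:
            # Fewer contexts than the cap: keep all of them.
            sampled[combination] = list(contexts)
        else:
            # Otherwise draw a random subset without replacement.
            sampled[combination] = rng.sample(contexts, max_contexts)
    return sampled


# Toy data standing in for the annotated corpus.
corpus = {
    "@france@-location": [f"context {i}" for i in range(250)],
    "@airheads@-art": [f"context {i}" for i in range(47)],
}
subset = sample_contexts(corpus)
```

Frequent combinations are capped at 100 contexts, while rare ones keep every context they have.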

3.2 Probing for semantic classes

For each of the 34 s-classes in Wiki-PSE, we train a binary classifier to diagnose if an input embedding encodes information for inferring the s-class.

3.2.1 Probing uncontextualized embeddings

We want to make a distinction in this paper between two different factors that contribute to BERT’s performance: (i) a powerful learning architecture that gives rise to high-quality representations and (ii) contextualization in applications, i.e., words are represented as contextualized embeddings for solving NLP tasks. Here, we adopt Schuster et al. (2019)’s method of computing uncontextualized BERT embeddings (AVG-BERT-ℓ, where ℓ is the BERT layer; see §4.2.1) and show that (i) alone already has a strong positive effect on performance when compared to other uncontextualized embeddings. So BERT’s representation learning yields high performance, even when used in a completely uncontextualized setting.

We adopt the setup in Yaghoobzadeh et al. (2019) to probe uncontextualized embeddings: for each of the 34 s-classes, we train one multi-layer perceptron (MLP) classifier as shown in Figure 1. Table 2, column words shows the sizes of train/dev/test. The evaluation measure is micro F1 over all decisions of the 34 binary classifiers.

procedure TypeSclsTrainer(Dict: word2vec, Dict: word2sclass, sclass: S, List: TrainWords):
    PosVecs, NegVecs = [], []
    for word ∈ TrainWords do
        vector = word2vec.get(word)
        sclasses = word2sclass.get(word)
        if S ∈ sclasses then
            PosVecs.append(vector)
        else
            NegVecs.append(vector)
    mlp = MLP()
    mlp.train(PosVecs, NegVecs)
    return mlp
Algorithm 1 Train an MLP with type-level embeddings
Figure 1: Training a diagnostic classifier with uncontextualized word representations for an s-class S.
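The data-preparation part of Algorithm 1 can be written as runnable Python. The sketch below covers only the split into positive and negative examples for one s-class; the classifier itself is left out. The function name `split_examples`, the decision to skip words without an embedding, and the toy vectors are assumptions made for illustration.

```python
def split_examples(word2vec, word2sclass, sclass, train_words):
    """Split type-level vectors into positives and negatives for `sclass`."""
    pos_vecs, neg_vecs = [], []
    for word in train_words:
        vector = word2vec.get(word)
        if vector is None:
            continue  # assumption: words without an embedding are skipped
        if sclass in word2sclass.get(word, ()):
            pos_vecs.append(vector)  # word has this s-class: positive example
        else:
            neg_vecs.append(vector)  # word lacks this s-class: negative example
    return pos_vecs, neg_vecs


# Toy two-dimensional embeddings and s-class inventory (made up).
word2vec = {"apple": [0.1, 0.3], "paris": [0.7, 0.2], "juice": [0.4, 0.4]}
word2sclass = {
    "apple": {"organization", "food"},
    "paris": {"location"},
    "juice": {"food"},
}
pos, neg = split_examples(word2vec, word2sclass, "food", ["apple", "paris", "juice"])
```

Because a type-level embedding is a single vector per word, an ambiguous word such as "apple" is a positive example for every s-class it belongs to.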

3.2.2 Probing contextualized embeddings

We probe BERT with the same setup: one MLP is trained for each of the 34 s-classes; each BERT layer is probed individually.

For uncontextualized embeddings, a word has a single vector, which is either a positive or negative example for an s-class. For contextualized embeddings, the contexts of a word will typically be mixed; for example, “food” contexts (a candy) of “@airheads@” are positive and “art” contexts (a film) of “@airheads@” are negative examples for the MLP “food”. Table 2, column contexts shows the sizes of train/dev/test when probing BERT. Figure 2 compares our probing setups.

When evaluating the MLPs, we want to weight frequent word-s-class combinations (those having 100 contexts in our dataset) and the much larger number of infrequent word-s-class combinations equally. To this end, we aggregate the decisions for the contexts of a word-s-class combination. We stipulate that at least half of the contexts must be correctly classified. For example, “@airheads@-art” occurs 47 times, so we evaluate the “art” MLP as accurate for “@airheads@-art” if it classifies at least 24 contexts correctly. The final evaluation measure is micro F1 over all 15,437 (for dev) and 77,706 (for test) decisions (see Table 2) of the 34 MLPs for the word-s-class combinations.
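The aggregation rule and a pooled micro-averaged F1 can be sketched as below. How exactly the aggregated decisions enter the F1 computation is not spelled out in the text, so the `(tp, fp, fn)` bookkeeping per classifier is an assumption; the "at least half of the contexts" rule and the 47-context example are taken from the paragraph above.

```python
def combination_is_correct(num_correct, num_contexts):
    """A combination counts as correct if at least half its contexts are."""
    return 2 * num_correct >= num_contexts


def micro_f1(counts):
    """Micro-averaged F1 from per-classifier (tp, fp, fn) triples.

    Counts are pooled over all classifiers before computing precision
    and recall, which is what distinguishes micro from macro averaging.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# "@airheads@-art" has 47 contexts, so 24 correct contexts are enough.
enough = combination_is_correct(24, 47)
not_enough = combination_is_correct(23, 47)

# Two hypothetical classifiers with made-up (tp, fp, fn) counts.
score = micro_f1([(8, 2, 1), (2, 0, 3)])
```

Aggregating per combination first means a combination with 100 contexts and one with 3 contexts each contribute a single decision.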

Figure 2: Setups for probing uncontextualized and contextualized embeddings. For BERT, we input a context sentence to extract the contextualized embedding of a word, e.g., “airheads”; “food” is the correct s-class label for this context.

4 Experiments and Results

4.1 Data preprocessing

BERT uses wordpieces (Wu et al., 2016) to represent text. Many words are tokenized to several wordpieces. For example, “infrequent” is tokenized to “in”, “##fr”, “##e”, and “##quent”. In this paper, we average wordpiece embeddings to get a single vector representation of a word. (Footnote 2: Some “words” in Wiki-PSE are in reality multiword phrases. Again, we average in these cases to get a single vector representation.)
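Averaging wordpiece embeddings into one word vector amounts to an element-wise mean. The sketch below uses plain Python lists and made-up two-dimensional vectors in place of real 768-dimensional BERT embeddings; the helper name `average_vectors` is an assumption.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy embeddings for the wordpieces of "infrequent" (values are made up).
wordpiece_vectors = {
    "in": [1.0, 0.0],
    "##fr": [0.0, 1.0],
    "##e": [1.0, 1.0],
    "##quent": [2.0, 2.0],
}
pieces = ["in", "##fr", "##e", "##quent"]
word_vector = average_vectors([wordpiece_vectors[p] for p in pieces])
```

The same helper applies to multiword phrases, where the vectors of all wordpieces in the phrase are averaged.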

We limit the maximum sequence length of the context sentence input to BERT to 128. Consistent with the probing literature, we use a simple probing classifier: a 1-layer MLP with 1024 hidden dimensions and ReLU.

4.2 Quantifying contextualization

4.2.1 Representation learners

Six uncontextualized embedding spaces are evaluated: (i) PSE. A 300-dimensional embedding space computed by running skipgram with negative sampling (Mikolov et al., 2013) on the Wiki-PSE text corpus. Yaghoobzadeh et al. (2019) show that PSE outperforms other standard embedding models. (ii) Rand. An embedding space with the same vocabulary and dimension size as PSE. Each vector is drawn at random. Rand is added to confirm that word representations indeed encode valid meaning contents that can be identified by diagnostic MLPs rather than random weights. (iii) fastText. The 300-dimensional fastText (Bojanowski et al., 2017) embeddings. (iv) GloVe. The 300-dimensional space trained on 6 billion tokens (Pennington et al., 2014). Out-of-vocabulary (OOV) words are associated with randomly drawn vectors. (v) BERTw. The 768-dimensional wordpiece embeddings in BERT. We tokenize a word with the BERT tokenizer then average its wordpiece embeddings. (vi) AVG-BERT-ℓ. For an annotated word in Wiki-PSE, we average all of its contextualized embeddings from BERT layer ℓ in the Wiki-PSE text corpus. (Footnote 3: BERTw and AVG-BERT-ℓ have more dimensions. But Yaghoobzadeh et al. (2019) showed that different dimensionalities have a negligible impact on relative performance when probing for s-classes using MLPs as diagnostic classifiers.) Comparing AVG-BERT-ℓ with the others addresses a new question: to what extent does this “uncontextualized” variant of BERT outperform others in encoding different s-classes of a word?
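The averaging behind the BERT-derived type-level embeddings can be illustrated with running sums per word. This is a sketch under stated assumptions: the `(word, vector)` input layout, the function name `type_level_embeddings`, and the toy vectors are hypothetical, and the vectors stand in for contextualized embeddings taken from one fixed BERT layer.

```python
from collections import defaultdict


def type_level_embeddings(occurrences):
    """Average all contextualized vectors of each word into one vector.

    `occurrences` is an iterable of (word, vector) pairs, one pair per
    annotated occurrence of the word in the corpus.
    """
    sums = {}
    counts = defaultdict(int)
    for word, vector in occurrences:
        if word not in sums:
            sums[word] = [0.0] * len(vector)
        for i, value in enumerate(vector):
            sums[word][i] += value  # accumulate component-wise
        counts[word] += 1
    return {w: [x / counts[w] for x in total] for w, total in sums.items()}


# Three toy occurrences: "apple" in two contexts, "suit" in one.
occurrences = [
    ("apple", [1.0, 0.0]),
    ("apple", [0.0, 1.0]),
    ("suit", [2.0, 2.0]),
]
avg_embeddings = type_level_embeddings(occurrences)
```

The result has one vector per word regardless of how many contexts it occurred in, which is what makes the representation uncontextualized.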

Four contextualized embedding models are considered: (i) BERT. We use the PyTorch (Paszke et al., 2019; Wolf et al., 2019) version of the 12-layer bert-base-uncased model (Wiki-PSE is uncased). (ii) P-BERT. A bag-of-word model that “contextualizes” the wordpiece embedding of an annotated word by averaging the embeddings of wordpieces of the sentence it occurs in. P-BERT serves as a baseline against which we can quantify how much contextualization in different BERT layers helps in s-class inference. (iii) P-fastText. Similar to P-BERT, but we use fastText word embeddings. Comparing BERT with P-fastText indicates to what extent BERT outperforms standard embedding spaces when they also have access to contextual information. (iv) P-Rand. Similar to P-BERT, but we draw word embeddings at random. Wieting and Kiela (2019) show that a random baseline has good performance in tasks like sentence classification.

      Standard embeddings                 | AVG-BERT-ℓ (layer ℓ)
      Rand  BERTw  fastText  GloVe  PSE   | 0     1     2     3     4     5     6     7     8     9     10    11
dev   .269  .653   .625      .681   .790  | .746  .759  .764  .775  .786  .791  .794  .805  .811  .812  .813  .809
test  .267  .652   .626      .680   .787  | .744  .756  .762  .773  .783  .788  .790  .802  .806  .809  .808  .806
Table 3: S-class probing results for uncontextualized embeddings. Numbers are micro F1 on Wiki-PSE. Our result (0.787 on PSE-test) is consistent with Yaghoobzadeh et al. (2019).
      Bag-of-word context         | BERT layer
      P-Rand  P-fastText  P-BERT  | 0     1     2     3     4     5     6     7     8     9     10    11
dev   .637    .707        .672    | .649  .692  .711  .739  .771  .782  .795  .813  .826  .832  .836  .835
test  .630    .707        .670    | .645  .688  .708  .737  .766  .777  .790  .810  .824  .828  .830  .831
Table 4: S-class probing results for contextualized embedding models. Numbers are micro F1 on Wiki-PSE.

4.2.2 S-class inference results

Table 3 shows uncontextualized embedding probing results. Comparing with random weights, all embedding spaces encode informative features helping s-class inference. BERTw delivers results similar to GloVe and fastText, demonstrating our earlier point (cf. the qualitative example in Table 1) that the lowest layer of BERT is uncontextualized.

PSE performs strongly, consistent with observations in Yaghoobzadeh et al. (2019). AVG-BERT-ℓ, e.g., AVG-BERT-10, performs best among all spaces. Thus for a given word, averaging its contextualized embeddings from BERT yields a high quality type-level embedding vector, similar to “anchor words” in cross-lingual alignment (Schuster et al., 2019). As expected, the top AVG-BERT layers outperform lower layers, given the deep architecture of BERT. Additionally, AVG-BERT-0 significantly outperforms BERTw, evidencing the importance of self attention (Vaswani et al., 2017) when composing the wordpieces of a word.

Table 4 shows contextualized embedding probing results. Comparing BERT layers, a clear trend can be identified: s-class inference performance increases monotonically with higher layers. This increase levels off in the top layers. Thus, the features from deeper layers are more abstract and benefit s-class inference. It also verifies previous findings: semantic tasks are mainly solved at higher layers (Liu et al., 2019; Tenney et al., 2019a). We can also observe that the strongest contextualization occurs early at lower layers – going up to layer 1 from layer 0 brings a 4% (absolute) improvement.
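The layer-by-layer gains can be read off directly from the dev row of Table 4. The short calculation below uses only those reported numbers; the variable names are ours.

```python
# Dev-set scores of BERT layers 0..11, copied from Table 4.
dev_f1 = [0.649, 0.692, 0.711, 0.739, 0.771, 0.782,
          0.795, 0.813, 0.826, 0.832, 0.836, 0.835]

# Gain obtained by moving from layer i to layer i + 1.
gains = [round(upper - lower, 3) for lower, upper in zip(dev_f1, dev_f1[1:])]

# Index i of the largest gain (transition from layer i to layer i + 1).
largest = max(range(len(gains)), key=gains.__getitem__)

# Total contribution of the top two layers (layer 9 -> layer 11).
top_two = round(dev_f1[11] - dev_f1[9], 3)
```

On these numbers the largest single step is from layer 0 to layer 1 (about .043), while the top two layers together add only about .003 on dev.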

The very limited contextualization improvement brought by the top two layers may explain why representations from the top layers of BERT can deliver suboptimal performance on NLP tasks (Liu et al., 2019): the top layers are optimized for the pretraining objective, i.e., predicting masked words (Voita et al., 2019), not for the contextualization of words that is helpful for NLP tasks.

BERT layer 0 performs slightly worse than P-BERT, which may be due to the fact that some attention heads in lower layers of BERT attend broadly in the sentence, producing “bag-of-vector-like” representations (Clark et al., 2019), which is in fact close to the setup of P-BERT. However, starting from layer 1, BERT gradually improves and surpasses P-BERT, achieving a maximum delta of .161 in F1 (.831-.670, layer 11 on test). Thus, BERT knows how to better interpret the word in context, i.e., contextualize the word, when progressively going to deeper (higher) layers.

P-Rand performs strongly, but is noticeably worse than P-fastText and P-BERT. P-fastText outperforms P-BERT and BERT layers 0 and 1. We conjecture that this is due to the fact that fastText learns embeddings directly for words; P-BERT and BERT have to compose subwords to understand the meaning of a word, which is more challenging. However, starting from layer 2, BERT outperforms P-fastText and P-BERT, illustrating the effectiveness of self attention in better integrating the information from the context into contextualized embeddings than simple averaging in bag-of-word models.

Table 3 and Table 4 jointly illustrate the high quality of word representations computed by BERT. The BERT-derived uncontextualized AVG-BERT-ℓ representations, modeled on Schuster et al. (2019)’s anchor words, show superior capability in inferring s-classes of a word, performing best among all uncontextualized embeddings. This suggests that BERT’s powerful learning architecture may be the main reason for BERT’s high performance, not contextualization proper, i.e., the representation of words as contextualized embeddings on the highest layer when BERT is applied to NLP tasks. This offers an intriguing possibility for creating strongly performing uncontextualized BERT-derived models that are more compact and more efficiently deployable.

4.2.3 Qualitative analysis

§4.2.2 quantitatively shows BERT performs strongly in contextualizing words, thanks to its deep integration of information from the entire input sentence in each contextualized embedding. But there are scenarios where BERT fails. We identify two such cases in which the contextual information does not help s-class inference.

(i) Tokenization. In some domains, the annotated word and/or its context words are tokenized into several wordpieces and BERT may not be able to derive the correct composed meaning. Then the MLPs cannot identify the correct s-class from the noisy input. Consider the tokenized results of “@glutamate@-biology” and one of its contexts:

“three ne ##uro ##tra ##ns ##mit ##ters that play important roles in adolescent brain development are g ##lu ##tama ##te”

Though “brain development” hints at a context related to “biology”, this signal could be swamped by the noise in embeddings of other – especially short – wordpieces. Mimicking approaches (Pinter et al., 2017) have been proposed to help BERT understand rare words (Schick and Schütze, 2019).

(ii) Uninformative contexts. Some contexts do not provide sufficient information related to the s-class. For example, according to probing results on BERTw, the wordpiece embedding of “goodfellas” does not encode the meaning of s-class “art” (or movies); the context “Chase also said he wanted Imperioli because he had been in Goodfellas” of word-s-class combination “@goodfellas@-art” is not informative enough for inferring an “art” context, yielding incorrect predictions in higher layers.

4.3 Context size

We now quantify the size of the context that BERT requires for accurate s-class inference.

When probing for the s-class of a word, we define context size as the number of words surrounding it on each side (left and right). For example, a context size of 5 means 5 words left, 5 words right. This definition is with respect to words, not wordpieces, so the number of wordpieces generally will be higher. The context size seems to be picked heuristically in other work. Yarowsky (1992) and Gale et al. (1992) use 50 while Black (1988) uses 3–6. We experiment with a range of context sizes then compare s-class inference results. We also include P-BERT for comparison. Note that this experiment is different from edge probing (Tenney et al., 2019b), which takes the full sentence as input. We only make input words within the context window available to BERT and P-BERT.
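Restricting the input to a context window can be sketched with a list slice. This is a minimal illustration that assumes whitespace tokenization and leaves punctuation attached to words; the function name `context_window` is hypothetical, and the example sentence is the “Zande” sentence discussed in §4.3.1.

```python
def context_window(words, target_index, context_size):
    """Return the target word plus `context_size` words on each side."""
    start = max(0, target_index - context_size)  # clip at sentence start
    end = target_index + context_size + 1        # slicing clips at the end
    return words[start:end]


sentence = "The Azande speak Zande, which they call Pa-Zande."
words = sentence.split()   # assumption: whitespace tokenization
target = 3                 # index of "Zande,"

size_0 = context_window(words, target, 0)
size_2 = context_window(words, target, 2)
size_32 = context_window(words, target, 32)
```

With context size 0 only the target word remains, with size 2 the window already contains "speak", and a window larger than the sentence simply returns the whole sentence.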

Figure 3: Probing results on the dev set with different context sizes. For BERT, performance increases with context size. Large context sizes like 16 and 32 slightly hurt performance of P-BERT.

4.3.1 Probing results

Similar to §4.2.2, we report micro F1 on Wiki-PSE dev, with context sizes ranging from 0 to 32. Context size 0 means that the input consists only of the wordpiece embeddings of the input word. Figure 3 shows results.

Comparing context sizes. Larger context sizes have higher performance for all BERT layers. Improvements are most prominent for small context sizes, e.g., 2 and 4, meaning that often local features are sufficient to infer s-classes. Further increasing the context size improves performances marginally.

A qualitative example showing informative local features is “The Azande speak Zande, which they call Pa-Zande.” In this context, the gold s-class of “Zande” is “language” (instead of “people-ethnicity”, i.e., the Zande people). The MLPs for BERTw and for context size 0 for BERT fail to identify s-class “language”. But the BERT MLP for context size 2 predicts “language” correctly since it includes the strong signal “speak”. This context is a case of selectional restrictions (Resnik, 1993; Jurafsky and Martin, 2009), in this case possible objects of “speak”.

As local features in small context sizes already bring noticeable improvements on s-class inference performance, we hypothesize that it may not be necessary to exploit the full context in (e.g., mobile) applications where the quadratic complexity of full-sentence attention is problematic.

P-BERT shows a similar pattern when comparing context sizes. However, large context sizes such as 16 and 32 hurt performance, meaning that averaging the embeddings of too many context words swamps the identity of the annotated target word.

Comparing BERT layers. Higher layers of BERT yield better contextualized word embeddings. This phenomenon is more noticeable for large context sizes such as 8, 16 and 32. However, for small context sizes, e.g., 0, embeddings from all layers perform similarly and badly. This means that without context information, simply passing the wordpiece embedding of a word through BERT layers does not help, suggesting that contextualization is the key factor for BERT’s impressive s-class inference performance.

Again, P-BERT only outperforms layer 0 of BERT with most context sizes, suggesting that BERT layers, especially the top layers, produce abstract and informative features for accurate s-class inference, instead of naively considering all information within the context sentence.

4.4 Probing finetuned embeddings

Up to now, we have done “classical” probing: we extract features from weight-frozen pretrained BERT and feed them to diagnostic classifiers. Now we turn to the question: how do pretrained knowledge and probed features change when BERT is finetuned on downstream tasks?

Figure 4: Comparing s-class inference results of pretrained BERT and BERT finetuned on POS, SST-2 and MRPC. “Pretrained”: probing results on weight-frozen pretrained BERT in §4.2. (a): probing results of the MLPs from §4.2. (b): probing results of new MLPs trained with word embeddings from finetuned BERT.

4.4.1 Finetuning tasks

We finetune BERT on three tasks: part-of-speech (POS) tagging on the Penn Treebank (Marcus et al., 1993), binary sentiment classification on the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) and paraphrase detection on the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005). For SST-2 and MRPC, we use the GLUE train and dev sets (Wang et al., 2018). For POS, sections 0-18 of WSJ are train and sections 19-21 are dev (Collins, 2002).

Following Devlin et al. (2019), we put a softmax layer on top of the pretrained BERT, then finetune all trainable parameters. We use Adam (Kingma and Ba, 2014) with learning rate 5e-5 for 5 epochs. We save the model from the step that performs best on dev (of POS/SST-2/MRPC), extract representations from Wiki-PSE using this model and then report results on Wiki-PSE dev. Our models’ performance is close to that reported by Devlin et al. (2019).

4.4.2 Probing results

We now quantify the contextualization of word representations from finetuned BERT models. Two setups are considered: (a) directly apply the MLPs in §4.2 (trained with pretrained embeddings) to embeddings computed by finetuned BERT; (b) train and evaluate a new set of MLPs on the embeddings computed by finetuned BERT.

Comparing (a) with probing results on pretrained BERT (§4.2) gives us an intuition about how much change occurred to the knowledge captured during pretraining. Comparing (b) with §4.2 reveals whether or not the pretrained knowledge is still preserved in finetuned models.

Figure 4 shows s-class probing results on Wiki-PSE dev of pretrained BERT (“Pretrained”), POS-, SST-2- and MRPC-finetuned BERT in setups (a) and (b); e.g., layer 11 s-class inference performance of the POS-finetuned BERT decreases by 0.763 (0.835 → 0.072, from “Pretrained” to “POS-(a)”) when using the MLPs from §4.2 (setup (a)).

Comparing finetuning tasks. Finetuning BERT on SST-2 and MRPC introduces much smaller performance decreases than for POS and this holds for both (a) and (b). For setup (b), finetuning BERT on MRPC introduces small but consistent improvements on s-class inference. S-class inference, SST-2 and MRPC are somewhat similar tasks, all concerned with semantics; this likely explains this limited performance change (Mou et al., 2016). On the other hand, the syntactic task POS seems to encourage BERT to “propagate” the syntactic information that is represented on lower layers to the upper layers; due to their limited capacity, the fixed-size vectors in the upper layers then lose semantic information and a more significant performance drop occurs on s-class probing (see green dotted line).

Comparing BERT layers. Contextualized embeddings from BERT’s top layers are strongly affected by finetuning, especially for setup (a). In contrast, lower layers are more invariant and show s-class inference results similar to the pretrained model. Hao et al. (2019), Lee et al. (2019) and Kovaleva et al. (2019) make similar observations: contextualized embeddings from lower layers are more transferable across different tasks and contextualized embeddings from top layers are more task-specific after finetuning.

Figure 5 shows the cosine similarity of the flattened self attention weights computed by pretrained, POS- and MRPC-finetuned BERT. We see that top layers are more sensitive to finetuning (darker color) while lower layers are barely changed (lighter color). Top layers have much larger changes for POS than for MRPC, in line with probing results in Figure 4.
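The similarity measure used for this comparison can be illustrated on toy data: flatten two attention matrices of the same head and take the cosine of the resulting vectors. The 2×2 matrices below are made up, and the helper names `flatten` and `cosine_similarity` are assumptions; real attention matrices would come from the pretrained and finetuned models on the same input.

```python
import math


def flatten(matrix):
    """Concatenate the rows of a matrix into one flat list."""
    return [value for row in matrix for value in row]


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy attention weights of one head before and after finetuning.
pretrained_attention = [[0.9, 0.1], [0.2, 0.8]]
finetuned_attention = [[0.5, 0.5], [0.5, 0.5]]

unchanged = cosine_similarity(flatten(pretrained_attention),
                              flatten(pretrained_attention))
changed = cosine_similarity(flatten(pretrained_attention),
                            flatten(finetuned_attention))
```

A head whose attention pattern is untouched by finetuning scores close to 1, while a head whose pattern has shifted scores lower.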

Comparing (a) and (b), we see that the knowledge captured during pretraining is still preserved to some extent after finetuning. For example, the MLPs trained with layer 11 embeddings computed by the POS-finetuned BERT achieve a reasonably good score of 0.735 (a 0.100 drop compared with “Pretrained” – compare black and green dotted lines in Figure 4). Thus, the semantic information needed for inferring s-classes is still present to a large extent. Finetuning on a task may contribute to large changes in the representation – similar to the projection utilized to uncover divergent information in uncontextualized word embeddings (Artetxe et al., 2018) – but relatively little information is lost as the good performance of the newly trained MLPs shows.

Figure 5: Cosine similarity of flattened self attention weights computed by pretrained, POS- and MRPC-finetuned BERT. Panels: (i) pretrained vs. POS; (ii) pretrained vs. MRPC. X-axis: index of the 12 self attention heads; y-axis: layer index. Darker colors: smaller similarities = larger changes brought by finetuning.

5 Related Work

Interpreting deep networks. Pretrained language models (McCann et al., 2017; Peters et al., 2018a; Radford et al., 2019; Devlin et al., 2019) advance NLP by contextualized token-level representations of words. A key goal of current research is to understand how these models work and what they represent on different layers. Interpretation through, e.g., feature visualization (Maaten and Hinton, 2008), attention analysis (Xu et al., 2015), and gradient inspection (Shrikumar et al., 2017) helps in better understanding their impressive performance.

Probing tasks. Probing is a recent strand of work that investigates – via diagnostic classifiers – desired syntactic and semantic features encoded in pretrained language model representations. Shi et al. (2016) show that string-based RNNs encode syntactic information. Belinkov et al. (2017) investigate word representations at different layers in NMT. Linzen et al. (2016) assess the syntactic ability of LSTM (Hochreiter and Schmidhuber, 1997) encoders and Goldberg (2019) of BERT. Tenney et al. (2019a) find that information on POS tagging, parsing, NER, semantic roles and coreference is represented on progressively higher layers of BERT. Yaghoobzadeh et al. (2019) assess the disambiguation properties of type-level word representations. Liu et al. (2019) and Lin et al. (2019) investigate the linguistic knowledge encoded in BERT. Adi et al. (2016), Conneau et al. (2018) and Wieting and Kiela (2019) study sentence embedding properties via probing. Peters et al. (2018b) probe how the network architecture affects the learned vectors. In all of these studies, probing serves to analyze representations and reveal their properties. We employ probing to investigate the contextualization of words in pretrained language models quantitatively.

Ethayarajh (2019) also quantitatively investigates contextualized embeddings using cosine-similarity-based intrinsic evaluation. Inferring s-classes, we address a complementary set of questions because we can quantify contextualization with a uniform set of semantic classes.

Two-stage NLP paradigm. Recent work like ULMFiT (Howard and Ruder, 2018) and BERT popularizes the “two-stage paradigm” in NLP (Dai and Le, 2015): pretrain a language encoder on a large amount of unlabeled data via self-supervised learning, then finetune the encoder on task-specific benchmarks like GLUE (Wang et al., 2018, 2019; Peters et al., 2019). This transfer-learning pipeline yields good and robust results compared to models trained from scratch (Hao et al., 2019). In this work, we shed light on how BERT’s pretrained knowledge changes during finetuning by comparing s-class inference ability of pretrained and finetuned BERT models.

6 Conclusion

We presented a quantitative study of the contextualization of words in BERT by investigating the semantic class inference capabilities of BERT. Our study complements qualitative prior work (e.g., Tenney et al. (2019a)) and the cosine-similarity-based intrinsic evaluations in Ethayarajh (2019). We investigated two key factors for successful contextualization by BERT: layer index and context size. By comparing pretrained and finetuned BERT, we showed that the similarity between pretraining and finetuning objectives heavily influences contextualized embeddings; top layers of BERT are more sensitive to the finetuning objective than lower layers. We also found that BERT’s pretrained knowledge is still preserved to a large extent after finetuning.

We showed that exploiting the full context may be unnecessary in applications where the quadratic complexity of full-sentence attention is problematic. We plan to evaluate this phenomenon on more datasets and downstream tasks in future work.