Pretrained language models like ELMo (Peters et al., 2018a), BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) are top performers in NLP because they learn contextualized representations, i.e., representations that reflect the interpretation of a word in context as opposed to its general meaning, which is less helpful in solving NLP tasks.
While what we just stated – that pretrained language models contextualize words – is clear qualitatively, there has been little quantitative work investigating contextualization, i.e., the extent to which a word can be interpreted in context.
We use BERT (Devlin et al., 2019) as our pretrained language model and quantify contextualization by investigating how well BERT infers semantic classes (s-classes) of a word in context, e.g., the s-class organization for “Apple” in “Apple stock rises” vs. the s-class food in “Apple juice is healthy”. We use s-class inference as a proxy for contextualization since accurate s-class inference reflects a successful contextualization of a word – an effective interpretation of the word in context.
Our methodology is probing: diagnostic classifiers are applied to pretrained language model embeddings to determine whether they encode desired syntactic or semantic features. By probing for s-classes we can quantify directly where and how contextualization happens in BERT. E.g., we find that the strongest contextual interpretation effects occur in the lower layers and that the top two layers contribute little to contextualization. We also investigate how the amount of context available and finetuning affect contextualization.
We make the following contributions: (i) We investigate how accurately BERT interprets words in context. Our quantitative methodology is complementary to existing qualitative interpretations (Tenney et al., 2019a)
and cosine-similarity-based intrinsic evaluations (Ethayarajh, 2019) for investigating the “contextual extent” of word representations. We find that BERT’s performance is high (almost 85%), but that there is still room for improvement. (ii) We quantify how much each additional layer contributes in BERT. We find that the strongest contextual interpretation effects occur in the lower layers. The top two layers seem to be optimized only for the pretraining objective of predicting masked words (Devlin et al., 2019) and only add small increments to contextualization. (iii) We investigate the amount of context BERT can exploit for interpreting a word and find that BERT effectively integrates local context up to five words to the left and to the right (a 10-word context window). (iv) We investigate the dynamics of BERT’s representations in finetuning. We find that finetuning has little effect on lower layers, suggesting that they are more easily transferable across tasks. Higher layers are strongly changed for a non-semantic finetuning objective, but little for a semantic task like paraphrase classification. Finetuning uncovers task-related features, but the knowledge captured in pretraining is largely preserved. We quantify these effects by s-class inference performance.
2 Motivation and Methodology
The key benefit of pretrained language models (McCann et al., 2017; Peters et al., 2018a; Radford et al., 2019; Devlin et al., 2019) is that they produce contextualized embeddings that are useful in NLP. The top layer contextualized word representations from pretrained language models are widely utilized; however, the fact that pretrained language models implement a process of contextualization – starting with a completely uncontextualized layer of wordpieces at the bottom – is not well studied. Table 1 gives an example: BERT’s wordpiece embedding of “suit” is not contextualized: it contains several meanings of the word, including “to suit” (“be convenient”), lawsuit and clothes (“slacks”). Thus, there is no difference in this respect between BERT’s wordpiece embeddings and standard uncontextualized word embeddings like GloVe (Pennington et al., 2014). Pretrained language models start out with an uncontextualized representation at the lowest layer, then gradually contextualize it. This is the process we analyze in this paper.
For investigating the contextualization process, one possibility is to use word senses and to tap resources like the WordNet (WN; Fellbaum, 1998) based word sense disambiguation benchmarks of the Senseval series (Edmonds and Cotton, 2001; Snyder and Palmer, 2004; Raganato et al., 2017). However, the abstraction level of WN sense inventories has been criticized as too fine-grained (Izquierdo et al., 2009), providing limited information to applications requiring higher-level abstraction. Various levels of granularity have been explored, such as WN domains (Magnini and Cavaglià, 2000), supersenses (Ciaramita and Johnson, 2003; Levine et al., 2019) and basic level concepts (Beviá et al., 2007). In this paper, we use semantic classes (s-classes) (Yarowsky, 1992; Resnik, 1993; Kohomban and Lee, 2005; Yaghoobzadeh et al., 2019) as the proxy for the meaning content of words to study the contextualization capability of BERT. Specifically, we use the Wiki-PSE resource (Yaghoobzadeh et al., 2019); see §3.1 for details.
3 Probing Dataset and Task
3.1 Probing dataset
An s-class labeled corpus is needed for s-class probing. We use Wiki-PSE (Yaghoobzadeh et al., 2019); it consists of a set of 34 s-classes, an inventory of word–s-class mappings and an English Wikipedia corpus in which words in context are labeled with the 34 s-classes. E.g., contexts of “Apple” that refer to the company are labeled with “organization”. We refer to a word labeled with an s-class as a word-s-class combination, e.g., “@apple@-organization”. (In Wiki-PSE, s-class-labeled occurrences are enclosed with “@”, e.g., “@apple@”.)
The Wiki-PSE text corpus contains 550 million tokens, 17 million of which are annotated with an s-class. Working on the entire Wiki-PSE with BERT is not feasible; e.g., the word-s-class combination “@france@-location” alone occurs 98,582 times, and processing all these contexts with BERT consumes significant amounts of energy (Strubell et al., 2019) and time. Hence, for each word-s-class combination, we sample 100 contexts (or all contexts if there are fewer than 100) to speed up our experiments. Wiki-PSE provides a balanced train/test split; we use a held-out portion of the training set as our development set. Table 2 gives statistics of our dataset.
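The sampling step can be sketched as follows (a minimal illustration, not the authors' code; the cap of 100 contexts and the occurrence counts come from the text, the helper name `sample_contexts` is ours):

```python
import random

# For each word-s-class combination, keep at most 100 randomly sampled
# contexts; combinations with fewer contexts keep all of them.
def sample_contexts(contexts, cap=100, seed=0):
    contexts = list(contexts)
    if len(contexts) <= cap:
        return contexts
    return random.Random(seed).sample(contexts, cap)

# "@france@-location" occurs 98,582 times -> capped at 100 contexts.
assert len(sample_contexts(range(98582))) == 100
# "@airheads@-art" occurs 47 times -> all 47 contexts are kept.
assert len(sample_contexts(range(47))) == 47
```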
3.2 Probing for semantic classes
For each of the 34 s-classes in Wiki-PSE, we train a binary classifier to diagnose if an input embedding encodes information for inferring the s-class.
3.2.1 Probing uncontextualized embeddings
We want to make a distinction in this paper between two different factors that contribute to BERT’s performance: (i) a powerful learning architecture that gives rise to high-quality representations and (ii) contextualization in applications, i.e., words are represented as contextualized embeddings for solving NLP tasks. Here, we adopt Schuster et al. (2019)’s method of computing uncontextualized BERT embeddings (AVG-BERT-ℓ, see §4.2.1) and show that (i) alone already has a strong positive effect on performance when compared to uncontextualized embeddings. So BERT’s representation learning yields high performance, even when used in a completely uncontextualized setting.
We adopt the setup in Yaghoobzadeh et al. (2019) to probe uncontextualized embeddings – for each of the 34 s-classes, we train one multi-layer perceptron (MLP) classifier as shown in Figure 1. Table 2, column “words”, shows the sizes of train/dev/test. The evaluation measure is micro F1 over all decisions of the 34 binary classifiers.
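The per-s-class diagnostic classifier can be sketched as follows (a toy NumPy forward pass with randomly initialized, untrained weights; the 1-layer MLP with 1024 hidden units and ReLU is stated in §4.1, everything else – names, dimensions, initialization – is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_probe(emb_dim, hidden_dim=1024):
    """Randomly initialized binary probe for a single s-class."""
    return {
        "W1": rng.normal(0, 0.02, (emb_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0, 0.02, (hidden_dim, 1)),
        "b2": np.zeros(1),
    }

def probe_forward(p, emb):
    """emb: (batch, emb_dim) -> logits (batch,); >0 means 's-class present'."""
    h = np.maximum(emb @ p["W1"] + p["b1"], 0.0)  # 1 hidden layer with ReLU
    return (h @ p["W2"] + p["b2"]).ravel()

# One independent binary probe per s-class: 34 in total.
probes = [make_probe(300) for _ in range(34)]
x = rng.normal(size=(8, 300))          # a batch of 8 word embeddings
logits = probe_forward(probes[0], x)   # decisions of the first probe
```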
3.2.2 Probing contextualized embeddings
We probe BERT with the same setup: one MLP is trained for each of the 34 s-classes; each BERT layer is probed individually.
For uncontextualized embeddings, a word has a single vector, which is either a positive or negative example for an s-class. For contextualized embeddings, the contexts of a word will typically be mixed; for example, “food” contexts (a candy) of “@airheads@” are positive and “art” contexts (a film) of “@airheads@” are negative examples for the MLP “food”. Table 2, column contexts shows the sizes of train/dev/test when probing BERT. Figure 2 compares our probing setups.
When evaluating the MLPs, we want to weight frequent word-s-class combinations (those having 100 contexts in our dataset) and the much larger number of infrequent word-s-class combinations equally. To this end, we aggregate the decisions for the contexts of a word-s-class combination. We stipulate that at least half of the contexts must be correctly classified. For example, “@airheads@-art” occurs 47 times, so we evaluate the “art” MLP as accurate for “@airheads@-art” if it classifies at least 24 contexts correctly. The final evaluation measure is micro F1 over all 15,437 (for dev) and 77,706 (for test) decisions (see Table 2) of the 34 MLPs for the word-s-class combinations.
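The aggregation rule can be stated compactly (a minimal sketch; `combination_correct` is a hypothetical helper name):

```python
# A word-s-class combination counts as correctly inferred iff at least half
# of its contexts are classified correctly by the corresponding s-class MLP.
def combination_correct(per_context_predictions):
    """per_context_predictions: list of booleans, one per context occurrence."""
    n = len(per_context_predictions)
    return sum(per_context_predictions) * 2 >= n  # at least half correct

# "@airheads@-art" occurs 47 times: 24 correct contexts suffice (24*2 >= 47).
assert combination_correct([True] * 24 + [False] * 23)
assert not combination_correct([True] * 23 + [False] * 24)
```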
4 Experiments and Results
4.1 Data preprocessing
BERT uses wordpieces (Wu et al., 2016) to represent text. Many words are tokenized into several wordpieces. For example, “infrequent” is tokenized into “in”, “##fr”, “##e”, and “##quent”. In this paper, we average wordpiece embeddings to get a single vector representation of a word. (Some “words” in Wiki-PSE are in reality multiword phrases; again, we average in these cases to get a single vector representation.)
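The pooling step can be sketched as follows (toy 2-dimensional vectors for illustration; real inputs are 768-dimensional BERT wordpiece embeddings and the word-to-piece alignment comes from the BERT tokenizer):

```python
import numpy as np

# Average the wordpiece embeddings belonging to one word into a single vector.
def word_vector(wordpiece_vecs, piece_indices):
    """Mean over the rows of wordpiece_vecs selected by piece_indices."""
    return np.mean(wordpiece_vecs[piece_indices], axis=0)

# "infrequent" -> "in", "##fr", "##e", "##quent" (pieces 0..3 in this toy case)
pieces = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
vec = word_vector(pieces, [0, 1, 2, 3])  # -> [1.0, 0.5]
```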
We limit the maximum sequence length of the context sentence input to BERT to 128. Consistent with the probing literature, we use a simple probing classifier: a 1-layer MLP with 1024 hidden dimensions and ReLU.
4.2 Quantifying contextualization
4.2.1 Representation learners
Six uncontextualized embedding spaces are evaluated: (i) PSE. A 300-dimensional embedding space computed by running skipgram with negative sampling (Mikolov et al., 2013) on the Wiki-PSE text corpus. Yaghoobzadeh et al. (2019) show that PSE outperforms other standard embedding models. (ii) Rand. An embedding space with the same vocabulary and dimension size as PSE, with randomly drawn vectors. Rand is added to confirm that word representations indeed encode valid meaning contents that can be identified by diagnostic MLPs, rather than random weights. (iii) fastText. The 300-dimensional fastText (Bojanowski et al., 2017) embeddings. (iv) GloVe. The 300-dimensional space trained on 6 billion tokens (Pennington et al., 2014). Out-of-vocabulary (OOV) words are assigned randomly drawn vectors. (v) BERTw. The 768-dimensional wordpiece embeddings in BERT. We tokenize a word with the BERT tokenizer, then average its wordpiece embeddings. (vi) AVG-BERT-ℓ. (BERTw and AVG-BERT-ℓ have more dimensions, but Yaghoobzadeh et al. (2019) showed that different dimensionalities have a negligible impact on relative performance when probing for s-classes using MLPs as diagnostic classifiers.) For an annotated word in Wiki-PSE, we average all of its contextualized embeddings from BERT layer ℓ over the Wiki-PSE text corpus. Comparing AVG-BERT-ℓ with the others brings a new insight: to which extent does this “uncontextualized” variant of BERT outperform the others in encoding different s-classes of a word?
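Computing the AVG-BERT-ℓ anchor for a word can be sketched as follows (toy vectors; in the real setup the rows would be the layer-ℓ contextualized embeddings of all corpus occurrences of the word):

```python
import numpy as np

# AVG-BERT-l: the type-level vector of a word is the mean of all of its
# contextualized embeddings at layer l over the Wiki-PSE text corpus.
def avg_bert_l(contextualized_vecs):
    """contextualized_vecs: (n_occurrences, dim) layer-l embeddings."""
    return np.mean(contextualized_vecs, axis=0)

occurrences = np.array([[0.0, 2.0], [2.0, 0.0], [1.0, 1.0]])
anchor = avg_bert_l(occurrences)  # -> [1.0, 1.0]
```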
Four contextualized embedding models are considered: (i) BERT. We use the PyTorch (Paszke et al., 2019; Wolf et al., 2019) version of the 12-layer bert-base-uncased model (Wiki-PSE is uncased). (ii) P-BERT. A bag-of-word model that “contextualizes” the wordpiece embedding of an annotated word by averaging the embeddings of the wordpieces of the sentence it occurs in. P-BERT serves as a baseline against which we can quantify how much contextualization in different BERT layers helps in s-class inference. (iii) P-fastText. Similar to P-BERT, but we use fastText word embeddings. Comparing BERT with P-fastText indicates to which extent BERT outperforms standard embedding spaces when they also have access to contextual information. (iv) P-Rand. Similar to P-BERT, but with randomly drawn word embeddings. Wieting and Kiela (2019) show that a random baseline performs well in tasks like sentence classification.
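The bag-of-word baselines can be sketched as follows (toy vectors; `p_bert` stands in for all three P-* variants, which differ only in where the input embeddings come from):

```python
import numpy as np

# P-BERT / P-fastText / P-Rand: "contextualize" a target word by plain
# averaging over the embeddings of all tokens in its sentence.
def p_bert(sentence_token_vecs):
    """Bag-of-word context vector: mean over all token embeddings."""
    return np.mean(sentence_token_vecs, axis=0)

sent = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
ctx = p_bert(sent)  # -> [1.0, 2/3]

# P-Rand analogue: same averaging, but over randomly drawn embeddings.
rand_sent = np.random.default_rng(0).normal(size=sent.shape)
rand_ctx = p_bert(rand_sent)
```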
4.2.2 S-class inference results
Table 3 shows uncontextualized embedding probing results. Comparing with random weights, all embedding spaces encode informative features helping s-class inference. BERTw delivers results similar to GloVe and fastText, demonstrating our earlier point (cf. the qualitative example in Table 1) that the lowest layer of BERT is uncontextualized.
PSE performs strongly, consistent with observations in Yaghoobzadeh et al. (2019). AVG-BERT-ℓ, e.g., AVG-BERT-10, performs best among all spaces. Thus for a given word, averaging its contextualized embeddings from BERT yields a high quality type-level embedding vector, similar to “anchor words” in cross-lingual alignment (Schuster et al., 2019). As expected, the top AVG-BERT layers outperform lower layers, given the deep architecture of BERT. Additionally, AVG-BERT-0 significantly outperforms BERTw, evidencing the importance of self attention (Vaswani et al., 2017) when composing the wordpieces of a word.
Table 4 shows contextualized embedding probing results. Comparing BERT layers, a clear trend can be identified: s-class inference performance increases monotonically with higher layers. This increase levels off in the top layers. Thus, the features from deeper layers are more abstract and benefit s-class inference. It also verifies previous findings: semantic tasks are mainly solved at higher layers (Liu et al., 2019; Tenney et al., 2019a). We can also observe that the strongest contextualization occurs early at lower layers – going up to layer 1 from layer 0 brings a 4% (absolute) improvement.
The very limited contextualization improvement brought by the top two layers may explain why representations from the top layers of BERT can deliver suboptimal performance on NLP tasks (Liu et al., 2019): the top layers are optimized for the pretraining objective, i.e., predicting masked words (Voita et al., 2019), not for the contextualization of words that is helpful for NLP tasks.
BERT layer 0 performs slightly worse than P-BERT, which may be due to the fact that some attention heads in lower layers of BERT attend broadly across the sentence, producing “bag-of-vector-like” representations (Clark et al., 2019) – in fact close to the setup of P-BERT. However, starting from layer 1, BERT gradually improves and surpasses P-BERT, reaching a maximum delta of .161 in micro F1 (.831 − .670, layer 11 on test). Thus, BERT learns to better interpret the word in context, i.e., contextualize the word, when progressively going to deeper (higher) layers.
P-Rand performs strongly, but is noticeably worse than P-fastText and P-BERT. P-fastText outperforms P-BERT and BERT layers 0 and 1. We conjecture that this is due to the fact that fastText learns embeddings directly for words; P-BERT and BERT have to compose subwords to understand the meaning of a word, which is more challenging. However, starting from layer 2, BERT outperforms P-fastText and P-BERT, illustrating the effectiveness of self attention in better integrating the information from the context into contextualized embeddings than simple averaging in bag-of-word models.
Table 3 and Table 4 jointly illustrate the high quality of word representations computed by BERT. The BERT-derived uncontextualized AVG-BERT-ℓ representations – modeled on Schuster et al. (2019)’s anchor words – show superior capability in inferring s-classes of a word, performing best among all uncontextualized embeddings. This suggests that BERT’s powerful learning architecture may be the main reason for BERT’s high performance, not contextualization proper, i.e., the representation of words as contextualized embeddings on the highest layer when BERT is applied to NLP tasks. This offers the intriguing possibility of creating strongly performing uncontextualized BERT-derived models that are more compact and more efficiently deployable.
4.2.3 Qualitative analysis
§4.2.2 quantitatively shows BERT performs strongly in contextualizing words, thanks to its deep integration of information from the entire input sentence in each contextualized embedding. But there are scenarios where BERT fails. We identify two such cases in which the contextual information does not help s-class inference.
(i) Tokenization. In some domains, the annotated word and/or its context words are tokenized into several wordpieces and BERT may not be able to derive the correct composed meaning. Then the MLPs cannot identify the correct s-class from the noisy input. Consider the tokenized results of “@glutamate@-biology” and one of its contexts:
“three ne ##uro ##tra ##ns ##mit ##ters that play important roles in adolescent brain development are g ##lu ##tama ##te …”
Though “brain development” hints at a context related to “biology”, this signal can be swamped by the noise in the embeddings of other – especially short – wordpieces. Mimicking approaches (Pinter et al., 2017) have been proposed to help BERT understand rare words (Schick and Schütze, 2019).
(ii) Uninformative contexts. Some contexts do not provide sufficient information related to the s-class. For example, according to probing results on BERTw, the wordpiece embedding of “goodfellas” does not encode the meaning of s-class “art” (or movies); the context “Chase also said he wanted Imperioli because he had been in Goodfellas” of word-s-class combination “@goodfellas@-art” is not informative enough for inferring an “art” context, yielding incorrect predictions in higher layers.
4.3 Context size
We now quantify the size of the context that BERT requires for accurate s-class inference.
When probing for the s-class of a word, we define context size as the number of words surrounding it on each side. For example, a context size of 5 means 5 words to the left and 5 words to the right. This definition is with respect to words, not wordpieces, so the number of wordpieces will generally be higher. The context size is often picked heuristically in other work: Yarowsky (1992) and Gale et al. (1992) use 50 while Black (1988) uses 3–6. We experiment with a range of context sizes and then compare s-class inference results. We also include P-BERT for comparison. Note that this experiment is different from edge probing (Tenney et al., 2019b), which takes the full sentence as input. We only make input words within the context window available to BERT and P-BERT.
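Restricting the input to a context window can be sketched as follows (an illustrative helper operating on words rather than wordpieces, as the definition above specifies; the example sentence is the one from §4.3.1):

```python
# Keep only k words to the left and right of the target word before
# feeding the snippet to BERT or P-BERT.
def context_window(words, target_idx, k):
    """Return the target word with at most k words on each side."""
    lo = max(0, target_idx - k)
    return words[lo:target_idx + k + 1]

sent = "the azande speak zande which they call pa-zande".split()
# Context size 2 around "zande" (index 3) keeps the strong signal "speak".
assert context_window(sent, 3, 2) == ["azande", "speak", "zande", "which", "they"]
# Context size 0 leaves only the target word itself.
assert context_window(sent, 3, 0) == ["zande"]
```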
4.3.1 Probing results
Comparing context sizes. Larger context sizes yield higher performance for all BERT layers. Improvements are most prominent for small context sizes, e.g., 2 and 4, meaning that local features are often sufficient to infer s-classes. Further increasing the context size improves performance only marginally.
A qualitative example showing informative local features is “The Azande speak Zande, which they call Pa-Zande.” In this context, the gold s-class of “Zande” is “language” (instead of “people-ethnicity”, i.e., the Zande people). The MLPs for BERTw and for context size 0 for BERT fail to identify s-class “language”. But the BERT MLP for context size 2 predicts “language” correctly since it includes the strong signal “speak”. This context is a case of selectional restrictions (Resnik, 1993; Jurafsky and Martin, 2009), in this case possible objects of “speak”.
As local features in small context sizes already bring noticeable improvements on s-class inference performance, we hypothesize that it may not be necessary to exploit the full context in (e.g., mobile) applications where the quadratic complexity of full-sentence attention is problematic.
P-BERT shows a similar pattern when comparing context sizes. However, large context sizes such as 16 and 32 hurt performance, meaning that averaging the embeddings of too many context words swamps the identity of the annotated target word.
Comparing BERT layers. Higher layers of BERT yield better contextualized word embeddings. This phenomenon is more noticeable for large context sizes such as 8, 16 and 32. However, for small context sizes, e.g., 0, embeddings from all layers perform similarly and badly. This means that without context information, simply passing the wordpiece embedding of a word through BERT layers does not help, suggesting that contextualization is the key factor behind BERT’s impressive s-class inference performance.
Again, P-BERT only outperforms layer 0 of BERT for most context sizes, suggesting that BERT layers, especially the top layers, produce abstract and informative features for accurate s-class inference instead of naively aggregating all information within the context sentence.
4.4 Probing finetuned embeddings
Up to now, we have done “classical” probing: we extract features from weight-frozen pretrained BERT and feed them to diagnostic classifiers. Now we turn to the question: how do pretrained knowledge and probed features change when BERT is finetuned on downstream tasks?
4.4.1 Finetuning tasks
We finetune BERT on three tasks: part-of-speech (POS) tagging on the Penn Treebank (Marcus et al., 1993), binary sentiment classification on the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) and paraphrase detection on the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005). For SST-2 and MRPC, we use the GLUE train and dev sets (Wang et al., 2018). For POS, sections 0-18 of WSJ are train and sections 19-21 are dev (Collins, 2002).
Following Devlin et al. (2019), we put a softmax layer on top of the pretrained BERT, then finetune all trainable parameters. We use Adam (Kingma and Ba, 2014) with learning rate 5e-5 for 5 epochs. We save the model from the step that performs best on dev (of POS/SST-2/MRPC), extract representations from Wiki-PSE using this model and then report results on Wiki-PSE dev. Our models’ performance is close to Devlin et al. (2019).
4.4.2 Probing results
We now quantify the contextualization of word representations from finetuned BERT models. Two setups are considered: (a) directly apply the MLPs in §4.2 (trained with pretrained embeddings) to embeddings computed by finetuned BERT; (b) train and evaluate a new set of MLPs on the embeddings computed by finetuned BERT.
Comparing (a) with probing results on pretrained BERT (§4.2) gives us an intuition about how much change occurred to the knowledge captured during pretraining. Comparing (b) with §4.2 reveals whether or not the pretrained knowledge is still preserved in finetuned models.
Figure 4 shows s-class probing results on Wiki-PSE dev for pretrained BERT (“Pretrained”) and for POS-, SST-2- and MRPC-finetuned BERT in setups (a) and (b); e.g., layer 11 s-class inference performance of the POS-finetuned BERT decreases by 0.763 (0.835 → 0.072, from “Pretrained” to “POS-(a)”) when using the MLPs from §4.2 (setup (a)).
Comparing finetuning tasks. Finetuning BERT on SST-2 and MRPC introduces much smaller performance decreases than POS, and this holds for both (a) and (b). For setup (b), finetuning BERT on MRPC introduces small but consistent improvements on s-class inference. S-class inference, SST-2 and MRPC are somewhat similar tasks, all concerned with semantics; this likely explains the limited performance change (Mou et al., 2016). On the other hand, the syntactic task POS seems to encourage BERT to “propagate” the syntactic information that is represented at lower layers to the upper layers; due to their limited capacity, the fixed-size vectors in the upper layers then lose semantic information and a more significant performance drop occurs in s-class probing (see green dotted line).
Comparing BERT layers. Contextualized embeddings from BERT’s top layers are strongly affected by finetuning, especially for setup (a). In contrast, lower layers are more invariant and show s-class inference results similar to the pretrained model. Hao et al. (2019), Lee et al. (2019), Kovaleva et al. (2019) make similar observations: contextualized embeddings from lower layers are more transferable across different tasks and contextualized embeddings from top layers are more task-specific after finetuning.
Figure 5 shows the cosine similarity of the flattened self attention weights computed by pretrained, POS- and MRPC-finetuned BERT. We see that top layers are more sensitive to finetuning (darker color) while lower layers are barely changed (lighter color). Top layers have much larger changes for POS than for MRPC, in line with probing results in Figure 4.
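The layer-wise comparison behind this figure can be sketched as follows (toy attention matrices; the real computation flattens a layer's full multi-head self-attention weights from the pretrained and finetuned models):

```python
import numpy as np

# Cosine similarity between the flattened attention weights of one layer in
# two models; values near 1 mean the layer barely changed during finetuning.
def attention_similarity(attn_a, attn_b):
    a, b = attn_a.ravel(), attn_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pretrained = np.array([[0.9, 0.1], [0.2, 0.8]])
finetuned = np.array([[0.9, 0.1], [0.2, 0.8]])  # an unchanged (lower) layer
assert abs(attention_similarity(pretrained, finetuned) - 1.0) < 1e-9
```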
Comparing (a) and (b), we see that the knowledge captured during pretraining is still preserved to some extent after finetuning. For example, the MLPs trained with layer 11 embeddings computed by the POS-finetuned BERT achieve a reasonably good score of 0.735 (a 0.100 drop compared with “Pretrained” – compare black and green dotted lines in Figure 4). Thus, the semantic information needed for inferring s-classes is still present to a large extent. Finetuning on a task may contribute to large changes in the representation – similar to the projection utilized to uncover divergent information in uncontextualized word embeddings (Artetxe et al., 2018) – but relatively little information is lost as the good performance of the newly trained MLPs shows.
5 Related Work
Interpreting deep networks. Pretrained language models (McCann et al., 2017; Peters et al., 2018a; Radford et al., 2019; Devlin et al., 2019) advance NLP by contextualized token-level representations of words. A key goal of current research is to understand how these models work and what they represent on different layers. Interpretation through, e.g., feature visualization (Maaten and Hinton, 2008), attention analysis (Xu et al., 2015), and gradient inspection (Shrikumar et al., 2017) helps in better understanding their impressive performance.
Probing tasks. Probing is a recent strand of work that investigates – via diagnostic classifiers – desired syntactic and semantic features encoded in pretrained language model representations. Shi et al. (2016) show that string-based RNNs encode syntactic information. Belinkov et al. (2017) investigate word representations at different layers in NMT. Linzen et al. (2016) assess the syntactic ability of LSTM (Hochreiter and Schmidhuber, 1997) encoders and Goldberg (2019) of BERT. Tenney et al. (2019a) find that information on POS tagging, parsing, NER, semantic roles and coreference is represented on progressively higher layers of BERT. Yaghoobzadeh et al. (2019) assess the disambiguation properties of type-level word representations. Liu et al. (2019) and Lin et al. (2019) investigate the linguistic knowledge encoded in BERT. Adi et al. (2016), Conneau et al. (2018) and Wieting and Kiela (2019) study sentence embedding properties via probing. Peters et al. (2018b) probe how the network architecture affects the learned vectors. In all of these studies, probing serves to analyze representations and reveal their properties. We employ probing to investigate the contextualization of words in pretrained language models quantitatively.
Ethayarajh (2019) also quantitatively investigates contextualized embeddings using cosine-similarity-based intrinsic evaluation. Inferring s-classes, we address a complementary set of questions because we can quantify contextualization with a uniform set of semantic classes.
Finetuning. The now-dominant transfer-learning pipeline is to pretrain a language encoder on a large amount of unlabeled data via self-supervised learning, then finetune the encoder on task-specific benchmarks like GLUE (Wang et al., 2018, 2019; Peters et al., 2019). This pipeline yields good and robust results compared to models trained from scratch (Hao et al., 2019). In this work, we shed light on how BERT’s pretrained knowledge changes during finetuning by comparing the s-class inference ability of pretrained and finetuned BERT models.
6 Conclusion
We presented a quantitative study of the contextualization of words in BERT by investigating the semantic class inference capabilities of BERT. Our study complements qualitative prior work (e.g., Tenney et al. (2019a)) and the cosine-similarity-based intrinsic evaluations in Ethayarajh (2019). We investigated two key factors for successful contextualization by BERT: layer index and context size. By comparing pretrained and finetuned BERT, we showed that the similarity between pretraining and finetuning objectives heavily influences contextualized embeddings; top layers of BERT are more sensitive to the finetuning objective than lower layers. We also found that BERT’s pretrained knowledge is still preserved to a large extent after finetuning.
We showed that exploiting the full context may be unnecessary in applications where the quadratic complexity of full-sentence attention is problematic. We plan to evaluate this phenomenon on more datasets and downstream tasks in future work.
- Adi et al. (2016) Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207.
- Artetxe et al. (2018) Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, and Eneko Agirre. 2018. Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 282–291, Brussels, Belgium. Association for Computational Linguistics.
- Belinkov and Glass (2019) Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.
- Belinkov et al. (2017) Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Taipei, Taiwan. Asian Federation of Natural Language Processing.
- Beviá et al. (2007) Rubén Izquierdo Beviá, Armando Suárez Cueto, and Germán Rigau Claramunt. 2007. Exploring the automatic selection of basic level concepts.
- Black (1988) Ezra Black. 1988. An experiment in computational discrimination of english word senses. IBM Journal of research and development, 32(2):185–194.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- Ciaramita and Johnson (2003) Massimiliano Ciaramita and Mark Johnson. 2003. Supersense tagging of unknown nouns in wordnet. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 168–175. Association for Computational Linguistics.
- Clark et al. (2019) Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.
- Collins (2002) Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 1–8. Association for Computational Linguistics.
- Conneau et al. (2018) Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. Association for Computational Linguistics.
- Dai and Le (2015) Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3079–3087. Curran Associates, Inc.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Dolan and Brockett (2005) William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
- Edmonds and Cotton (2001) Philip Edmonds and Scott Cotton. 2001. SENSEVAL-2: Overview. In Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 1–5, Toulouse, France. Association for Computational Linguistics.
- Ethayarajh (2019) Kawin Ethayarajh. 2019. How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.
- Fellbaum (1998) Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.
- Gale et al. (1992) William A Gale, Kenneth W Church, and David Yarowsky. 1992. One sense per discourse. In Proceedings of the workshop on Speech and Natural Language, pages 233–237. Association for Computational Linguistics.
- Goldberg (2019) Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287.
- Hao et al. (2019) Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2019. Visualizing and understanding the effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4134–4143, Hong Kong, China. Association for Computational Linguistics.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
- Izquierdo et al. (2009) Rubén Izquierdo, Armando Suárez, and German Rigau. 2009. An empirical study on class-based word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 389–397. Association for Computational Linguistics.
- Jurafsky and Martin (2009) Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kohomban and Lee (2005) Upali Sathyajith Kohomban and Wee Sun Lee. 2005. Learning semantic classes for word sense disambiguation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 34–41, Ann Arbor, Michigan. Association for Computational Linguistics.
- Kovaleva et al. (2019) Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4356–4365, Hong Kong, China. Association for Computational Linguistics.
- Lee et al. (2019) Jaejun Lee, Raphael Tang, and Jimmy Lin. 2019. What would Elsa do? Freezing layers during transformer fine-tuning.
- Levine et al. (2019) Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2019. SenseBERT: Driving some sense into BERT. arXiv preprint arXiv:1908.05646.
- Lin et al. (2019) Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open sesame: Getting inside BERT’s linguistic knowledge. arXiv preprint arXiv:1906.01698.
- Linzen et al. (2016) Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
- Liu et al. (2019) Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.
- Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
- Magnini and Cavaglià (2000) Bernardo Magnini and Gabriela Cavaglià. 2000. Integrating subject field codes into WordNet. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00), Athens, Greece. European Language Resources Association (ELRA).
- Marcus et al. (1993) Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
- McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
- Mou et al. (2016) Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 479–489, Austin, Texas. Association for Computational Linguistics.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Peters et al. (2018a) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Peters et al. (2018b) Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium. Association for Computational Linguistics.
- Peters et al. (2019) Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14, Florence, Italy. Association for Computational Linguistics.
- Pinter et al. (2017) Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. 2017. Mimicking word embeddings using subword RNNs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 102–112, Copenhagen, Denmark. Association for Computational Linguistics.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
- Raganato et al. (2017) Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110, Valencia, Spain. Association for Computational Linguistics.
- Resnik (1993) Philip Resnik. 1993. Semantic classes and syntactic ambiguity. In HUMAN LANGUAGE TECHNOLOGY: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993.
- Schick and Schütze (2019) Timo Schick and Hinrich Schütze. 2019. Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking. arXiv preprint arXiv:1904.06707.
- Schuster et al. (2019) Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613, Minneapolis, Minnesota. Association for Computational Linguistics.
- Shi et al. (2016) Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas. Association for Computational Linguistics.
- Shrikumar et al. (2017) Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3145–3153. JMLR.org.
- Snyder and Palmer (2004) Benjamin Snyder and Martha Palmer. 2004. The English all-words task. In Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 41–43, Barcelona, Spain. Association for Computational Linguistics.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.
- Strubell et al. (2019) Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.
- Tenney et al. (2019a) Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
- Tenney et al. (2019b) Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- Voita et al. (2019) Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4387–4397, Hong Kong, China. Association for Computational Linguistics.
- Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
- Wieting and Kiela (2019) John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. arXiv preprint arXiv:1901.10444.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Transformers: State-of-the-art natural language processing.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
- Yaghoobzadeh et al. (2019) Yadollah Yaghoobzadeh, Katharina Kann, T. J. Hazen, Eneko Agirre, and Hinrich Schütze. 2019. Probing for semantic classes: Diagnosing the meaning content of word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5740–5753, Florence, Italy. Association for Computational Linguistics.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
- Yarowsky (1992) David Yarowsky. 1992. Word-sense disambiguation using statistical models of Roget’s categories trained on large corpora. In COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics.