Integrating Linguistic Theory and Neural Language Models

by Bai Li, et al.

Transformer-based language models have recently achieved remarkable results in many natural language tasks. However, performance on leaderboards is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics for modern natural language processing. In this dissertation, I present several case studies to illustrate how theoretical linguistics and neural language models are still relevant to each other. First, language models are useful to linguists by providing an objective tool to measure semantic distance, which is difficult to do using traditional methods. On the other hand, linguistic theory contributes to language modelling research by providing frameworks and sources of data to probe our language models for specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of my thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favour of analyzing word class flexibility as a directional process. In the second part of my thesis, I propose a method to measure surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisals earlier in language models than semantic and commonsense anomalies. Finally, in the third part of my thesis, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives for the interpretation of language models.


1.1 Motivation: why linguistic probing?

At first glance, it is not immediately obvious why linguistic theory is necessary or desirable for understanding language models. Progress in NLP is usually measured by a set of standard benchmarks, such as SQuAD 2 (Rajpurkar et al., 2018) for question answering or SuperGLUE (Wang et al., 2019a) for various types of language understanding. Using these benchmarks, researchers can compare their models against previous ones, and public leaderboards rank all currently available models by their relative performance (often with a human baseline for comparison). Practitioners can use these benchmarks to decide which model to apply for their own use case (possibly along with other considerations such as model size and efficiency, which in any case can also be measured in benchmarks). One may wonder: if benchmarks already serve the needs of researchers and practitioners, why do we need to involve linguistic theory?

Benchmarks suffer from several drawbacks in practice. They become “saturated”, where models quickly surpass the human baseline on the benchmark, even though they do not outperform humans in general: this leads to a loss of trust in the benchmark’s validity and hinders further progress in the field (Bowman and Dahl, 2021). Both SQuAD 2 and SuperGLUE had models that exceeded human performance within months of their release, even though they were designed to be difficult tasks, and neither question answering nor text classification is generally considered solved. This can happen when models learn to exploit biases in the benchmark dataset that enable them to predict the correct answer in a way that is unintended (and incorrect): for example, predicting that two sentences are contradictions if a negation word is present (Gururangan et al., 2018). Eliminating all such biases is difficult because benchmark datasets are annotated by crowdworkers on naturally occurring data, and both the annotators and the data tend to contain biases. Moreover, in most language understanding tasks, we lack a precise understanding of how the correct answer is related to the inputs of a task instance, so annotations must rely on the possibly differing interpretations of human annotators.

Targeted linguistic evaluation offers a remedy to these limitations of benchmarking. It draws on decades of research aimed at describing language in as precise detail as possible: which sentences are grammatical and ungrammatical, and how the meaning of a sentence is related to its surface structure. This body of knowledge gives the researcher the ability to control for biases and eliminate shortcut heuristics, revealing deficiencies in our models’ understanding of many language phenomena, such as negation (Ettinger, 2020), phrasal composition (Yu and Ettinger, 2020), and long-range dependencies (van Schijndel et al., 2019).

Probing tasks can be derived from linguistic theory via templates: generating sentences of a given structure according to a theoretical description of some language feature. An example of this is BLiMP (Warstadt et al., 2020a), a benchmark of 67 sets of template-generated sentences testing linguistic phenomena such as agreement and movement. A second and more direct approach is to take data from psycholinguistic publications and use it as probing data – these sentences are written by linguists as stimuli for human experiments and are carefully controlled for possible biases. I use both techniques extensively to obtain probing data in Chapters 5 and 6 of this thesis.
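The template approach can be illustrated with a minimal sketch. The lexicon, the template, and the function name below are hypothetical and far smaller than anything in BLiMP; the sketch only shows how a structural description (subject-verb agreement) can be turned into controlled minimal pairs.

```python
from itertools import product

# Hypothetical mini-lexicon for illustration only; it is not taken from BLiMP.
SUBJECTS = [("the dog", "sg"), ("the dogs", "pl")]
VERBS = {"sg": "barks", "pl": "bark"}
ADVERBS = ["loudly", "often"]


def agreement_minimal_pairs():
    """Return (grammatical, ungrammatical) pairs that differ only in verb agreement."""
    pairs = []
    for (subject, number), adverb in product(SUBJECTS, ADVERBS):
        wrong_number = "pl" if number == "sg" else "sg"
        good = f"{subject} {VERBS[number]} {adverb}."
        bad = f"{subject} {VERBS[wrong_number]} {adverb}."
        pairs.append((good, bad))
    return pairs


pairs = agreement_minimal_pairs()
for good, bad in pairs:
    print(f"{good}  /  *{bad}")
```

Because each pair differs in exactly one word, a model that prefers the grammatical member cannot be relying on differences in length or vocabulary.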

Although linguistic probing offers certain advantages over standard NLP benchmarks, it is not meant to be a replacement for benchmarking – the two serve different needs of the research community. The crucial difference is that probing aims to deepen our understanding of existing widely used language models, whereas developing new models to achieve a high performance on the probing task is of lesser importance. Therefore, linguistic probing can be viewed as a form of post-hoc interpretability, offering a global view of the model’s capabilities by identifying areas of weakness from the perspective of linguistic theory (Madsen et al., 2021).

1.2 Bridging NLP and linguistics

Figure 1.1: The main contributions of this thesis. Chapter 4 on word class flexibility uses LMs as evidence for a debate in linguistic theory; Chapter 5 on linguistic anomalies and Chapter 6 on construction grammar apply linguistic frameworks and data toward LM probing.

There has been relatively little contact between natural language processing and theoretical linguistics since the deep learning revolution. While the two fields share some similarities – both involve data in human languages – their primary goals are different. NLP aims to develop computational systems to solve language tasks as accurately as possible, whereas linguistics aims to describe properties of languages and how humans process them. Given these divergent goals, it is not obvious how advances in either field should be relevant to the other. Some researchers have attempted to incorporate linguistic and structural knowledge into deep neural models, but these methods have not shown substantial performance improvements over models without linguistic knowledge (Lappin, 2021). The likely reason for this failure is that language models are able to learn implicit structural properties of language through the usual training procedure (Hewitt and Manning, 2019; Miaschi et al., 2020), so providing explicit knowledge is redundant.

Instead of improving model performance directly, the more promising avenue for linguistics to contribute to NLP has been linguistic probing. Previous work has probed LMs for knowledge of many linguistic phenomena, which I will survey in Chapter 3. In this dissertation, I expand this body of work by studying two linguistic phenomena which have not been covered in earlier work: how LMs represent different types of linguistic anomalies (Chapter 5), and how they understand argument structure constructions (Chapter 6).

In the opposite direction, it has been even more difficult for deep learning to contribute to linguistic theory. Neural network probing experiments do not provide information about how humans process language, and as a result, this body of work has been rarely cited in linguistic publications (Baroni, 2021). My research offers an avenue of contribution in this direction: using deep models as a tool to measure semantic distance between occurrences of a word in different contexts (Chapter 4). Semantic distance is a metric of importance for theories of word class flexibility, but is difficult to measure using traditional methods. My three projects in this thesis serve to bridge the gap between NLP and linguistics and provide examples of interdisciplinary collaboration for both research communities.

1.3 The syntax-semantics interface in language models

A traditional dichotomy in the study of language is between syntax and semantics. Syntax studies well-formed phrases and their internal structures, whereas semantics studies how meaning is derived from syntactic structures. Many linguistic phenomena involve the interplay between syntax and semantics, often with complex effects that pose challenging problems for linguistic theories. My thesis focuses on three phenomena on the interface between syntax and semantics, and explores these phenomena with novel experimental methods involving language models.

In Chapter 4, I study the phenomenon of word class flexibility, where certain “flexible” words may be used as different parts of speech: for example, in English, the words work or sleep may be used either as a noun or a verb. Whether all languages possess a distinction between nouns and verbs is a controversial question in linguistic typology, due to competing definitions of word classes. Word classes are defined by a combination of morphosyntactic distribution (e.g., only nouns may follow a determiner in English), and semantic criteria (e.g., nouns are associated with objects and verbs with actions; Van Lier and Rijkhoff (2013)). However, linguistic theories disagree over the analysis of flexible words: are they lexemes with underspecified word class, or cases of conversion, whereby a lexeme undergoes a derivational process into a different word class? My work presents evidence supporting the conversion theory, using measures of semantic variability.

Next, in Chapter 5, I explore sentences containing syntactic, semantic, and commonsense violations. Various linguistic theories have proposed differences between syntactic and semantic violations. In generative syntax, Chomsky (1957) proposed that ungrammaticality (e.g., “furiously sleep ideas green colorless”) should be distinguished from semantic anomalies (e.g., “colorless green ideas sleep furiously”): the latter being a meaningless but grammatically well-formed sentence. In psycholinguistics, studies found that semantic violations triggered the N400 event-related potential (ERP) in the brain whereas morphosyntactic violations triggered a different P600 ERP, although this dichotomy was abandoned after further evidence (Kutas et al., 2006). Inspired by this psycholinguistic work, I probe language models for whether they exhibit any differences in internal processing in response to varying types of anomalies.

Finally, in Chapter 6, I adopt a construction grammar framework to probe language models. Construction grammar proposes that all linguistic knowledge is encoded in constructions, which map between form and meaning, and there is no separation between lexical and grammatical knowledge. The theory of argument structure constructions (ASCs) holds that certain syntactic patterns are associated with semantic meaning independently of the main verb (Goldberg, 1995). For example, the ditransitive construction (e.g., “Bob cut Joe the bread”) is associated with the transfer of the direct object to the indirect object recipient, no matter which verb is used. In contrast, lexicalist theories assume that the main verb is responsible for assigning semantic roles to each of its syntactic arguments. In psycholinguistics, sentence sorting and priming studies supported the theory of ASCs in humans; I adapt several of these studies to show evidence for ASCs in language models as well.

1.4 Structure of thesis

The rest of my thesis is structured as follows.

  • Chapter 2 gives a survey of modern neural language models, including static word embeddings based on the distributional hypothesis, RNN and LSTM sequence models, and Transformer-based LMs such as BERT and GPT. I also discuss the benefits and tradeoffs of some common diagnostic classifier schemes used for probing LMs.

  • Chapter 3 surveys the recent literature connecting linguistic theories to LMs, including behavioural probes of syntactic structure and probes targeting the internal representations of LMs. Next, I discuss neural network probing methods adapted from psycholinguistics, and applications of LMs to linguistic theory.

  • Chapter 4 presents my cross-lingual study on word class flexibility: my method leverages contextual embeddings from LMs as a source of automated semantic distance judgments, and I find evidence supporting the theoretical view of word class flexibility as a directional process.

  • Chapter 5 presents my work on probing for linguistic anomalies within intermediate layers of LMs. Inspired by neurolinguistic event-related potential (ERP) studies, I propose an anomaly detection method based on Gaussian models and find that different internal activation patterns are triggered in response to different types of linguistic anomalies.

  • Chapter 6 presents my work on probing LMs for construction grammar: specifically, a family of constructions known as argument structure constructions. I adapt several human psycholinguistic studies to be suitable for LMs, and show that LMs exhibit knowledge of argument structure constructions similarly to humans.

  • Chapter 7 concludes my thesis by summarizing my contributions and highlighting some areas for future improvement.

1.5 Relationship to published work

Several chapters of this thesis have previously appeared in peer-reviewed publications:

  • Chapter 4. Bai Li, Guillaume Thomas, Yang Xu, and Frank Rudzicz. “Word class flexibility: A deep contextualized approach”. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).

  • Chapter 5. Bai Li, Zining Zhu, Guillaume Thomas, Yang Xu, and Frank Rudzicz. “How is BERT surprised? Layerwise detection of linguistic anomalies”. Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021).

  • Chapter 6. Bai Li, Zining Zhu, Guillaume Thomas, Frank Rudzicz, and Yang Xu. “Neural reality of argument structure constructions”. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).

Additionally, the following peer-reviewed publications are not included in this thesis but were published during my doctorate:

  • Bai Li, Nanyi Jiang, Joey Sham, Henry Shi, and Hussein Fazal. “Real-world Conversational AI for Hotel Bookings”. IEEE Annual Conference on Artificial Intelligence for Industries (AI4I 2019).

  • Bai Li, Jing Yi Xie, and Frank Rudzicz. “Representation Learning for Discovering Phonemic Tone Contours”. 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology (SIGMORPHON at ACL 2020).

  • Bai Li and Frank Rudzicz. “TorontoCL at CMCL 2021 Shared Task: RoBERTa with Multi-Stage Fine-Tuning for Eye-Tracking Prediction”. Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL at NAACL 2021).

  • Zining Zhu, Jixuan Wang, Bai Li, and Frank Rudzicz. “On the data requirements of probing”. Findings of the Association for Computational Linguistics: ACL 2022.

2.1 Introduction

Modern natural language processing has achieved much of its success by leveraging a relatively small number of foundational models that serve as building blocks for many diverse applications. These models are trained on vast amounts of unstructured language data, allowing them to learn general-purpose representations of human language and transfer this knowledge towards downstream tasks. The architectures of these models have evolved over the years with research progress: in Section 2.2, I begin with static word vector models based on distributional semantics, which are typically used in the input layer in sequential RNN and LSTM neural networks (Section 2.3). Next, in Section 2.4, I cover contextualized language models such as BERT, which use a two-step procedure in which the model is pre-trained on large amounts of text, then fine-tuned to perform specific tasks. Finally, in Section 2.5, I describe some methods to probe the internal representations of large language models and the difficulties and shortcomings of these probing methodologies.

2.2 Distributional semantics and word embeddings

Figure 2.1: Visualizing distributional semantic models. Each individual dimension is semantically meaningless, but similar words are closer together in the vector space, as measured by Euclidean or cosine distance. Figure adapted from Boleda (2020).

Word embeddings (or vectors) are one of the most basic building blocks of natural language processing, encoding the meaning of a word into a fixed-dimensional vector of real numbers. These vectors are learned using large corpus data, and individual dimensions of these vectors do not contain any meaning; the meaning of words is captured by their geometric relationship to other word vectors. Words that are related in meaning are close together in vector space (measured by Euclidean or cosine distance), while words that are unrelated are farther apart (Figure 2.1). Sometimes, word vectors exhibit vector arithmetic properties: for instance, the vector difference between Canada and Ottawa is close to the vector difference of China and Beijing because both word pairs exhibit the country-capital relationship.
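As a minimal sketch of these geometric relationships, the following uses four hand-picked three-dimensional vectors. The numbers are hypothetical and chosen only so that the country-capital offsets line up; real embeddings are learned from a corpus and have hundreds of dimensions.

```python
import math

# Hypothetical toy vectors, written by hand for illustration.
VECTORS = {
    "Canada": [1.0, 0.0, 0.2],
    "Ottawa": [1.0, 1.0, 0.2],
    "China": [0.0, 0.1, 0.9],
    "Beijing": [0.1, 1.0, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def difference(u, v):
    """Element-wise vector difference u - v."""
    return [a - b for a, b in zip(u, v)]


# The two country-minus-capital offsets should point in nearly the same direction.
canada_offset = difference(VECTORS["Canada"], VECTORS["Ottawa"])
china_offset = difference(VECTORS["China"], VECTORS["Beijing"])
offset_similarity = cosine_similarity(canada_offset, china_offset)
print(round(offset_similarity, 3))
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice alongside Euclidean distance.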

The theoretical basis for word embeddings is from distributional semantics. Harris (1954) proposed the distributional hypothesis: words which occur in similar contexts have similar meaning, and conversely, a word’s meaning can be defined by its contextual distribution. For example, given the sentence “The ___ stayed up all night to finish her paper”, two likely completions for the blank are student and postdoc, so these two words have similar meaning according to distributional semantics.

There are many word embedding models based on the distributional hypothesis; one of the first and most popular neural approaches was word2vec (Mikolov et al., 2013). Word2vec used a shallow neural network to either predict a word given its context (the continuous bag-of-words model), or predict the context of a word (the skip-gram model). In the continuous bag-of-words variant, the neural network is given one-hot encodings of the words in a context window surrounding a target word, and uses a classification layer to predict that middle word. After training, the word embeddings are derived from the hidden layer of this neural network.
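The difference between the two word2vec objectives is easiest to see in the training examples each one consumes. The sketch below only builds those examples; the function name and window size are illustrative assumptions, and the neural network itself is left out.

```python
def training_examples(tokens, window=2):
    """Build (context -> target) examples for CBOW and (target -> context word)
    examples for skip-gram from one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Words up to `window` positions to the left and right of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))
        skipgram.extend((target, context_word) for context_word in context)
    return cbow, skipgram


sentence = "the student stayed up all night".split()
cbow, skipgram = training_examples(sentence, window=1)
print(cbow[1])       # context words paired with the middle word
print(skipgram[:3])  # one (target, context word) pair per context position
```

Continuous bag-of-words makes one prediction per position from the whole window, whereas skip-gram makes one prediction per (target, context word) pair.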

A popular alternative method to learn word vectors was GloVe (Pennington et al., 2014). Rather than predicting word identities from context, GloVe learns vectors that reconstruct a word-to-word co-occurrence matrix, where each entry in the matrix is the count of how many times that pair of words co-occurred in the same context window in a corpus. In practice, word2vec and GloVe produce word vectors that behave similarly, and neither one consistently outperforms the other.
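The co-occurrence matrix itself is simple to build. The following sketch counts window co-occurrences in a two-sentence toy corpus; it uses plain unweighted counts and stops before the vector-fitting step, so it illustrates only the input to such a method, not the method itself. The function name and window size are assumptions.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = [
    ["the", "test", "was", "easy"],
    ["the", "test", "was", "difficult"],
]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("test", "was")], counts[("was", "easy")])
```

Storing the counts sparsely, keyed by word pair, avoids allocating a full vocabulary-by-vocabulary matrix when most pairs never co-occur.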

Both word2vec and GloVe have a weakness in that they treat each word as an atomic unit, so the internal structure of words is not utilized, and they cannot handle out-of-vocabulary words. fastText (Bojanowski et al., 2017) proposed to enrich the word2vec skip-gram method with sub-word information: in addition to learning a vector for each word, fastText also learns vectors for character n-grams so that an out-of-vocabulary word can be represented by the sum of its character n-gram vectors.
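A minimal sketch of the sub-word idea follows. The boundary markers `<` and `>`, the n-gram lengths, and the tiny vector table are assumptions made for illustration; a trained model would supply the n-gram vectors.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def oov_vector(word, ngram_vectors, dim):
    """Represent a word as the sum of the vectors of its known character n-grams."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + x for t, x in zip(total, vec)]
    return total


# Hypothetical two-dimensional n-gram vectors.
toy_ngram_vectors = {"<ca": [1.0, 0.0], "at>": [0.0, 2.0]}
print(char_ngrams("cat", 3, 3))
print(oov_vector("cat", toy_ngram_vectors, dim=2))
```

Because the representation is built from pieces, a word never seen in training still receives a vector as long as some of its n-grams were seen.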

Word embeddings have some weaknesses that limit their utility in some situations. First, since they always generate a static vector for each word, they cannot capture homographs, which have the same form but multiple senses. For example, the noun and verb senses of the word bear are completely unrelated, but both are assigned the same word embedding. The result is a less-than-ideal representation for both senses, since the optimization procedure produces an embedding that lies somewhere in between the two senses in an attempt to capture both simultaneously. In the next section, I discuss contextual embeddings, which generate better representations of words with multiple senses. Another weakness is that words that have similar vectors do not always have similar meanings: one can only conclude that they occur in similar distributional contexts. For example, in the sentence “The test was very ___”, both easy and difficult are reasonable completions, despite having opposite meanings. Thus, it is generally difficult to distinguish synonyms from antonyms and hypernyms using vector space distance (Lenci, 2018).

2.3 Sequential models and contextual embeddings

Word embeddings provide representations for individual words, but they are unable to represent word order in longer texts such as sentences or paragraphs. In order to work with text, sequential models such as recurrent neural networks (RNNs; Elman (1990)) are needed. An RNN is a type of neural network that reads sequential input one token at a time; the input may consist of one-hot encodings of words or lower-dimensional static word embeddings. The RNN keeps track of a hidden state representing the input read thus far and updates it after reading each token. The final state may be used as a representation for the entire sequence.
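The hidden-state update can be sketched in a few lines. The tanh nonlinearity, the weight names, and the hand-written weights below are assumptions for illustration; in practice the weights are learned.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def rnn_step(W_xh, W_hh, bias, x, h):
    """One update: combine the current input with the previous hidden state."""
    pre = [a + c + d for a, c, d in zip(matvec(W_xh, x), matvec(W_hh, h), bias)]
    return [math.tanh(p) for p in pre]


def encode(inputs, W_xh, W_hh, bias):
    """Read the inputs one at a time and return the final hidden state."""
    h = [0.0] * len(bias)
    for x in inputs:
        h = rnn_step(W_xh, W_hh, bias, x, h)
    return h


# Hypothetical weights and two one-hot "words".
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.4], [-0.6, 0.1]]
bias = [0.0, 0.0]
word_a, word_b = [1.0, 0.0], [0.0, 1.0]

print(encode([word_a, word_b], W_xh, W_hh, bias))
print(encode([word_b, word_a], W_xh, W_hh, bias))
```

The two calls receive the same words in a different order and produce different final states, which is exactly the word-order sensitivity that a bag of static embeddings lacks.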

Figure 2.2: A recurrent neural network (RNN) for language modelling. At each step, the RNN predicts the next word in the sequence, given a hidden state that is derived from the previous words. The inputs may be one-hot encodings of the words, or static word embeddings such as word2vec or GloVe.

A popular variant of the RNN is the long short-term memory network (LSTM; Hochreiter and Schmidhuber (1997)). Plain RNNs have difficulty processing long-range dependencies in text since information in the hidden layer must be retained across many steps. In contrast, LSTMs use a memory cell in addition to the hidden state and a system of input, output, and forget gates to modify the memory cell, allowing the model to more easily retain long-term information. Sometimes, it is useful to read input in both directions: a bidirectional LSTM (bi-LSTM) consists of a forward and backward LSTM whose outputs are concatenated together.
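The role of the gates can be seen in a one-dimensional sketch of a single LSTM step. The parameter names and values are hypothetical, and the gate equations follow the commonly used formulation rather than any specific implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, p):
    """One step of a scalar LSTM: returns the new hidden state and memory cell."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h + p["b_g"])  # candidate value
    c_new = f * c + i * g            # keep part of the old memory, add new content
    h_new = o * math.tanh(c_new)     # expose part of the memory as hidden state
    return h_new, c_new


# Hypothetical parameters: forget gate held open, input gate held shut.
params = {name: 0.0 for name in (
    "w_i", "u_i", "b_i", "w_f", "u_f", "b_f",
    "w_o", "u_o", "b_o", "w_g", "u_g", "b_g",
)}
params["b_f"] = 20.0
params["b_i"] = -20.0

h, c = 0.0, 0.7
for _ in range(50):
    h, c = lstm_step(1.0, h, c, params)
print(c)
```

With the forget gate near one and the input gate near zero, the memory cell keeps its initial value almost unchanged over 50 steps, which is the mechanism that helps with long-range dependencies.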

RNNs and LSTMs are used in several common setups for supervised prediction. For sequence classification, such as predicting the topic of a sentence, the last hidden representation may be fed into a discrete classification layer. When the task requires per-token classification, such as part-of-speech tagging, the hidden representation at each token can be fed into a classification layer. When the output is a sequence, such as machine translation, the last hidden representation may be fed into a decoder RNN or LSTM, which produces a sequential output.

A common unsupervised use case for sequential models is language modelling. In language modelling, the model is given a sequence of words and predicts which word is likely to come next; after training on a corpus of unlabelled text, the model can then be used to generate similar sequences. Language models also assign a probability to any given sequence: the perplexity is defined as the negative log likelihood of this probability:

$$\mathrm{PPL}(S) = -\log P(w_1, \dots, w_n) = -\sum_{i=1}^{n} \log P(w_i \mid w_1, \dots, w_{i-1})$$

where each $w_i$ is a word in the sentence $S = w_1, \dots, w_n$. A related concept, surprisal, is the negative log likelihood of a single token given previous context: $-\log P(w_i \mid w_1, \dots, w_{i-1})$. Perplexity is used as an unsupervised evaluation metric measuring how well a language model fits a corpus; it can also be used to rank sentences for relative acceptability according to the language model. A fair comparison using perplexity requires that the models have the same vocabulary and that the sentences be the same length.
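As a minimal sketch of these quantities, the following computes per-token surprisals and the sequence-level negative log likelihood described in this section. The function names and the conditional probabilities are made up for illustration; a real language model would supply them.

```python
import math


def surprisals(conditional_probs):
    """Surprisal of each token: -log P(token | previous tokens)."""
    return [-math.log(p) for p in conditional_probs]


def sequence_negative_log_likelihood(conditional_probs):
    """Negative log likelihood of the whole sequence, i.e. the sum of surprisals."""
    return sum(surprisals(conditional_probs))


# Hypothetical conditional probabilities for a three-token sentence.
probs = [0.5, 0.25, 0.5]
print(surprisals(probs))
print(sequence_negative_log_likelihood(probs))
```

Since every token contributes a non-negative surprisal, the unnormalized sum grows with sentence length, which is one reason comparisons are only fair between sentences of the same length.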

When trained for language modelling, neural models learn hidden representations that capture general properties of language that are useful for many tasks; we call these language models (LMs). This approach benefits from leveraging large amounts of unlabelled data, which is available in much larger quantities compared to labelled data required for supervised learning. Once trained, transfer learning can be applied to fine-tune the LM to specific tasks. Universal Language Model Fine-tuning (ULMFiT; Howard and Ruder (2018)) was the earliest model to use the transfer learning paradigm that is now dominant in natural language processing. ULMFiT first trained an LSTM model on a large corpus (the pre-training step), followed by a small amount of additional training on task-specific data (the fine-tuning step); the model used a slanted triangular learning rate schedule and gradual unfreezing of layers to prevent forgetting pre-trained information during the fine-tuning stage.
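A slanted triangular schedule raises the learning rate quickly and then lowers it slowly. The sketch below is one way to write such a schedule; the parameter names and default values (`lr_max`, `cut_frac`, `ratio`) are assumptions for illustration and should be checked against the ULMFiT paper before being relied on.

```python
import math


def slanted_triangular_lr(step, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at `step`: short linear warm-up, then long linear decay."""
    cut = math.floor(total_steps * cut_frac)  # step at which the peak is reached
    if step < cut:
        p = step / cut
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))
    # p runs from 0 to 1 and back to 0; the rate never drops below lr_max / ratio.
    return lr_max * (1 + p * (ratio - 1)) / ratio


schedule = [slanted_triangular_lr(t, 100) for t in range(101)]
print(schedule[0], schedule[10], schedule[100])
```

The short warm-up lets the model move quickly toward a region suited to the new task, and the long decay refines the parameters without large updates that could overwrite pre-trained information.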

Another transfer method using unsupervised learning is Embeddings from Language Models (ELMo). ELMo trained a multilayer bidirectional LSTM on a corpus, then used the hidden representations of the forward and backward LSTMs concatenated together as contextual embeddings. That is, rather than mapping each word to a static word vector, ELMo generates a sequence of contextual vectors for a sequence of tokens; these contextual vectors can then be substituted into any model that takes a sequence of word vectors as input. By incorporating information from surrounding words, contextual vectors are able to generate different representations for different senses of a word like bear, overcoming the word sense ambiguity problem in static word vectors.

2.4 Transformer-based language models

Figure 2.3: The BERT model (Devlin et al., 2019), consisting of 12 layers of Transformer modules (Vaswani et al., 2017). The model is first pre-trained on masked language modelling and next sentence prediction, then fine-tuned for downstream tasks. BERT may take either a single sentence as input, or two sentences separated by a [SEP] token.

The next major architectural innovation after LSTMs was the Transformer module (Vaswani et al., 2017). Unlike sequential models, which process tokens one at a time, Transformers use the self-attention mechanism to process the entire sequence simultaneously; each token is associated with a positional encoding to retain word order information. The advantage of the Transformer architecture over RNNs and LSTMs is that it avoids problems with long-range dependencies where information needs to be carried forward across many steps: with self-attention, the maximum path length is constant (equal to the number of layers in the model), and does not depend on the distance between the input words. The Transformer module was originally employed for machine translation, using a stack of 6 encoder and 6 decoder Transformer layers; it subsequently replaced recurrent layers in many neural models.
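A stripped-down sketch of self-attention shows both properties mentioned above. It uses a single head and omits the learned query, key, and value projections, so it is a simplification rather than the full Transformer layer; the input vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with identity projections.

    Every position attends to every other position in one step,
    regardless of how far apart they are in the sequence.
    """
    d = len(X[0])
    outputs = []
    for query in X:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in X]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, X)) for j in range(d)])
    return outputs


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(X)
swapped = self_attention([X[1], X[0], X[2]])
print(out[1])
print(swapped[0])
```

Swapping two input tokens merely swaps the corresponding outputs, so without positional encodings the mechanism has no access to word order.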

Transformers were soon applied to unsupervised language modelling. OpenAI proposed the Generative Pre-trained Transformer (GPT; Radford et al. (2018)): similar to ULMFiT, GPT was pre-trained on the language modelling objective and then fine-tuned to perform specific tasks, but used Transformers instead of LSTMs. Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. (2019)) incorporated bidirectional context into language modelling: whereas GPT predicted the next word from previous context, BERT proposed the masked language modelling (MLM) objective, where tokens were randomly replaced with [MASK] tokens which the model was trained to predict. BERT’s pre-training procedure consisted of MLM as well as next-sentence prediction (NSP), where the model predicts whether two sentences are consecutive in the original text or were chosen at random. After pre-training, BERT is fine-tuned to perform classification, sequence tagging, or sentence pair classification by adding a linear layer on top of the last layer and training for a small number of steps on task-specific data (Figure 2.3). When released, BERT immediately broke new records on many natural language benchmarks, leading to many efforts to improve upon it, study its internals, and apply it to downstream tasks.
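The construction of a masked language modelling example can be sketched as follows. This is a simplified version in which every selected token becomes [MASK]; the masking probability and function name are assumptions for illustration.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK] and record what the model must predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = token  # position -> original token
        else:
            masked.append(token)
    return masked, targets


sentence = "the student stayed up all night to finish her paper".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(targets)
```

The model sees the context on both sides of each masked position, which is what makes the objective bidirectional.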

BERT can be used to generate contextual embeddings and for language modelling. At each layer, BERT generates hidden representations at each token that can be used as contextual representations: typically, the last or second-to-last layer is used. BERT is not naturally suited to language modelling because it assumes bidirectional context is available, but Salazar et al. (2020) proposed a substitute for perplexity: each token is replaced one at a time with the [MASK] token, and BERT computes the log-likelihood score for each masked token. The sum of these log-likelihood scores (the pseudo-log-likelihood, from which a pseudo-perplexity is derived) may be used for evaluation or to compare sentences for relative acceptability. However, pseudo-perplexity cannot be directly compared to normal perplexity scores from forward language models like LSTMs and GPT. Both forward and bidirectional models have their merits: bidirectional models generate better embeddings as they have access to context in both directions, but forward models are useful for conditionally generating text given an initial prompt.
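The mask-one-token-at-a-time procedure can be sketched independently of any particular model. Here `toy_scorer` is a hypothetical stand-in for a masked language model: it ignores the context and returns fixed unigram log probabilities, so only the masking loop is meaningful.

```python
import math


def pseudo_log_likelihood(tokens, masked_log_prob):
    """Sum of log P(token | sentence with that token masked) over all positions."""
    total = 0.0
    for i, token in enumerate(tokens):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += masked_log_prob(masked, i, token)
    return total


def toy_scorer(masked_tokens, position, candidate):
    """Hypothetical scorer: context-independent probabilities, for illustration only."""
    unigram = {"the": 0.5, "dog": 0.25, "barks": 0.25}
    return math.log(unigram.get(candidate, 1e-6))


score = pseudo_log_likelihood(["the", "dog", "barks"], toy_scorer)
print(score)
```

Each sentence requires one forward pass per token, which makes this scoring more expensive than scoring with a forward language model.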

After the release of BERT, many models have improved its performance by modifying its architecture, data, and training procedure. RoBERTa (Liu et al., 2019b) used the same architecture as BERT, but obtained superior results by removing the NSP task and training on a larger dataset. XLNet (Yang et al., 2019) proposed permutation language modelling, where a random permutation of the sentence is generated and the model learns to predict the next word in the permutation, given previous words and their positional encodings; this avoids the pretrain-finetune discrepancy in BERT where the [MASK] token is seen only in pre-training and not during fine-tuning. ALBERT (Lan et al., 2020) proposed a factorized embedding representation so that models with larger hidden layers can be trained using the same amount of memory, and replaced the NSP task with sentence-order prediction, where the model predicts the order of two sentences that were originally consecutive. ELECTRA (Clark et al., 2020) proposed the replaced token detection pre-training task to improve efficiency over MLM: instead of predicting masked tokens, the model is given corrupted text and predicts which tokens were original and which ones were replacements, using a smaller network to generate replacement tokens.
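The memory saving from a factorized embedding can be checked with simple arithmetic: a single vocabulary-by-hidden matrix is replaced by a vocabulary-by-embedding matrix followed by an embedding-by-hidden projection. The sizes below are hypothetical round numbers, not the configuration of any released model.

```python
def full_embedding_params(vocab_size, hidden_size):
    """Parameters in a single vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size, embedding_size, hidden_size):
    """Parameters when the embedding is split into two smaller matrices."""
    return vocab_size * embedding_size + embedding_size * hidden_size


# Hypothetical sizes for illustration.
vocab_size, embedding_size, hidden_size = 30000, 128, 4096
full = full_embedding_params(vocab_size, hidden_size)
factorized = factorized_embedding_params(vocab_size, embedding_size, hidden_size)
print(full, factorized)
```

With a factorized embedding, increasing the hidden size adds parameters only to the small projection matrix, not to the vocabulary-sized one.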

Transformer language models have been trained for other languages as well, often in a massively multilingual setting so that a single model is able to process text in many languages. Multilingual BERT (mBERT) was released by the authors of BERT, using the same architecture and trained on Wikipedia text. XLM (Conneau and Lample, 2019) added a translation language modelling task to mBERT, where the model predicts a masked token from context and a translation of the sentence into another language: this allows parallel corpora to be leveraged for pre-training. XLM-RoBERTa (XLM-R; (Conneau et al., 2020)) trained XLM on a larger dataset and obtained results competitive with or surpassing the best monolingual models in each language.

2.5 Probing classifiers and their shortcomings

The success of BERT on natural language tasks quickly led many researchers to investigate what information is contained within BERT, and how the information is spread across its 12 layers. This task is nontrivial, as Transformer models internally consist of millions of neurons that are not easily interpretable. A popular approach is to use probing classifiers (or probes): a probe is a classifier that takes BERT embeddings as input and is trained on some target task, and its performance on this task is taken as a measure of how much information about the task is contained in the embeddings. The purpose of the probe is not to perform well on the task (since fine-tuning BERT would usually result in better performance), but to measure the extent to which the embeddings contain information relevant to the task, without any fine-tuning.

Figure 2.4: Illustration of the edge probing method proposed by Tenney et al. (2019b). Here, the probe is given a sentence “Last week New York City had its worst …” and a span “New York City” and the probing task is to classify the type of named entity represented by the span (Location).

Tenney et al. (2019b) introduced the edge probing method to determine how much more information is contained in contextual embeddings compared to static baselines. The edge probing setup assumes that the input consists of a sentence and up to two spans of consecutive tokens within the sentence, and the output consists of a single label. This formulation is applicable to part-of-speech tagging, dependency arc labelling, and coreference resolution, among others.

The probing model (Figure 2.4) first uses BERT to generate contextual vectors for each token. Then, the mix step learns a task-specific linear combination of the layers; the projection and self-attention pooling produce a fixed-length span representation, and finally, a feedforward layer outputs the classification label. The probe weights are initialized randomly and trained using gradient descent while the BERT weights are kept frozen. Using this method, Tenney et al. (2019b) found the biggest advantage of contextual over static embeddings occurred in syntactic tasks. In a follow-up work, Tenney et al. (2019a) inspected the layerwise weights learned in the mix step and found that semantic tasks learned higher weights on the upper layers of BERT compared to the syntactic tasks, suggesting that the upper layers contained more semantic information. I will discuss the linguistic implications of this experiment and other layerwise probing work in Section 3.3.
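The mix step can be illustrated with a small NumPy sketch. It assumes the common scalar-mixing formulation, in which learnable per-layer logits are normalized with a softmax and used to weight the layers; the function name `scalar_mix`, the scale `gamma`, and the array shapes are choices made for this illustration and are not taken from the original implementation.

```python
import numpy as np

def scalar_mix(layer_vectors: np.ndarray, layer_logits: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Softmax-weighted combination of per-layer token vectors.

    layer_vectors: (num_layers, num_tokens, hidden_dim), from the frozen model
    layer_logits:  (num_layers,), the parameters the probe would learn
    """
    weights = np.exp(layer_logits - layer_logits.max())  # numerically stable softmax
    weights /= weights.sum()
    # Contract the layer axis: result has shape (num_tokens, hidden_dim).
    return gamma * np.tensordot(weights, layer_vectors, axes=1)

rng = np.random.default_rng(0)
layers = rng.normal(size=(13, 5, 8))      # e.g. embedding layer + 12 Transformer layers
mixed = scalar_mix(layers, np.zeros(13))  # equal logits give a plain average
print(mixed.shape)
```

After training, the learned softmax weights can be read off directly, which is what makes a layerwise comparison between tasks possible.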

One criticism of probing classifiers (especially complex ones such as edge probing) is that high probe performance could mean either that the representation is rich in information, or that the probe itself is powerful and is learning the task. Hewitt and Liang (2019) proposed to use control tasks: randomized versions of the probing task constructed so that high performance is only possible if the probe itself learns the task. They defined selectivity as the difference between the real and control task performance; a good probe should have high selectivity. In their experiments, the simple linear probes had the highest selectivity. Alternatively, Voita and Titov (2020) proposed a minimum description length (MDL) probe based on information theory that simultaneously quantifies the probing accuracy and the complexity of the probe model. Their MDL probe obtained similar results to Hewitt and Liang (2019) but avoided some instability difficulties of the standard probe training procedure. Suffice it to say that there is still no agreement on the best method to probe the quality of LM representations.
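As a rough illustration of control tasks and selectivity, the sketch below builds control labels by assigning each word type a fixed random label, so that the labels can only be predicted by memorizing word identities. The helper names and the accuracy figures are hypothetical and chosen for illustration; they are not results from Hewitt and Liang (2019).

```python
import random

def make_control_labels(tokens, num_labels, seed=0):
    """Assign every word type a fixed, randomly chosen label."""
    rng = random.Random(seed)
    type_to_label = {}
    labels = []
    for tok in tokens:
        if tok not in type_to_label:
            type_to_label[tok] = rng.randrange(num_labels)
        labels.append(type_to_label[tok])  # same type always gets the same label
    return labels

def selectivity(real_task_accuracy, control_task_accuracy):
    """Difference between real-task and control-task performance."""
    return real_task_accuracy - control_task_accuracy

tokens = ["the", "cat", "saw", "the", "dog"]
control = make_control_labels(tokens, num_labels=5)

# Hypothetical accuracies for two probes, for illustration only.
linear_selectivity = selectivity(0.95, 0.70)
mlp_selectivity = selectivity(0.97, 0.90)
print(control, linear_selectivity, mlp_selectivity)
```

In this made-up comparison the more expressive probe is slightly more accurate on the real task but far less selective, which is the pattern that selectivity is designed to expose.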

Many papers have been published in the last few years that probe various aspects of how BERT and other Transformer models work. This subfield, commonly known as BERTology, explores questions such as what knowledge is contained in which neurons, information present at each layer, the role of architectural choices and pre-training regimes, etc. A detailed survey of BERTology is beyond the scope of this thesis; Rogers et al. (2021) is a recent and comprehensive survey of the field. In the next chapter, I will provide a narrower survey of BERTology work that involves linguistic theory, such as constructing test suites of syntactic phenomena and probing based on psycholinguistics.

3.1 Introduction

In this chapter, I survey the recent literature on probing language models for various aspects of linguistic knowledge. Many papers have been recently published that try to align neural networks with linguistic theory in some way; although each paper’s specific methodology is different, probing methods can generally be categorized into two types: behavioural probes and representational probes. I will discuss behavioural probes in Section 3.2 and representational probes in Section 3.3. To maintain a reasonable scope, this chapter will mostly cover probing work involving transformer models (i.e., BERT and later models).

Behavioural probes apply a black-box approach to probing: they assume little about the internals of the model, and carefully construct input sentences so that the way that the model responds to the input reveals information about its internal capabilities and biases. This approach is relatively robust and likely to remain relevant in spite of future advances in language modelling research because it makes no assumptions about model architecture. In contrast, representational probes assume full access to the model’s internals (for example, its trained parameters, its contextual vectors, and attention weights when fed an input sentence). Because neural representations consist of millions of real numbers, sometimes complex machinery is required to make these results interpretable for humans, and one challenge is choosing which tools are most appropriate. Unlike in human psycholinguistics, we can easily access every state in a neural model and at any point during its processing, making representational probing a powerful and flexible methodology.

Next, in Section 3.4, I will survey LM probing research based on psycholinguistics. The field of psycholinguistics has developed many methods for indirectly exposing language processing facilities in the human brain: common experimental paradigms include lexical decision tests, priming, and EEG studies. This vast literature provides a rich starting point for investigating similar linguistic phenomena in language models. Finally, in Section 3.5, I discuss some recent efforts using language models to provide evidence for linguistic theories: while neural networks have not yet made a significant impact in theoretical linguistics, we are beginning to see initial progress in this direction as well.

3.2 Behavioural probes for syntax

3.2.1 Agreement

Much of the initial work in behavioural probing investigated agreement between the subject and the verb. In English, the subject and verb of a sentence must agree on number and person, for example:

a. The boy likes music.
b. *The car are yellow.
c. Those narwhals come from Tuktoyaktuk.

There are several ways this can be operationalized in a behavioural probe. The first possibility is to treat it as a binary classification task: deciding whether a sentence is acceptable or unacceptable. This can be done in a supervised setting (i.e., training on one set of acceptable and unacceptable sentences and evaluating on a different set), or an unsupervised setting (i.e., picking a threshold such that sentences with language model probability above the threshold are considered acceptable). However, the binary classification setup has the drawback that it does not control for the length and contents of the sentence, so that an acceptable sentence containing rare words (such as Those narwhals come from Tuktoyaktuk) may have a lower probability than an unacceptable sentence containing common words (such as *The car are yellow) (Lau et al., 2017).

A different approach resolves this issue by framing the task as a forced choice between the two sentences of a minimal pair: sentences that differ only in the critical region of interest and are identical in all other respects, for example:

a. The car (is/*are) yellow.
b. Those narwhals (*comes/come) from Tuktoyaktuk.

This presents a natural setup for language models supporting masked word prediction such as BERT and RoBERTa: one can feed the sentence The boy [MASK] music into the model, obtain a probability distribution for the masked token, and consider the model correct if its probability for likes is higher than for like. For forward sequential models like LSTMs and GPT, the input stimulus is usually modified so that only the prefix before the critical token is required to predict the correct completion. For example, The boy likes music would be truncated to The boy …; the model reads this input and generates a prediction for the subsequent token, and is judged as correct if the probability for likes is higher than for like. Construction of input stimuli for forward sequential models is therefore more restrictive than for masked language models, since the critical token must be the final token in the sentence.
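The forced-choice evaluation can be sketched as follows. The scoring function is deliberately a toy: `TOY_BIGRAM_LOGPROB` is a hypothetical hand-written table of bigram log-probabilities standing in for a real language model, and `forced_choice_accuracy` is a name chosen for this illustration.

```python
import math

def forced_choice_accuracy(pairs, score):
    """Fraction of minimal pairs where the acceptable sentence scores higher."""
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)

# Hypothetical bigram log-probabilities; a real probe would query an LM instead.
TOY_BIGRAM_LOGPROB = {
    ("car", "is"): math.log(0.6),
    ("car", "are"): math.log(0.1),
    ("narwhals", "come"): math.log(0.5),
    ("narwhals", "comes"): math.log(0.05),
}
UNSEEN = math.log(0.01)  # back-off score for any bigram not in the table

def toy_score(sentence):
    words = sentence.lower().split()
    return sum(TOY_BIGRAM_LOGPROB.get(bigram, UNSEEN) for bigram in zip(words, words[1:]))

pairs = [
    ("The car is yellow", "The car are yellow"),
    ("Those narwhals come from Tuktoyaktuk", "Those narwhals comes from Tuktoyaktuk"),
]
acc = forced_choice_accuracy(pairs, toy_score)
print(acc)
```

Because the two sentences in each pair share every word outside the critical region, the shared bigrams contribute equally to both scores, and the comparison isolates the agreement contrast regardless of sentence length or word frequency.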

Linzen et al. (2016) tested LSTM models on number agreement between subject and verb, using natural sentences from Wikipedia. In addition to simple sentences, they examined more complex sentences containing attractors – nouns between the subject and verb with opposite number from the subject. Such sentences are relatively uncommon but may occur if the subject is modified by a relative clause or prepositional phrase, for example, The car behind those trees (is/*are) yellow. These are difficult cases because the simple heuristic of agreeing with the most recent noun does not work; humans are also known to sometimes produce the incorrect inflection in the presence of attractors (Bock and Miller, 1991). Linzen et al. (2016) found that the supervised models had higher error rates on sentences with attractors (but still better than random), while the unsupervised models were unable to predict agreement better than random when attractors are present.

Gulordava et al. (2018) investigated the capabilities of LSTMs on long-distance agreement in nonsensical sentences, where all lexical items are replaced with random nonce words that match in part-of-speech and morphological features. The purpose of this experiment was to remove the possibility that the models rely on lexical and semantic cues to predict agreement. The authors generated nonce sentences by perturbing UD treebank sentences in 4 languages: English, Italian, Hebrew, and Russian; they found that in all languages, the LSTM performed worse on the nonce than the original sentences, but better than random. They concluded that the model could not have achieved its results solely through pattern-matching on surface structure, and instead it must have learned some form of deeper linguistic structure.

3.2.2 Other syntactic phenomena

Agreement is a relatively simple linguistic phenomenon, well suited for an initial case study of behavioural probing, and after its initial success, researchers soon expanded behavioural probing to other phenomena. Standard syntax textbooks (e.g., Carnie (2013)) describe a wide assortment of phenomena that may be adapted into benchmark tests for language models.

One of the most direct approaches was the Corpus of Linguistic Acceptability (CoLA; Warstadt et al. (2019)). CoLA collected 10,657 sentences from various linguistic publications including textbooks and dissertations. Each sentence was labelled in the original publications as either acceptable or unacceptable. The corpus was divided into a training set and a test set drawn from different linguistic publications, and models were evaluated by Matthews correlation coefficient with the ground truth. CoLA has since been incorporated into the GLUE text classification benchmark (Wang et al., 2019b).

Although CoLA contains a wide variety of phenomena, which is desirable for a general-purpose benchmark, its heterogeneity of source material is a hindrance for analysis of specific phenomena. The authors annotated the corpus with the presence or absence of 15 broad classes of phenomena and 63 fine-grained phenomena, but this still resulted in a loosely-related set of sentences within each phenomenon; they instead found template-based generation to be better suited for phenomenon-specific analysis.

Marvin and Linzen (2018) introduced the technique of generating syntactic stimuli using templates: they used a recursive context-free grammar to generate random sentences of different linguistic structures. The advantage of this method is that it allows precise control over the structure of each sentence and how many of each type to generate (since certain structures rarely appear in natural corpora). Furthermore, template generation avoids lexical or any other potential confounds, similar to Gulordava et al. (2018). The authors generated agreement samples containing various types of complements and relative clauses, and additionally examined LSTM performance on reflexive anaphora and negative polarity items (NPIs).

| Phenomenon | Acceptable Example | Unacceptable Example |
| --- | --- | --- |
| Anaphor agreement | Many girls insulted themselves. | Many girls insulted herself. |
| Argument structure | Rose wasn’t disturbing Mark. | Rose wasn’t boasting Mark. |
| Binding | Carlos said that Lori helped him. | Carlos said that Lori helped himself. |
| Control/raising | There was bound to be a fish escaping. | There was unable to be a fish escaping. |
| Determiner-noun agr. | Rachelle had bought that chair. | Rachelle had bought that chairs. |
| Ellipsis | Anne’s doctor cleans one important book and Stacey cleans a few. | Anne’s doctor cleans one book and Stacey cleans a few important. |
| Filler-gap | Brett knew what many waiters find. | Brett knew that many waiters find. |
| Irregular forms | Aaron broke the unicycle. | Aaron broken the unicycle. |
| Island effects | Which bikes is John fixing? | Which is John fixing bikes? |
| NPI licensing | The truck has clearly tipped over. | The truck has ever tipped over. |
| Quantifiers | No boy knew fewer than six guys. | No boy knew at most six guys. |
| Subject-verb agr. | These casseroles disgust Kayla. | These casseroles disgusts Kayla. |

Table 3.1: Example acceptable and unacceptable sentences for the 12 types of linguistic phenomena in BLiMP (Warstadt et al., 2020a).

The Benchmark of Linguistic Minimal Pairs (BLiMP; Warstadt et al. (2020a)) extended the template generation method to 12 different syntactic phenomena and 67 different paradigms (i.e., sub-phenomena). These phenomena include agreement, argument structure, filler-gaps, and NPI licensing (Table 3.1 gives an example for each of the phenomena). They generated 1000 sentences for each paradigm using templates and a lexicon containing about 3,000 items, annotated with various grammatical features to ensure the validity of the generated sentences. They obtained human forced choice judgments from MTurk for quality assurance and to establish a human baseline, and evaluated several forward sequential models (n-gram, LSTM, Transformer-XL, and GPT-2) in an unsupervised setting.

Hu et al. (2020) constructed a similar syntactic test suite of 34 linguistic phenomena, aiming to cover an introductory syntax textbook (Carnie, 2013). However, rather than using binary minimal pairs as in BLiMP, Hu et al. (2020) defined success criteria differently depending on the task, where the model is considered correct if its perplexity on the sentences simultaneously satisfies several pre-defined inequalities. This design allows more control over lexical confounds and task instances involving more than two sentences (for example, a 2x2 design is used for subject-verb agreement). They manipulated model architectures and the amount of training data, and found that the models’ performance on their test suite is not always correlated with perplexity (a common evaluation metric for LM performance).

3.2.3 Gradience of acceptability

Most studies on neural linguistic acceptability so far have adopted a binary scale for acceptability: each sentence is assumed to be either acceptable or not; the model’s response is either correct or incorrect. The final reported metric is either accuracy or Matthews correlation coefficient (MCC), both of which require a discrete threshold and do not allow a gradient response. Niu and Penn (2020) criticized the use of accuracy and MCC for linguistic acceptability benchmarks because they ignore the magnitude of LM probabilities, thus obscuring the differences between the models that are meant to be compared.
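The point that thresholded metrics discard magnitude can be made concrete with a small sketch. The MCC formula below is the standard one computed from confusion counts; the two sets of scores are hypothetical and are meant only to show a barely-confident and a highly-confident model receiving identical accuracy and MCC.

```python
import math

def confusion_counts(scores, labels, threshold):
    """Count TP, TN, FP, FN when a sentence is 'acceptable' iff score > threshold."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score > threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

labels = [1, 1, 0, 0]                       # 1 = acceptable, 0 = unacceptable
hesitant_model = [0.51, 0.52, 0.49, 0.48]   # hypothetical scores near the threshold
confident_model = [0.99, 0.95, 0.05, 0.01]  # hypothetical scores far from it

mcc_hesitant = matthews_corrcoef(*confusion_counts(hesitant_model, labels, 0.5))
mcc_confident = matthews_corrcoef(*confusion_counts(confident_model, labels, 0.5))
print(mcc_hesitant, mcc_confident)
```

Both models are scored as perfect even though one separates the two classes by a margin of only 0.02, which is the kind of difference that a gradient measure would preserve.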

Linguistic theories disagree about whether grammars should assign a binary or gradient value to sentence well-formedness. Traditional generative grammar theories (Chomsky, 1957, 1965) assume sentence grammaticality is a binary property; any disagreement among native speakers is attributed to performance factors such as processing difficulty, while only competence (the stable grammatical knowledge that speakers possess) is relevant to linguistic theory. Lau et al. (2017) provided empirical evidence of acceptability as a gradient phenomenon: they obtained a set of corrupted sentences by feeding English sentences through a round-trip machine translation procedure into another language and back. Then they obtained acceptability judgments on these sentences from MTurk and found that the distribution of ratings more closely resembled ratings of a continuous variable (body weight) than a discrete variable (integer parity). Finally, they evaluated several language models on this dataset by computing the correlation between LM probabilities and human acceptability judgments, and found that several of the models correlated with human judgments nearly as strongly as humans correlated with each other.

Wilcox et al. (2021b) investigated the magnitude of surprisals when an LM encounters an ungrammatical section in a sentence. They constructed a test suite of syntactic minimal pairs similar to BLiMP, and collected human reaction data using the Interpolated Maze paradigm. In this maze task, human participants were presented with a sentence one word at a time, and had to choose between the correct word and a distractor at each step; the response time captures the processing difficulty and is expected to be higher for ungrammatical sections. They compared human response times to surprisals for several LMs, and found that the models consistently under-predicted the magnitude of human processing difficulty. Therefore, they concluded that LMs are less sensitive to syntactic violations than humans.

3.3 Representational probes of LM embeddings

3.3.1 Layerwise probing

Transformer language models have a relatively homogeneous architecture, generating a fixed-dimension vector representation at each layer for each input token. Depending on the use case, these vectors can be probed directly, or they can be collapsed into sentence vectors before probing by taking an average of the vectors for each token. The layerwise architecture makes it straightforward to probe for what linguistic information is contained in the embeddings at each layer.

Tenney et al. (2019a) applied the edge probing method (Section 2.5) on BERT embeddings on a variety of tasks, and found that the probe learned a higher weighting for the upper layers when the task was more semantic (e.g., semantic proto-roles and relation classification), while the middle layers were preferred for syntactic tasks (e.g., part-of-speech tagging and dependency arc labelling). This suggested that the upper layers derived complex semantic representations from simpler syntactic representations in the lower layers, similar to the stages in a traditional NLP pipeline. Jawahar et al. (2019) applied the SentEval toolkit (Conneau and Kiela, 2018) to BERT embeddings, with a similar result: the lower layers were the best at capturing surface features, the middle layers contained the most syntactic information, and the upper layers contained the most semantics.

Kelly et al. (2020) probed BERT as well as some static word vector models for the extractability of syntactic constructions from a sentence embedding. The task was to determine which of two constructions was used in the input sentence, for example, a ditransitive dative or a prepositional dative. They measured spatial separability in the embedding space as well as probing classifier accuracy, and found that the middle layers of BERT were the most sensitive to the type of syntactic construction present in the sentence.

Yu and Ettinger (2020) assessed several transformer LMs for their ability to represent compositional phrases. In their setup, phrasal representations were probed for paraphrase similarity: whether two short phrases had similar meaning or not. In order to control for lexical overlap, they experimented with AB-BA pairs, where the model must determine the similarity of a two-word phrase and its reversal (e.g., law school has low similarity with school law, whereas adult female has a similar meaning to female adult). Despite trying several different models and methods for generating phrase representations, they obtained poor results on the paraphrase similarity task when lexical overlap was controlled, indicating that the LMs have strong sensitivity to word content but not to nuanced composition.

Miaschi et al. (2020) created a suite of probes for a wide range of linguistic features ranging from surface to lexical and syntactic, and probed layers of BERT for the ability to predict these linguistic features. They found that most features had the best performance in one of the middle layers; performance decreased in the upper layers, and dropped drastically in the last layer. They also experimented with fine-tuning: all linguistic features performed worse after the model was fine-tuned, especially the upper layers, agreeing with earlier results by Liu et al. (2019a) that the upper layers become more specialized towards the specific task when fine-tuned.

Overall, the various layerwise probing experiments reach a similar conclusion about how linguistic information is propagated through the layers of transformer LMs. The initial layer is a stream of non-contextual token embeddings, so only word-level and surface features are available. The lower, middle, and upper layers gradually extract morphosyntactic and semantic features that serve as a good general-purpose representation of language and are useful to many different downstream tasks. Finally, the last layer aggregates information from the previous layers into a representation optimized for the specific task (i.e., masked token prediction in the case of pretrained LMs).

3.3.2 Structural probes for syntax

Hewitt and Manning (2019) proposed a structural probe for discovering syntax trees contained in contextual embeddings. Their probe measured the extent to which the depth of each token within a dependency parse tree can be uncovered via a linear transformation of their embeddings. This is sufficient to demonstrate the existence of dependency trees in the embeddings because a minimum spanning tree algorithm can be applied to extract the best (undirected) dependency tree.

Specifically, their structural probe finds a linear transformation matrix $B$ such that the squared L2 distance between embeddings $h_i$ and $h_j$ after applying $B$ approximates the distance between the corresponding words in the parse tree:

$$d_B(h_i, h_j)^2 = \big(B(h_i - h_j)\big)^\top \big(B(h_i - h_j)\big)$$

The linear transformation is learned via gradient descent by minimizing the total difference to the ground truth tree distance across all pairs of tokens from the same sentence in a corpus. The evaluation metrics used were undirected unlabelled attachment score (UUAS) and the Spearman correlation of tree distances compared to the ground truth. They found that dependency trees could be recovered from BERT and ELMo embeddings, but not from the baselines. The middle layers of BERT performed the best for extracting dependency trees, and the performance increased as the rank of $B$ approached 64, but increasing the rank beyond that did not further increase the performance. The authors thus concluded that syntactic information was encoded in a fairly low-rank subspace in BERT’s embeddings.
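The evaluation side of the structural probe can be sketched with NumPy: project the embeddings with a matrix `B`, compute squared L2 distances between all token pairs, extract a minimum spanning tree, and score it with UUAS. Everything here is a toy under stated assumptions: `H` and `B` are hand-constructed rather than learned, the gold tree is a simple chain, and the function names are chosen for this illustration.

```python
import numpy as np

def probe_distances(H, B):
    """Squared L2 distance between every pair of token embeddings after projection.

    H: (n_tokens, hidden_dim) embeddings; B: (rank, hidden_dim) probe matrix.
    """
    P = H @ B.T                           # projected embeddings, (n_tokens, rank)
    diff = P[:, None, :] - P[None, :, :]  # all pairwise differences
    return (diff ** 2).sum(axis=-1)

def minimum_spanning_tree(D):
    """Prim's algorithm on a dense distance matrix; returns undirected edges (i, j), i < j."""
    n = D.shape[0]
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or D[i, j] < D[best]):
                    best = (i, j)
        in_tree.add(best[1])
        edges.add((min(best), max(best)))
    return edges

def uuas(predicted_edges, gold_edges):
    """Fraction of gold undirected edges recovered by the predicted tree."""
    return len(predicted_edges & gold_edges) / len(gold_edges)

# Toy data: the last embedding dimension is noise that B ignores.
H = np.array([[0, 0, 0, 9], [1, 0, 0, -2], [1, 1, 0, 4], [1, 1, 1, 0]], dtype=float)
B = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float)
gold_edges = {(0, 1), (1, 2), (2, 3)}     # a chain-shaped gold tree
gold_dist = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :])

D = probe_distances(H, B)
pred_edges = minimum_spanning_tree(D)
print(D, pred_edges, uuas(pred_edges, gold_edges))
```

In the toy data each tree edge corresponds to a step along a separate orthogonal axis, so the squared distance between two tokens equals the number of edges on the path between them; this is one way to see why a squared distance, rather than a plain distance, can reproduce tree path lengths exactly.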

Chi et al. (2020) extended the structural probe to a multilingual setting, using mBERT and data from Universal Dependencies (UD; Zeman et al. (2019)) in 11 languages. They found that the same linear probe was able to reconstruct dependency tree distances in all languages, demonstrating that all languages share a common subspace for syntax in mBERT. Next, they applied a t-SNE visualization on vector representations of dependency arcs, and found that similar dependencies across languages were clustered together. This is surprising given that neither mBERT nor the structural probe had access to labelled dependency information during training.

White et al. (2021) extended the structural probing method to support nonlinear probes, recasting the probe as kernelized metric learning. This enabled the use of common nonlinear kernels such as the radial basis function (RBF) kernel, which yield nonlinear probes without increasing the probe complexity. They noted that the mathematical structure of the RBF kernel resembled BERT’s self-attention, and hypothesized that this resemblance explained why the RBF kernel outperformed the linear one.

Although the dependency formalism (used by Universal Dependencies) is a popular syntactic framework in computational linguistics, it is only one among many proposed formalisms for syntax. Depending on which formalism is used in probing, we may draw different conclusions about LMs’ syntactic capabilities. Kulmizev et al. (2020) compared the structural probe on UD and an alternative syntactic framework, Surface-Syntactic Universal Dependencies (SUD; (Gerdes et al., 2018)); they found that BERT and ELMo performed better with UD than SUD annotations in most languages. Similarly, in semantic role labelling, Kuznetsov and Gurevych (2020) found that the results of probing differed depending on which formalism was used to annotate the data (PropBank, VerbNet, FrameNet, or Semantic Proto-Roles). Any linguistic probing work must commit to a particular formalism, thus probing research depends substantially on the underlying linguistic theory and the availability of data annotated in these frameworks.

3.3.3 Probes involving LM training

The most common approach to linguistic LM probing has so far been to probe pretrained models (examining either their outputs or internal representations). From a practical perspective, these experiments have the advantage that they can be performed without large amounts of data or computational resources. Nonetheless, some recent work has explored using pretraining or fine-tuning as a tool to gain insights about the linguistic properties of LMs. By incorporating LM training, these methods reveal which properties are learnable by a family of models or architectures, whereas static probing methods examine which capabilities have been learned by specific models such as BERT and RoBERTa.

Language models are trained on extremely large amounts of data, compared to what an average human is exposed to during language learning. For example, RoBERTa is trained on 30B tokens, while children are exposed to no more than 3-11M words of input a year (Hart and Risley, 2003), or somewhere on the order of 100M words by the time they reach puberty. Zhang et al. (2021) tested the performance of MiniBERTa models (Warstadt et al. (2020b); variants of RoBERTa trained on 1M to 30B words) on a variety of benchmarks, and found that only about 10-100M words are sufficient to learn most syntactic features, but commonsense knowledge and most downstream tasks require much more data. A similar result was reported by Liu et al. (2021), who probed RoBERTa throughout its pretraining process, finding that linguistic knowledge was acquired quickly while commonsense knowledge and reasoning were acquired only much later. Thus, it appears that linguistic knowledge is relatively easy for LMs to acquire.

On the other hand, some studies have argued that LMs are still less efficient than humans at acquiring syntax. Huebner et al. (2021) trained BabyBERTa, a smaller version of RoBERTa using child-directed corpora as training data, and evaluated its syntactic capabilities on an adapted version of BLiMP where the vocabulary was replaced with appropriate child-level words. BabyBERTa had poor performance on the syntactic tests relative to RoBERTa, despite receiving a similar amount of input as a 6-year-old child. van Schijndel et al. (2019) showed that most syntactic tasks can be learned from limited data, but some more complex tasks fall short of human performance, and even training on very large datasets does not improve performance to human levels.

Another experimental paradigm involving training is probing LMs for inductive biases. Human languages tend to have syntactic rules that operate on a structural level rather than the surface level (for example, subject-auxiliary inversion moves the structurally highest auxiliary, not the linearly last auxiliary), and Chomsky (1981) hypothesized that humans have an innate structural bias that helps them learn language from limited input.

Warstadt and Bowman (2020) proposed a poverty of stimulus design for probing, where they fine-tuned BERT to classify grammaticality using sentences designed so that it is ambiguous whether a surface or structural rule is required. At test time, they used a different set of sentences to determine whether BERT had learned a surface or structural rule; they found that BERT prefers to learn structural over surface rules when both are equally probable. In a subsequent study, Warstadt et al. (2020b) pretrained MiniBERTa models on between 1M and 30B tokens to investigate the inductive biases of models trained on varying amounts of data. They found that the smaller models preferred surface generalizations while larger models preferred structural generalizations, which may explain why larger LMs are so successful at downstream tasks but only after crossing a certain data threshold.

3.4 Adapting psycholinguistics to LMs

Psycholinguistics is a field that shares many commonalities with LM probing, but seeks to understand human language processing rather than neural models. Both fields contend with the challenges of indirectly deducing the mechanisms of an entity whose internals are inaccessible or difficult to interpret; thus, psycholinguistics, an older field, provides a rich source of methods and data that can be applied to LM probing. Several differences make LM probing a generally easier endeavour than psycholinguistics: first, each neuron in an LM can be inspected in precise detail whereas human brains can only be imaged with relatively coarse-grained techniques such as EEG and fMRI; second, LMs can be cheaply run on large amounts of data with fully deterministic results. Still, methods from psycholinguistics sometimes require substantial modification to be suitable for LMs.

3.4.1 The N400 response and surprisal

A popular method of measuring human responses to language is using electroencephalography (EEG) to detect electrical activity via electrodes placed at the scalp. Event-related potentials (ERPs) are brain responses to specific stimuli derived from EEG signals, and are good indicators for tracking automatic responses during language processing. A well-known ERP is the N400 response, characterized by a negative potential roughly 400ms after a stimulus, and is generally associated with the stimulus being semantically anomalous with respect to the preceding context. The N400 response is not specific to language (for example, it has been observed using images or environmental sounds as stimuli), nor is it produced by all linguistic anomalies (for example, morphosyntactic violations do not trigger the N400); precisely what conditions trigger the N400 is still an open question (Kutas and Federmeier, 2011). Early psycholinguistic studies proposed that semantic anomalies produce the N400 response while syntactic anomalies produce the P600, a different type of ERP, but this dichotomy was challenged in later studies (Kutas et al., 2006). Presently, the N400 is not known to be aligned with any single component in a linguistic theory.

The N400 is correlated with cloze probability, so it is often used as an approximate estimate of probability in humans. Frank et al. (2015) found that the N400 was correlated with surprisals in a variety of RNN models. Michaelov and Bergen (2020) gathered a large set of psycholinguistic stimuli from different papers about the N400 response, and ran them on RNN models, comparing the LM surprisals with human responses. They confirmed that LM surprisals generally predicted N400 responses, but in some cases such as in morphosyntactic or event structure violations, the LM surprisals were more sensitive than the N400 in humans; thus, the N400 cannot be explained by surprisal alone.

Ettinger (2020) probed BERT using stimuli from three psycholinguistic studies involving commonsense knowledge, semantic role reversals, and negation. These studies were selected because they were cases where cloze probabilities were low, yet the N400 response did not trigger: in other words, they represented hard cases where automatic processing mechanisms could not reveal the extent of the surprisal and deliberate judgment was required. Ettinger found that BERT performed reasonably well on the commonsense reasoning task, was less sensitive than humans to role reversal anomalies, and failed completely at understanding negation (by filling in words of the matching category for prompts like “A robin is not a [MASK]”).

3.4.2 Priming in LMs

Priming is another popular experimental paradigm in psycholinguistics: it is a phenomenon where exposure to prior stimuli (e.g., lexical items or structures) affects responses to later stimuli (Pickering and Ferreira, 2008). In structural priming, when people are exposed to a particular syntactic structure, they are more likely to produce the same structure later (Bock and Loebell, 1990), and are able to process sentences of the same structure more quickly. This effect is useful for understanding our cognitive mechanisms for language, since priming is evidence that two sentences share a common internal representation.

Structural and lexical priming have been explored for LM probing, but the priming methodology does not naturally carry over to LMs, which do not have any concept of temporality. Misra et al. (2020) simulated lexical priming by prepending the prime word (either by itself or situated in a carrier sentence) before the target sentence in which BERT predicts a masked word. They found that BERT was facilitated by a relevant prime word only when the target sentence was unconstrained, but when the target sentence was constrained, the prime word instead acted as a distractor and lowered the LM’s probability for the correct word. A limitation of this method is that concatenating a prime and a target sentence creates an unnatural combination that does not occur in natural text, possibly leading to unpredictable out-of-distribution effects.

Prasad et al. (2019) formulated structural priming as model fine-tuning: to measure a priming effect, they proposed to fine-tune a model on a set of sentences with one syntactic structure, then measure the surprisal on another set of target sentences. A priming effect is considered to exist if the surprisal for the target sentences is lower after fine-tuning than the original LM; they found evidence of priming in LSTM models for various types of relative clauses.
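The comparison described above can be illustrated with a minimal sketch. It assumes we already have the probabilities an LM assigns to each target token before and after fine-tuning; the function names and the numbers are hypothetical, and the exact measure used by Prasad et al. (2019) may differ in its details (for example, in the logarithm base or in how sentences are aggregated).

```python
import math


def surprisal(probability):
    """Surprisal in bits of an event with the given probability."""
    return -math.log2(probability)


def mean_surprisal(token_probs):
    """Average surprisal over a list of per-token probabilities."""
    return sum(surprisal(p) for p in token_probs) / len(token_probs)


def priming_effect(probs_before, probs_after):
    """Drop in mean surprisal on the target tokens after fine-tuning.

    A positive value means the targets became less surprising,
    which is read as evidence of a priming effect.
    """
    return mean_surprisal(probs_before) - mean_surprisal(probs_after)


# Hypothetical probabilities for the same target tokens,
# from the original model and from the fine-tuned model.
before = [0.25, 0.25]
after = [0.5, 0.5]
effect = priming_effect(before, after)
print(f"priming effect: {effect:.2f} bits")
```

In this toy case the mean surprisal falls from 2 bits to 1 bit, so the effect is positive and would count as priming under the criterion above.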

3.5 Using LMs as evidence for linguistic theory

During the past few years since the rise of transformer models, there has been an abundance of papers probing the linguistic capabilities of these models. Probing work owes a great deal to the linguistic theories that it is based upon; however, the contribution has so far mostly been unidirectional – neural network probing papers have had negligible impact on theoretical linguistics (Baroni, 2021). Experiments involving LMs have given us an increasingly detailed view of how they process language, but these experiments cannot offer insights about how humans process language. Even when LMs exhibit similar linguistic responses to stimuli as human subjects, it is unclear what exactly the implications are for human linguistics, since the neural architectures share few similarities with the human brain. Yet recently, a small number of papers have proposed ways in which neural networks may contribute to linguistic theory.

van Schijndel and Linzen (2021) considered two theories explaining the processing difficulty of garden path sentences. In the traditional two-stage theory, readers maintain a partial parse while reading a sentence and are forced to reanalyze the parse in a garden path situation, causing a processing delay. In the more recent single-stage theory, readers maintain several parses at the same time, and there is a delay at each word from the processing required to integrate the word into all available parses. The authors used LSTMs to measure the surprisal of each word based on the preceding context, and found that the LSTM surprisals consistently under-predicted the magnitude of processing delays in garden path sentences. By using LSTM surprisals as a measure of predictability, this experiment provided empirical evidence against the single-stage theory, which predicts a linear relationship between reading time and predictability.

Wilcox et al. (2021a) studied the learnability of island constraints, where subtle constraints sometimes prevent movement of a wh-phrase. Linguistic nativism theories have argued that innate knowledge of universal grammar is necessary for children to learn island constraints from limited data, while opponents have denied this claim. The authors tested LMs on sensitivity to a variety of island constraints, finding that LMs are generally successful at this task. This constituted evidence against nativism theories, since LMs are general domain learners that cannot possibly possess innate knowledge of human grammar, yet were able to learn island constraints from data.

The next chapter presents my work on applying LMs toward theories of word class flexibility, contributing a case study that demonstrates the utility of LMs for linguistic theory.

4.1 Introduction

In this chapter, we present a computational methodology to quantify semantic regularities in word class flexibility using contextual word embeddings. Word class flexibility refers to the phenomenon whereby a single word form is used across different grammatical categories, and is considered one of the challenging topics in linguistic typology (Evans and Levinson, 2009). For instance, the word buru in Mundari can be used as a noun to denote ‘mountain’, or as a verb to denote ‘to heap up’ (Evans and Osada, 2005).

There is an extensive literature on how languages vary in word class flexibility, either directly (Hengeveld, 1992; Vogel and Comrie, 2000; Van Lier and Rijkhoff, 2013) or through related notions such as word class conversion (with zero-derivation) (Vonen, 1994; Don, 2003; Bauer and Valera, 2005a; Manova, 2011; Ştekauer et al., 2012). However, existing studies tend to rely on analyses of small sets of lexical items that may not be representative of word class flexibility in the broad lexicon. Critically lacking are systematic analyses of word class flexibility across many languages, and existing typological studies have only focused on qualitative comparisons of word class systems.

We take, to our knowledge, the first step towards computational quantification of word class flexibility in 37 languages, taken from the Universal Dependencies project (Zeman et al., 2019). We focus on lexical items that can be used both as nouns and as verbs, i.e., noun-verb flexibility. This choice is motivated by the fact that the distinction between nouns and verbs is the most stable in word class systems across languages: if a language makes any distinction between word classes at all, it will likely be a distinction between nouns and verbs (Hengeveld, 1992; Evans, 2000; Croft, 2003). However, our understanding of cross-linguistic regularity in noun-verb flexibility is impoverished.

We operationalize word class flexibility as a property of lemmas. We define a lemma as flexible if some of its occurrences are tagged as nouns and others as verbs. Flexible lemmas are sorted into noun dominant lemmas, which occur more frequently as nouns, and verb dominant lemmas that occur more frequently as verbs. Our methodology builds on contextualized word embedding models (e.g., ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019)) to quantify semantic shift between grammatical classes of a lemma, within a single language. This methodology can also help quantify metrics of flexibility in the lexicon across languages.

We use our methodology to address one of the most fundamental questions in the study of word class flexibility: should this phenomenon be analyzed as a directional word-formation process similar to derivation, or as a form of underspecification? Derived words are commonly argued to have a lower frequency of use and a narrower range in meaning compared to their base (Marchand, 1964; Iacobini, 2000). If word class flexibility is a directional process, we should expect that flexible lemmas are subject to more semantic variation in their dominant word class than in their less frequent class. We also test the claim that noun-to-verb flexibility involves more semantic shift than verb-to-noun flexibility. While previous work has explored these questions, it remains challenging to quantify semantic shift and semantic variation, particularly across different languages.

We present a novel probing task that reveals the ability of deep contextualized models to capture semantic information across word classes. Our utilization of deep contextual models predicts human judgment on the spectrum of noun-verb flexible usages including homonymy (unrelated senses), polysemy (different but related senses), and word class flexibility. We find that BERT outperforms ELMo and non-contextual word embeddings, and that the upper layers of BERT capture the most semantic information, which resonates with existing probing studies (Tenney et al., 2019a). Our source code and data are available at:

4.2 Linguistic background and assumptions

4.2.1 Types of flexibility

The phenomenon of word class flexibility has been analyzed in different ways. One way is to assume the existence of underspecified word classes. For instance, Hengeveld (2013) claims that basic lexical items in Mundari belong to a single class of contentives that can be used to perform all the functions associated with nouns, verbs, adjectives or adverbs in a language like English. Alternatively, word class flexibility can be analyzed as a form of conversion, i.e., as a relation between words that have the same form and closely related senses but different word classes, such as a fish and to fish in English (Adams, 1973). Conversion has been analyzed as a derivational process that relates different lexemes (Jespersen, 1924; Marchand, 1969; Quirk et al., 1985), or as a property of lexemes whose word class is underspecified (Farell, 2001; Barner and Bale, 2002). We use word class flexibility as a general term that subsumes these different notions. This allows us to assess whether there is evidence that word class flexibility should be characterized as a directional word formation process, rather than as a form of underspecification.

4.2.2 Homonymy and polysemy

Word class flexibility has often been analyzed in terms of homonymy and polysemy (Valera and Ruz, 2020). Homonymy is a relation between lexemes that share the same word form but are not semantically related (Cruse, 1986, p.80). Homonyms may differ in word class, such as ring ‘a small circular band’ and ring ‘make a clear resonant or vibrating sound.’ Polysemy is defined as a relation between different senses of a single lexeme (ibid.). Insofar as the nominal and verbal uses of flexible lexical items are semantically related, one may argue that word class flexibility is similar to polysemy, and must be distinguished from homonymy. In practice, homonymy and polysemy exist on a continuum, so it is difficult to apply a consistent criterion to differentiate them (Tuggy, 1993). As a consequence, we will not attempt to tease homonymy apart from word class flexibility.

Regarding morphology, word class flexibility excludes pairs of lexical items that are related by overt derivational affixes, such as to act/an actor. In such cases, word class alternations can be attributed to the presence of a derivational affix, and are therefore part of regular morphology. In contrast, we allow tokens of flexible lexical items to differ in inflectional morphology.

4.2.3 Directionality of class conversion

Word class flexibility can be analyzed either as a static relation between nominal and verbal uses of a single lexeme, or as a word formation process related to derivation. The merits of each analysis have been extensively debated in the literature on conversion (see e.g., Farell, 2001; Don, 2005). One of the objectives of our study is to show that deep contextualized language models can be used to help resolve this debate. A hallmark of derivational processes is their directionality. Direction of derivation can be established using several synchronic criteria, among which are the principles that a derived form tends to have a lower frequency of use and a smaller range of senses than its base (Marchand, 1964; Iacobini, 2000). In languages where word class flexibility is a derivational process, one should therefore expect greater semantic variation when flexible lemmas are used in their dominant word class—an important issue that we verify with our methodology.

A related phenomenon is the relationship between frequency and polysemy. Higher frequency words tend to have more senses as they are influenced to a greater extent by phonetic reduction and sense extension processes (Zipf, 1949; Fenk-Oczlon et al., 2010). In our work, we compare semantic variation between the noun and verb usages of a word rather than semantic variation across different words; the presence of a similar effect would constitute evidence that word class flexibility is a derivational process.

Semantic variation has been operationalized in several ways. Kisselew et al. (2016) use an entropy-based metric, while Balteiro (2007) and Bram (2011) measure semantic variation by counting the number of different noun and verb senses in a dictionary. The latter study found that the more frequent word class has greater semantic variation at a rate above random chance. Here we propose a novel metric based on contextual word embeddings to compare the amount of semantic variation of flexible lemmas in their dominant and non-dominant grammatical classes. Differing from existing methods, our metric is validated explicitly on human judgements of semantic similarity, and can be applied to many languages without the need for dictionary resources.

4.2.4 Asymmetry in semantic shift

If word class flexibility is a directional process, a natural question is whether derived verbs stand in the same semantic relation to their base as derived nouns. The literature on conversion suggests that there might be significant differences between these two directions of derivation. In English, verbs that are derived from nouns by conversion have been argued to describe events that include the noun’s denotation as a participant (e.g. hammer, ‘to hit something with a hammer’) or as a spatio-temporal circumstance (winter ‘to spend the winter somewhere’). Clark and Clark (1979) argue that the semantic relations between denominal verbs and their base are so varied that they cannot be given a unified description. In comparison, when the base of conversion is a verb, the derived noun most frequently denotes an event of the sort described by the verb (e.g. throw ‘the act of throwing something’), or the result of such an act (e.g. release ‘state of being set free’) (Jespersen, 1942; Marchand, 1969; Cetnarowska, 1993). This has led some authors to suggest that verb to noun conversion in English involves less semantic shift than noun to verb conversion (Bauer, 2005, p.22). Here we consider a new metric of semantic shift based on contextual embeddings, and we use this metric to test the hypothesis that the expected semantic shift involved in word class flexibility is greater for noun dominant lexical items (as compared to verb dominant lexical items) in our sample of languages. As we will show, this proposal is consistent with the empirical observation that verb-to-noun conversion is statistically more salient than noun-to-verb conversion.

4.3 Identification of word class flexibility

4.3.1 Definitions

A lemma is flexible if it can be used both as a noun and as a verb. To reduce noise, we require each lemma to appear at least 10 times and at least 5% of the time as the minority class to be considered flexible. The inflectional paradigm of a lemma is the set of words that have the lemma.

A flexible lemma is noun (verb) dominant if it occurs more often as a noun (verb) than as a verb (noun). This is merely an empirical property of a lemma: we do not claim that the base POS should be determined by frequency. The noun (verb) flexibility of a language is the proportion of noun (verb) dominant lemmas that are flexible.
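The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a list of (lemma, POS) pairs, the 10-occurrence threshold is applied to the combined noun and verb count, and ties in frequency are assigned to the noun class, none of which the text specifies. The function name `classify_lemmas` is hypothetical.

```python
from collections import Counter

MIN_COUNT = 10             # minimum occurrences of a lemma (from the text)
MIN_MINORITY_SHARE = 0.05  # minority class must be at least 5% (from the text)


def classify_lemmas(tagged_tokens):
    """Map each flexible lemma to its dominant class, 'noun' or 'verb'.

    tagged_tokens: iterable of (lemma, pos) pairs; only the tags
    "NOUN" and "VERB" are counted. Non-flexible lemmas are omitted.
    """
    noun_counts, verb_counts = Counter(), Counter()
    for lemma, pos in tagged_tokens:
        if pos == "NOUN":
            noun_counts[lemma] += 1
        elif pos == "VERB":
            verb_counts[lemma] += 1

    dominant = {}
    for lemma in set(noun_counts) | set(verb_counts):
        n, v = noun_counts[lemma], verb_counts[lemma]
        total = n + v
        # Assumption: the frequency threshold applies to noun + verb uses.
        if total < MIN_COUNT:
            continue
        if min(n, v) / total < MIN_MINORITY_SHARE:
            continue
        # Assumption: ties go to 'noun'; the text does not cover ties.
        dominant[lemma] = "noun" if n >= v else "verb"
    return dominant


# Toy corpus: 'work' is flexible, 'table' is almost always a noun,
# and 'run' is too rare to be considered.
tokens = (
    [("work", "NOUN")] * 6 + [("work", "VERB")] * 4
    + [("table", "NOUN")] * 99 + [("table", "VERB")] * 1
    + [("run", "VERB")] * 8 + [("run", "NOUN")] * 1
)
print(classify_lemmas(tokens))
```

Only `work` survives both filters here, since `table` falls below the 5% minority share and `run` below the frequency threshold.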

4.3.2 Datasets and preprocessing

Our experiments require corpora containing part-of-speech annotations. For English, we use the British National Corpus (BNC), consisting of 100M words of written and spoken English from a variety of sources (Leech, 1992). Root lemmas and POS tags are provided, and were generated automatically using the CLAWS4 tagger (Leech et al., 1994). For our experiments, we use BNC-baby, a subset of BNC containing 4M words.

For other languages, we use the Universal Dependencies (UD) treebanks of over 70 languages, annotated with lemmatizations, POS tags, and dependency information (Zeman et al., 2019). We concatenate the treebanks for each language and use the languages that have at least 100k tokens.

The UD treebanks are too small for our contextualized experiments and are not matched for content and style, so we supplement them with Wikipedia text, extracted from Wikimedia dumps using Wikiextractor. For each language, we randomly sample 10M tokens from Wikipedia; we then use UDPipe 1.2 (Straka and Straková, 2017) to tokenize the text and generate POS tags for every token. We do not use the lemmas provided by UDPipe, but instead use the lemma merging algorithm to group lemmas.

4.3.3 Lemma merging algorithm

Language Nouns Verbs Noun flexibility Verb flexibility
Arabic 1517 299 0.076 0.221
Bulgarian 786 343 0.039 0.047
Catalan 1680 590 0.039 0.147
Chinese 1325 634 0.125 0.391
Croatian 1031 370 0.042 0.062
Danish 324 216 0.108 0.269
Dutch 958 441 0.077 0.188
English 1700 600 0.248 0.472
Estonian 1949 592 0.032 0.115
Finnish 1523 631 0.028 0.136
French 1844 649 0.062 0.257
Galician 802 334 0.031 0.135
German 4239 1706 0.049 0.229
Hebrew 850 315 0.111 0.321
Indonesian 572 243 0.052 0.128
Italian 2227 770 0.067 0.256
Japanese 1105 417 0.178 0.566
Korean 1890 1003 0.026 0.048
Latin 1090 885 0.056 0.122
Norwegian 1951 636 0.072 0.259
Old Russian 527 416 0.034 0.060
Polish 2054 1084 0.069 0.427
Portuguese 1711 638 0.037 0.185
Romanian 1809 740 0.060 0.151
Slovenian 746 316 0.068 0.123
Spanish 2637 873 0.046 0.202
Swedish 784 384 0.038 0.109
Excluded Languages
Ancient Greek 1098 1022 0.015 0.026
Basque 650 247 0.020 0.105
Czech 5468 2063 0.004 0.011
Hindi 1364 133 0.019 0.135
Latvian 1159 603 0.022 0.061
Persian 1125 47 0.010 0.234
Russian 3909 1760 0.005 0.024
Slovak 488 281 0.006 0.011
Ukrainian 659 238 0.006 0.029
Urdu 722 51 0.018 0.216
Table 4.1: Noun and verb flexibility for 37 languages with at least 100k tokens in the UD corpus. We include the 27 languages with over 2.5% noun and verb flexibility; 10 languages are excluded from further analysis.

The UD corpus provides lemma annotations for each word, but these lemmas are insufficient for our purposes because they do not always capture instances of flexibility. In some languages, nouns and verbs are lemmatized to different forms by convention. For example, in French, the word voyage can be used as a verb (il voyage ‘he travels’) or as a noun (un voyage ‘a trip’). However, verbs are lemmatized to the infinitive voyager, whereas nouns are lemmatized to the singular form voyage. Since the noun and verb lemmas are different, it is not easy to identify them as having the same stem.

The different lemmatization conventions of French and English reflect a more substantial linguistic difference. French has a stem-based morphology, in which stems tend to occur with an inflectional ending. By contrast, English has a word-based morphology, where stems are commonly used as free forms (Kastovsky, 2006). This difference is relevant to the definition of word class flexibility: in stem-based systems, flexible items are stems that may not be attested as free forms (Bauer and Valera, 2005b, p.14).

We propose a heuristic algorithm to capture stem-based flexibility as well as word-based flexibility. The key observation is that the inflectional paradigms of the noun and verb forms often have some words in common (such is the case for voyager). Thus, we merge any two lemmas whose inflectional paradigms have a nonempty intersection. This is implemented with a single pass through the corpus, using the union-find data structure: for every word, we call UNION on the inflected form and the lemmatized form.

Using this heuristic, we can identify cases of flexibility that do not share the same lemma in the UD corpus (Table 4.1). This method is not perfect: it is unable to identify cases of stem-based flexibility where the inflectional paradigms do not intersect. For example, in French, chant ‘song’ and chants ‘songs’ are not valid inflections of the verb chanter ‘to sing’. There are also false positives, where two unrelated lemmas are merged because their inflectional paradigms intersect: for example, avions (plural form of avion ‘airplane’) happens to have the same form as avions (first person plural imperfect form of avoir ‘to have’).
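The merging heuristic can be sketched with a small union-find structure. This is a minimal illustration rather than the implementation used in the experiments: the class and function names are hypothetical, the corpus is assumed to be a list of (inflected form, lemma) pairs, and word forms and lemmas are assumed to share one string namespace so that a form identical to a lemma links the two groups.

```python
class UnionFind:
    """Disjoint-set forest over strings, with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def merge_lemmas(corpus):
    """Single pass over (inflected_form, lemma) pairs.

    Returns a dict mapping each lemma to a representative of its group;
    two lemmas share a representative if their paradigms are linked.
    """
    uf = UnionFind()
    lemmas = set()
    for form, lemma in corpus:
        uf.union(form, lemma)
        lemmas.add(lemma)
    return {lemma: uf.find(lemma) for lemma in lemmas}


# Toy French corpus: 'voyage' occurs as a verb form (lemma 'voyager')
# and as a noun form (lemma 'voyage'), so the two lemmas are merged.
corpus = [
    ("voyage", "voyager"),
    ("voyage", "voyage"),
    ("voyages", "voyage"),
    ("voyageons", "voyager"),
    ("chant", "chant"),
    ("chante", "chanter"),
]
groups = merge_lemmas(corpus)
print(groups["voyage"] == groups["voyager"])  # merged via the shared form
print(groups["chant"] == groups["chanter"])   # no shared form in this corpus
```

The same mechanism produces the false positives noted above: a single shared form such as avions is enough to link two otherwise unrelated lemmas.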

4.4 Methodology and evaluation

4.4.1 Probing test of contextualized model

We now apply contextual word embeddings from the language models ELMo, BERT, mBERT, and XLM-R (described in Sections 2.3 and 2.4) to word class flexibility. Contextual embeddings can capture a variety of information other than semantics, such as the lexicographic form of a word or its syntactic position, which can introduce noise into our results. In order to compare different contextual language models on how well they capture semantic information, we perform a probing test of how accurately the models capture human judgements of word sense similarity.

Word Noun Verb Sim Word Noun Verb Sim Word Noun Verb Sim
aim 137 98 2.0 change 889 858 1.6 force 470 188 0.8
answer 480 335 2.0 claim 222 239 1.6 grant 108 87 0.8
attempt 302 214 2.0 cut 92 488 1.6 note 287 361 0.8
care 403 249 2.0 demand 169 142 1.6 sense 536 88 0.8
control 519 179 2.0 design 246 153 1.6 tear 124 89 0.8
cost 234 192 2.0 experience 522 150 1.6 account 337 122 0.6
count 143 220 2.0 hope 114 571 1.6 act 644 268 0.6
damage 270 82 2.0 increase 252 399 1.6 back 764 88 0.6
dance 81 97 2.0 judge 80 96 1.6 face 1185 281 0.6
doubt 261 132 2.0 limit 125 134 1.6 hold 130 1251 0.6
drink 456 315 2.0 load 230 87 1.6 land 393 123 0.6
end 1171 244 2.0 offer 93 489 1.6 lift 100 165 0.6
escape 95 111 2.0 rise 164 283 1.6 matter 572 294 0.6
estimate 96 118 2.0 smoke 128 100 1.6 order 841 133 0.6
fear 209 99 2.0 start 159 1269 1.6 place 1643 341 0.6
glance 101 161 2.0 step 401 167 1.6 press 130 188 0.6
help 200 897 2.0 study 1037 211 1.6 roll 135 201 0.6
influence 204 150 2.0 support 290 292 1.6 sort 1613 216 0.6
lack 194 107 2.0 trust 90 126 1.6 fire 444 89 0.4
link 147 176 2.0 waste 103 98 1.6 form 1272 354 0.4
love 495 573 2.0 work 1665 1593 1.6 notice 115 387 0.4
move 131 1272 2.0 base 109 378 1.4 play 185 1093 0.4
name 960 112 2.0 cover 137 399 1.4 turn 226 1566 0.4
need 587 2350 2.0 plant 591 82 1.4 wave 402 120 0.4
phone 382 238 2.0 run 152 999 1.4 cross 102 215 0.2
plan 321 161 2.0 stress 159 106 1.4 deal 191 315 0.2
question 1285 96 2.0 approach 409 175 1.2 hand 1765 127 0.2
rain 182 92 2.0 cause 237 530 1.2 present 219 353 0.2
result 752 206 2.0 match 110 123 1.2 set 387 652 0.2
return 138 441 2.0 miss 320 410 1.2 share 104 232 0.2
search 215 163 2.0 process 720 91 1.2 sign 284 121 0.2
sleep 171 291 2.0 shift 96 104 1.2 suit 162 108 0.2
smell 141 149 2.0 show 132 1843 1.2 wind 189 82 0.2
smile 211 422 2.0 sound 313 496 1.2 address 257 148 0.0
talk 119 1302 2.0 dress 191 196 1.0 bear 110 394 0.0
use 791 2801 2.0 lead 107 716 1.0 head 1355 96 0.0
view 811 102 2.0 light 669 124 1.0 mind 736 620 0.0
visit 136 203 2.0 look 699 5893 1.0 park 179 105 0.0
vote 124 93 2.0 mark 562 198 1.0 point 1534 469 0.0
walk 144 914 2.0 measure 226 223 1.0 ring 185 387 0.0
dream 254 107 1.8 rest 414 132 1.0 square 225 82 0.0
record 1057 276 1.8 tie 82 112 1.0 state 471 156 0.0
report 313 331 1.8 break 117 519 0.8 stick 109 294 0.0
test 273 126 1.8 charge 392 115 0.8 store 95 158 0.0
touch 145 271 1.8 drive 88 476 0.8 train 224 94 0.0
call 209 1558 1.6 focus 92 168 0.8 watch 119 940 0.0
Table 4.2: 138 flexible words in English (top in BNC corpus) and human similarity scores, average of 5 ratings.

We begin with a list of the 138 most frequent flexible words in the BNC corpus. Some of these words are flexible (e.g., work), while others are homonyms (e.g., bear). For each lemma, we get five human annotators from Mechanical Turk to make a sentence using the word as a noun, then make a sentence using the word as a verb, then rate the similarity of the noun and verb senses on a scale from 0 to 2. The sentences are used for quality assurance, so that ratings are removed if the sentences are nonsensical. We will call the average human rating for each word the human similarity score; Table 4.2 shows the average ratings for all 138 words.

Next, we evaluate each layer of ELMo, BERT, mBERT, and XLM-R (using the models ‘bert-base-uncased’, ‘bert-base-multilingual-cased’, and ‘xlm-roberta-base’ from Wolf et al. (2019)) on correlation with the human similarity score. That is, we compute the mean of the contextual vectors for all noun instances of the given word in the BNC corpus, the mean across all verb instances, then compute the cosine distance between the two mean vectors as the model’s similarity score. Finally, we evaluate the Spearman correlation of the human and model’s similarity scores for 138 words: this score measures the model’s ability to gauge the level of semantic similarity between noun and verb senses, compared to human judgements.
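The evaluation procedure can be sketched as follows, using plain Python lists in place of real model embeddings. Two points are assumptions made for illustration: the sketch reports cosine similarity (one minus the cosine distance named in the text) so that higher model scores mean more similar senses, and Spearman correlation is computed as the Pearson correlation of average ranks. All function names and the toy vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def model_similarity(noun_vectors, verb_vectors):
    """Similarity between the mean noun vector and the mean verb vector."""
    return cosine_similarity(mean_vector(noun_vectors), mean_vector(verb_vectors))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy example: three words with made-up 2-d "embeddings" and human scores.
embeddings = {
    "work": ([[1.0, 0.1], [0.9, 0.0]], [[1.0, 0.0], [0.8, 0.1]]),
    "light": ([[1.0, 0.0]], [[0.6, 0.8]]),
    "ring": ([[1.0, 0.0]], [[0.0, 1.0]]),
}
human = {"work": 2.0, "light": 1.0, "ring": 0.0}
words = sorted(embeddings)
model_scores = [model_similarity(*embeddings[w]) for w in words]
human_scores = [human[w] for w in words]
print(spearman(human_scores, model_scores))
```

In the experiments this correlation is computed once per layer of each model, over all 138 words.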

For a baseline, we do the same procedure using non-contextual GloVe embeddings (Pennington et al., 2014). Note that while all instances of the same word have a static embedding, different words that share the same lemma still have different embeddings (e.g., work and works), so that the baseline is not trivial.

Figure 4.1: Spearman correlations between human and model similarity scores for ELMo, BERT, mBERT, and XLM-R. The dashed line is the baseline using static GloVe embeddings.

The correlations are shown in Figure 4.1. BERT and mBERT are better than ELMo and XLM-R at capturing semantic information. In all transformer models, the correlation increases for each layer up until layer 4 or so, and after this point, the performance neither improves nor degrades in higher layers. Thus, unless otherwise noted, we use the final layers of each model for downstream tasks.

Figure 4.2: PCA plot of BERT embeddings for the lemmas “work” (high similarity between noun and verb senses) and “ring” (low similarity).

Figure 4.2 illustrates the contextual distributions for two lemmas on the opposite ends of the noun-verb similarity spectrum: work (human similarity score: 2) and ring (human similarity score: 0). We apply PCA to the BERT embeddings of all instances of each lemma in the BNC corpus. For work, the noun and verb senses are very similar and the distributions have high overlap. In contrast, for ring, the most common noun sense (‘a circular object’) is etymologically and semantically unrelated to the most common verb sense (‘to produce a resonant sound’), and accordingly, their distributions have very little overlap.

Language NV shift VN shift Noun variation Verb variation Majority variation Minority variation
Arabic 0.098 0.109 8.268 8.672 8.762 8.178
Bulgarian 0.146 0.136 8.267 8.409 8.334 8.341
Catalan 0.165 0.169 8.165 8.799 8.720 8.244
Chinese 0.072 0.070 7.024 7.212 7.170 7.067
Croatian 0.093 0.144 8.149 8.109 8.219 8.037
Danish 0.103 0.110 8.245 8.338 8.438 8.146
Dutch 0.146 0.174 7.716 8.786 8.354 8.148
English 0.175 0.160 8.035 8.624 8.390 8.268
Estonian 0.105 0.103 7.800 7.902 8.022 7.679
Finnish 0.100 0.114 7.972 7.854 8.181 7.644
French 0.212 0.204 8.189 9.472 9.082 8.578
Galician 0.111 0.117 7.922 8.340 8.137 8.127
German 0.382 0.355 8.078 9.758 9.096 8.740
Hebrew 0.121 0.130 8.096 9.116 8.574 8.638
Indonesian 0.034 0.048 7.100 7.076 7.076 7.101
Italian 0.207 0.184 8.520 9.345 9.149 8.716
Japanese 0.061 0.057 7.419 7.173 7.309 7.283
Latin 0.092 0.139 7.920 7.710 7.905 7.724
Norwegian 0.133 0.132 8.112 8.336 8.332 8.116
Polish 0.090 0.080 8.318 8.751 8.670 8.399
Portuguese 0.186 0.155 7.907 8.921 8.642 8.187
Romanian 0.175 0.145 8.682 8.658 8.934 8.406
Slovenian 0.093 0.113 8.046 7.983 8.177 7.853
Spanish 0.235 0.214 7.898 8.961 8.691 8.168
Swedish 0.088 0.082 8.262 8.147 8.328 8.081
Overall 1 of 3 2 of 3 3 of 17 14 of 17 20 of 20 0 of 20
Table 4.3: Semantic metrics for 25 languages, computed using mBERT and 10M tokens of Wikipedia text for each language. Asterisks denote significance at , , . For the “Overall” row, we count the languages with a significant tendency towards one direction, out of the number of languages with statistical significance towards either direction (with treated as significant).

4.4.2 Three contextual metrics

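We define three metrics based on contextual embeddings to measure various semantic aspects of word class flexibility. We start by generating contextual embeddings for each occurrence of every flexible lemma. For each lemma $l$, let $E_{n,l}$ and $E_{v,l}$ be the sets of contextual embeddings for noun and verb instances of $l$. We define the prototype noun vector $p_{n,l}$ of a lemma $l$ as the mean of embeddings across noun instances, and the noun variation $V_{n,l}$ as the mean Euclidean distance from each noun instance to the noun vector:

$$p_{n,l} = \frac{1}{|E_{n,l}|} \sum_{x \in E_{n,l}} x, \qquad V_{n,l} = \frac{1}{|E_{n,l}|} \sum_{x \in E_{n,l}} \lVert x - p_{n,l} \rVert$$

The prototype verb vector $p_{v,l}$ and verb variation $V_{v,l}$ for a lemma $l$ are defined similarly:

$$p_{v,l} = \frac{1}{|E_{v,l}|} \sum_{x \in E_{v,l}} x, \qquad V_{v,l} = \frac{1}{|E_{v,l}|} \sum_{x \in E_{v,l}} \lVert x - p_{v,l} \rVert$$

Lemmas are included if they appear at least 30 times as nouns and 30 times as verbs. To avoid biasing the variation metric towards the majority class, we downsample the majority class to be of equal size as the minority class before computing the variation. The method does not filter out pairs of lemmas that are arguably homonyms rather than flexible (section 4.2.2); we choose to include all of these instances rather than set an arbitrary cutoff threshold.

We now define language-level metrics to measure the asymmetries hypothesized in sections 4.2.3 and 4.2.4. The noun-to-verb shift (NVS) is the average cosine distance between the prototype noun and verb vectors for noun dominant lemmas, and the verb-to-noun shift (VNS) likewise for verb dominant lemmas:

$$\mathrm{NVS} = \frac{1}{|L_n|} \sum_{l \in L_n} d_{\cos}(p_{n,l}, p_{v,l}), \qquad \mathrm{VNS} = \frac{1}{|L_v|} \sum_{l \in L_v} d_{\cos}(p_{n,l}, p_{v,l})$$

where $L_n$ and $L_v$ are the sets of noun dominant and verb dominant lemmas, and $d_{\cos}(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$ is the cosine distance.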
We define the noun (verb) variation of a language as the average of noun (verb) variations across all lemmas. Finally, define the majority variation of a language as the average of the variation of the dominant POS class, and the minority variation as the average variation of the smaller POS class, across all lemmas.
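The metrics in this section can be sketched in plain Python, with lists of floats standing in for contextual embeddings. Several details are assumptions made for illustration only: shift is computed from prototypes over all instances (the text specifies downsampling only for variation), dominance is decided by the number of embedded instances with ties counted as noun dominant, and the function names, the `seed` parameter, and the dictionary keys are hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def variation(vectors):
    """Mean Euclidean distance from each vector to the prototype (mean) vector."""
    prototype = mean_vector(vectors)
    return sum(euclidean(v, prototype) for v in vectors) / len(vectors)


def _average(values):
    return sum(values) / len(values) if values else float("nan")


def language_metrics(lemmas, min_instances=30, seed=0):
    """lemmas: dict mapping lemma -> (noun_vectors, verb_vectors)."""
    rng = random.Random(seed)
    nv_shift, vn_shift = [], []
    noun_var, verb_var, majority_var, minority_var = [], [], [], []
    for noun_vecs, verb_vecs in lemmas.values():
        if min(len(noun_vecs), len(verb_vecs)) < min_instances:
            continue
        # Assumption: ties in frequency count as noun dominant.
        noun_dominant = len(noun_vecs) >= len(verb_vecs)
        # Assumption: shift uses prototypes computed over all instances.
        shift = cosine_distance(mean_vector(noun_vecs), mean_vector(verb_vecs))
        (nv_shift if noun_dominant else vn_shift).append(shift)
        # Downsample the majority class to the minority size for variation.
        size = min(len(noun_vecs), len(verb_vecs))
        n_var = variation(rng.sample(noun_vecs, size))
        v_var = variation(rng.sample(verb_vecs, size))
        noun_var.append(n_var)
        verb_var.append(v_var)
        majority_var.append(n_var if noun_dominant else v_var)
        minority_var.append(v_var if noun_dominant else n_var)
    return {
        "nv_shift": _average(nv_shift),
        "vn_shift": _average(vn_shift),
        "noun_variation": _average(noun_var),
        "verb_variation": _average(verb_var),
        "majority_variation": _average(majority_var),
        "minority_variation": _average(minority_var),
    }
```

A call such as `language_metrics(embeddings_by_lemma)` would return the six language-level values for one language; passing a smaller `min_instances` is convenient when experimenting on toy data.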

Dataset Model NV shift VN shift Noun variation Verb variation Majority variation Minority variation
BNC ELMo 0.389 0.357 20.261 20.455 20.329 20.388
BNC BERT 0.122 0.112 9.015 9.074 9.100 8.989
BNC mBERT 0.189 0.169 7.211 8.401 7.875 7.717
BNC XLM-R 0.004 0.005 2.058 2.374 2.262 2.170
Wikipedia ELMo 0.339 0.330 22.556 22.521 22.463 22.614
Wikipedia BERT 0.120 0.100 9.218 8.944 9.118 9.044
Wikipedia mBERT 0.175 0.160 8.035 8.624 8.390 8.268
Wikipedia XLM-R 0.004 0.003 1.966 1.954 1.946 1.974
Table 4.4: Comparison of semantic models on BNC and Wikipedia datasets (English), computed using several different language models. Asterisks denote significance at , , .

4.5 Results

4.5.1 Identifying flexible lemmas

Of the 37 languages in UD with at least 100k tokens, 27 have at least 2.5% of verb and noun lemmas that are flexible, which we take to indicate that word class flexibility exists in the language (Table 4.1). The lemma merging algorithm is crucial for identifying word class flexibility: only 6 of the 37 languages pass the 2.5% flexibility threshold using the default lemma annotations provided in UD [3]. Languages differ in their prevalence of word class flexibility, but every language in our sample has higher verb flexibility than noun flexibility.

[3] Chinese, Danish, English, Hebrew, Indonesian, and Japanese pass the flexibility threshold without the lemma merging algorithm.

4.5.2 Asymmetry in semantic metrics

Table 4.3 shows the values of the three metrics, computed using mBERT and Wikipedia data for 25 languages [4]. For testing significance, we use the unpaired Student’s t-test to compare N-V versus V-N shift, and the paired Student’s t-test for the other two metrics [5].

[4] We exclude 2 of the 27 languages in which we identify word class flexibility. Old Russian is excluded because it is not supported by mBERT; Korean is excluded because its lemma annotations deviate from the standard UD format.

[5] We do not apply the Bonferroni correction for multiple comparisons, because we make claims about trends across all languages, and not about any specific language.

The key findings are as follows:

  1. Asymmetry in semantic shift. In English, N-V shift is greater than V-N shift, in agreement with Bauer (2005). However, this pattern does not hold in general: there is no significant difference in either direction in most languages, and two languages exhibit a difference in the opposite direction from English.

  2. Asymmetry in semantic variation between noun and verb usages. Of the 17 languages with a statistically significant difference in noun versus verb variation, 14 of them have greater verb variation than noun variation.

  3. Asymmetry in semantic variation between majority and minority classes. All of the 20 languages with a statistically significant difference in majority and minority variation have greater majority variation.
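The two significance tests named before the list of findings can be sketched with the standard library alone. The code below computes only the t statistics; obtaining p-values additionally requires the Student t distribution (for example via `scipy.stats.ttest_ind` and `scipy.stats.ttest_rel`). The function names are illustrative. A plausible reading of the test choice, stated here as an assumption, is that the shift metrics compare two disjoint sets of lemmas (hence unpaired), while the variation metrics compare two quantities measured on the same lemma (hence paired).

```python
import math
from statistics import mean, variance


def unpaired_t(a, b):
    """Student's t statistic for two independent samples (pooled variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def paired_t(a, b):
    """Student's t statistic for paired samples (one difference per lemma)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / math.sqrt(variance(diffs) / len(diffs))
```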

4.5.3 Model robustness

Next, we assess the robustness of our metrics with respect to choices of corpus and language model. Robustness is desirable because it gives confidence that our models capture true linguistic tendencies, rather than artifacts of our datasets or the models themselves. We compute the three semantic metrics on the BNC and Wikipedia datasets, using all 4 contextual language models: ELMo, BERT, mBERT, and XLM-R. Table 4.4 summarizes the results from this experiment.

We find that in almost every case where there is a statistically significant difference, all models agree on the direction of the difference. One exception is that noun variation is greater when computed using Wikipedia data than when using the BNC corpus. Wikipedia has many instances of nouns used in technical senses (e.g., ring is a technical term in mathematics and chemistry), whereas similar nonfiction text is less common in the BNC corpus.

4.6 Discussion

4.6.1 Frequency asymmetry

Every language in our sample has verb flexibility greater than noun flexibility. The reasons for this asymmetry are unclear, but may be due to semantic differences between nouns and verbs. We note that every language in our sample has more noun lemmas than verb lemmas, a pattern that was also attested by Polinsky (2012), although this does not provide an explanation of the observed phenomenon. We leave further exploration of the flexibility asymmetry to future work.

4.6.2 Implications for theories of flexibility

There is a strong cross-linguistic tendency for the majority word class of a flexible lemma to exhibit more semantic variation than the minority class. In other words, the frequency and semantic variation criteria of determining the base of a conversion pair agree more than at chance. This supports the analysis of word class flexibility as a directional process of conversion, as opposed to underspecification (section 4.2.3) [6].

Within a flexible lemma, verbs exhibit more semantic variation than nouns. It is attested across many languages that nouns are more physically salient, while verbs have more complex event and argument structure, and are harder for children to acquire than nouns (Gentner, 1982; Imai et al., 2008). Thus, verbs are expected to have greater semantic variation than nouns, which our results confirm. More importantly, for our purposes, this metric serves as a control for the previous metric. Flexible lemmas are more likely to be noun-dominant than verb-dominant, so could the majority and minority variation simply be proxies for noun and verb variation, respectively? In fact, we observe greater verb than noun variation, so this cannot be the case.

[6] Since 18 of the 25 languages for which semantic metrics were calculated are Indo-European, it is unclear whether these results generalize to non-Indo-European languages.

Finally, as suggested by Bauer (2005), we find evidence in English that N-V flexibility involves more semantic shift than V-N flexibility, and the pattern is consistent across multiple models and datasets (Table 4.4). However, this pattern is idiosyncratic to English and not a cross-linguistic tendency. It is thus instructive to analyze multiple languages in studying word class flexibility, as one can easily be misled by English-based analyses.

4.7 Conclusion

We used contextual language models to examine shared tendencies in word class flexibility across languages. We found that the majority class often exhibits more semantic variation than the minority class, supporting the view that word class flexibility is a directional process. We also found that in English, noun-to-verb flexibility is associated with more semantic shift than verb-to-noun flexibility, but this is not the case for most languages.

Our probing task revealed that the upper layers of BERT contextual embeddings best reflect human judgment of semantic similarity. We obtained similar results in different datasets and language models in English that support the robustness of our method. In this chapter, we demonstrated the utility of deep contextualized models in linguistic typology, especially for characterizing cross-linguistic semantic phenomena that are otherwise difficult to quantify. The next two chapters will present our work using linguistic theory and experimental data to deepen our understanding of language models.

5.1 Introduction

The previous chapter used contextual embeddings to measure the semantic distance between flexible words used in different contexts. One limitation of this method is that different linguistic properties – morphology, syntax, semantics, and pragmatics – are conflated into a single metric. From contextual embeddings, we cannot easily determine whether the difference between two words is syntactic (e.g., walk and walks) or semantic (e.g., walk and run).

Figure 5.1: Example sentence with a morphosyntactic anomaly (left) and semantic anomaly (right) (anomalies in bold). Darker colours indicate higher surprisal. We investigate several patterns: first, surprisal at lower layers corresponds to infrequent tokens, but this effect diminishes towards upper layers. Second, morphosyntactic violations begin to trigger high surprisals at an earlier layer than semantic violations.

In this chapter, we investigate how Transformer-based language models respond to sentences containing three different types of anomalies: morphosyntactic, semantic, and commonsense. Previous work using behavioural probing found that Transformer LMs have remarkable ability in detecting when a word is anomalous in context, by assigning a higher likelihood to the appropriate word than an inappropriate one (Gulordava et al., 2018; Ettinger, 2020; Warstadt et al., 2020a). The likelihood score, however, only gives a scalar value of the degree to which a word is anomalous in context, and cannot distinguish between different ways that a word might be anomalous.

It has been proposed that there are different types of linguistic anomalies. Chomsky (1957) distinguished semantic anomalies (“colorless green ideas sleep furiously”) from ungrammaticality (“furiously sleep ideas green colorless”). Psycholinguistic studies initially suggested that different event-related potentials (ERPs) are produced in the brain depending on the type of anomaly; e.g., semantic anomalies produce negative ERPs 400 ms after the stimulus, while syntactic anomalies produce positive ERPs 600 ms after (Kutas et al., 2006). Here, we ask whether Transformer LMs show different surprisals in their intermediate layers depending on the type of anomaly. However, LMs do not compute likelihoods at intermediate layers – only at the final layer.

We introduce a new tool to probe for surprisal at intermediate layers of BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and XLNet (Yang et al., 2019), formulating the problem as density estimation. We train Gaussian models to fit distributions of embeddings at each layer of the LMs. Using BLiMP (Warstadt et al., 2020a) for evaluation, we show that this model is effective at grammaticality judgement, requiring only a small amount of in-domain text for training. Figure 5.1 shows the method using the RoBERTa model on two example sentences.

We apply our model to test sentences drawn from BLiMP and 7 psycholinguistics studies, exhibiting morphosyntactic, semantic, and commonsense anomalies. We find that morphosyntactic anomalies produce out-of-domain embeddings at earlier layers, semantic anomalies at later layers, and commonsense anomalies not at any layer, even though the LM’s final accuracy is similar. We show that LMs are internally sensitive to the type of linguistic anomaly, which is not apparent if we only had access to their softmax probability outputs. Our source code and data are available at:

5.2 Related work

Our work builds on earlier work on probing LM representations (Section 3.3). Previous work found differences in the linguistic knowledge contained in different layers (Tenney et al., 2019a; Kelly et al., 2020; Hewitt and Manning, 2019); we focus on the effects of anomalous inputs on different layers. Behavioural probes (Section 3.2) often used anomalous sentences paired with correct sentences to test LMs’ sensitivity to linguistic phenomena (Linzen et al., 2016; Gulordava et al., 2018; Warstadt et al., 2020a; Hu et al., 2020); in this work, we extend these tests to probe the sensitivity of internal layer representations to anomalies rather than the model’s output.

Most grammaticality studies focused on syntactic phenomena, since they are the easiest to generate using templates, although some studies considered semantic phenomena. Examples of semantic tests include Rabinovich et al. (2019), who tested LMs’ sensitivity to semantic infelicities involving indefinite pronouns, and Ettinger (2020), who used data from three psycholinguistic studies to probe BERT’s knowledge of commonsense and negation. Another type of linguistic unacceptability is selectional restrictions, defined as a semantic mismatch between a verb and an argument. Sasano and Korhonen (2020) examined the geometry of word classes (e.g., words that can be a direct object of the verb ‘play’) in word vector models; they compared single-class models against discriminative models for learning word class boundaries. Chersoni et al. (2018) tested distributional semantic models on their ability to identify selectional restriction violations using stimuli from two psycholinguistic datasets. Finally, Metheniti et al. (2020) tested how much BERT relies on selectional restriction information versus other contextual information for making masked word predictions. Our work combines these different types of unacceptability into a single test suite to facilitate comparison.

5.3 Model

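We use the transformer language model as a contextual embedding extractor (we write this as BERT for convenience). Let $L$ be the layer index, which ranges from 0 to 12 on all of our models. Using a training corpus $\{w_1, \dots, w_T\}$, we extract contextual embeddings at layer $L$ for each token:

$$x_1^{(L)}, \dots, x_T^{(L)} = \mathrm{BERT}_L(w_1, \dots, w_T) \tag{5.1}$$

Next, we fit a multivariate Gaussian on the extracted embeddings:

$$x_1^{(L)}, \dots, x_T^{(L)} \sim \mathcal{N}(\hat{\mu}_L, \hat{\Sigma}_L) \tag{5.2}$$

For evaluating the layerwise surprisal of a new sentence $s = [t_1, \dots, t_n]$, we similarly extract contextual embeddings using the language model:

$$y_1, \dots, y_n = \mathrm{BERT}_L(t_1, \dots, t_n) \tag{5.3}$$

The surprisal of each token is the negative log likelihood of the contextual vector according to the multivariate Gaussian:

$$G_i = -\log p(y_i \mid \hat{\mu}_L, \hat{\Sigma}_L), \qquad i = 1, \dots, n \tag{5.4}$$

Finally, we define the surprisal of sentence $s$ as the sum of surprisals of all of its tokens, which is also the negative joint log likelihood of all of the embeddings:

$$\mathrm{surprisal}_L(t_1, \dots, t_n) = \sum_{i=1}^{n} G_i = -\log p(y_1, \dots, y_n \mid \hat{\mu}_L, \hat{\Sigma}_L) \tag{5.5}$$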
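The model of this section can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the released implementation: the embeddings are taken as given arrays (one row per token, already extracted from a chosen layer), the function names are invented here, and the covariance is the maximum-likelihood estimate without any regularization.

```python
import numpy as np


def fit_gaussian(train_emb):
    """Maximum-likelihood mean and full covariance of training embeddings."""
    X = np.asarray(train_emb, dtype=float)
    mu = X.mean(axis=0)
    sigma = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    return mu, sigma


def token_surprisals(emb, mu, sigma):
    """Negative Gaussian log likelihood of each row of `emb`."""
    Y = np.asarray(emb, dtype=float) - mu
    d = mu.shape[0]
    _, logdet = np.linalg.slogdet(sigma)
    # Squared Mahalanobis distance of every row, without forming the inverse.
    maha_sq = np.einsum("ij,ij->i", Y, np.linalg.solve(sigma, Y.T).T)
    return 0.5 * (maha_sq + d * np.log(2 * np.pi) + logdet)


def sentence_surprisal(emb, mu, sigma):
    """Sum of token surprisals for one sentence."""
    return float(token_surprisals(emb, mu, sigma).sum())
```

In use, `fit_gaussian` would be called once per layer on the training embeddings, and `sentence_surprisal` on the embeddings of each test sentence at the same layer.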


5.3.1 Connection to Mahalanobis distance

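The theoretical motivation for using the sum of log likelihoods is that when we fit a Gaussian model with full covariance matrix, low likelihood corresponds exactly to high Mahalanobis distance from the in-distribution points. For a token embedding $y_i$ and fitted parameters $\hat{\mu}_L$ and $\hat{\Sigma}_L$, the score given by the Gaussian model is:

$$G_i = -\log \left( \frac{1}{(2\pi)^{D/2} \, |\hat{\Sigma}_L|^{1/2}} \exp\left( -\tfrac{1}{2} d_i^2 \right) \right)$$

where $D$ is the dimension of the vector space, and $d_i$ is the Mahalanobis distance:

$$d_i = \sqrt{(y_i - \hat{\mu}_L)^\top \hat{\Sigma}_L^{-1} (y_i - \hat{\mu}_L)}$$

Rearranging, we get:

$$d_i^2 = 2 G_i - D \log(2\pi) - \log |\hat{\Sigma}_L|$$

thus the negative log likelihood is half the squared Mahalanobis distance plus a constant.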
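The relation between the Gaussian negative log likelihood and the Mahalanobis distance can be checked numerically. The sketch below evaluates the density directly from its textbook formula and compares the result with the squared distance; the function names and the toy three-dimensional example in the usage note are illustrative only.

```python
import numpy as np


def mahalanobis_sq(y, mu, sigma):
    """Squared Mahalanobis distance of y from a Gaussian (mu, sigma)."""
    diff = np.asarray(y, dtype=float) - mu
    return float(diff @ np.linalg.solve(sigma, diff))


def gaussian_nll(y, mu, sigma):
    """Negative log of the multivariate Gaussian density, computed directly."""
    d = len(mu)
    diff = np.asarray(y, dtype=float) - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    density = np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm
    return float(-np.log(density))


def distance_from_nll(nll, d, sigma):
    """Recover the squared distance: 2 * NLL - D * log(2 * pi) - log|sigma|."""
    _, logdet = np.linalg.slogdet(sigma)
    return 2 * nll - d * np.log(2 * np.pi) - logdet
```

For instance, with a zero mean, covariance `np.diag([1.0, 4.0, 9.0])`, and the point `[1.0, 2.0, 3.0]`, both routes give a squared distance of 3. Evaluating the density directly is only practical in low dimensions; in 768 dimensions the log-determinant form is the numerically safer choice.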

Various methods based on Mahalanobis distance have been used for anomaly detection in neural networks; for example, Lee et al. (2018) proposed a similar method for out-of-domain detection in neural classification models, and Cao et al. (2020) found the Mahalanobis distance method to be competitive with more sophisticated methods on medical out-of-domain detection. In Transformer models, Podolskiy et al. (2021) used Mahalanobis distance for out-of-domain detection, outperforming methods based on softmax probability and likelihood ratios.

Gaussian assumptions.

Our model assumes that the embeddings at every layer follow a multivariate Gaussian distribution. Since the Gaussian distribution is the maximum entropy distribution given a mean and covariance matrix, it makes the fewest assumptions and is therefore a reasonable default.

Hennigen et al. (2020) found that embeddings sometimes do not follow a Gaussian distribution, but it is unclear what alternative distribution would be a better fit, so we will assume a Gaussian distribution in this work.

5.3.2 Training and evaluation

Figure 5.2: BLiMP accuracy for different amounts of training data and across layers, for three LMs. About 1000 sentences are needed before a plateau is reached (mean tokens per sentence = 15.1).

For all of our experiments, we use the ‘base’ versions of pretrained language models BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and XLNet (Yang et al., 2019), provided by HuggingFace (Wolf et al., 2019). Each of these models has 12 contextual layers plus a static embedding layer (layer 0), and each layer is 768-dimensional.

We train the Gaussian model on randomly selected sentences from the British National Corpus (Leech, 1992), representative of acceptable English text from various genres. We evaluate on BLiMP (Warstadt et al., 2020a), a dataset of 67k minimal sentence pairs that test acceptability judgements across a variety of syntactic and semantic phenomena. In our case, a sentence pair is considered correct if the sentence-level surprisal of the unacceptable sentence is higher than that of the acceptable sentence.
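The pairwise evaluation criterion amounts to a few lines of code. In this sketch each pair is a tuple of sentence-level surprisals (acceptable, unacceptable); the tuple layout and function name are assumptions, and ties are counted as incorrect because the criterion requires a strictly higher surprisal.

```python
def pair_accuracy(pairs):
    """Fraction of (acceptable, unacceptable) surprisal pairs in which the
    unacceptable sentence receives strictly higher surprisal."""
    correct = sum(1 for good, bad in pairs if bad > good)
    return correct / len(pairs)
```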

How much training data is needed? We experiment with training data sizes ranging from 10 to 10,000 sentences (Figure 5.2(a)). Compared to the massive amount of data needed for pretraining the LMs, we find that a modest corpus suffices for training the Gaussian anomaly model, and a plateau is reached after 1000 sentences for all three models. Therefore, we use 1000 training sentences (unless otherwise noted) for all subsequent experiments in this chapter.

Which layers are sensitive to anomaly? We vary $L$ from 0 to 12 in all three models (Figure 5.2(b)). The layer with the highest accuracy differs between models: layer 9 has the highest accuracy for BERT, 11 for RoBERTa, and 6 for XLNet. All models experience a sharp drop in the last layer, likely because the last layer is specialized for the MLM pretraining objective.

Comparisons to other models. Our best-performing model is RoBERTa, with an accuracy of 0.830. This is slightly higher than the best model reported in BLiMP (GPT-2, with accuracy 0.801). We do not claim to beat the state-of-the-art on BLiMP: Salazar et al. (2020) obtain a higher accuracy of 0.865 using RoBERTa-large. Even though the main goal of this work is not to maximize accuracy on BLiMP, our Gaussian anomaly model is competitive with other transformer-based models on this task.

5.3.3 Further ablation studies on Gaussian model

| Covariance | Accuracy |
|---|---|
| Full | 0.830 |
| Diagonal | 0.755 |
| Spherical | 0.752 |

Table 5.1: Varying the type of covariance matrix in the Gaussian model.

| Components | Accuracy |
|---|---|
| 1 | 0.830 |
| 2 | 0.841 |
| 4 | 0.836 |
| 8 | 0.849 |
| 16 | 0.827 |

Table 5.2: Using Gaussian mixture models (GMMs) with multiple components.

| Genre | Accuracy |
|---|---|
| Academic | 0.797 |
| Fiction | 0.840 |
| News | 0.828 |
| Spoken | 0.795 |
| All | 0.830 |

Table 5.3: Effect of the genre of training data.

| Kernel | Score |
|---|---|
| RBF | 0.738 |
| Linear | 0.726 |
| Polynomial | 0.725 |

Table 5.4: Using 1-SVM instead of GMM, with various kernels.

| Aggregation | Accuracy |
|---|---|
| Sum | 0.830 |
| Max | 0.773 |

Table 5.5: Two sentence-level aggregation strategies.

We explore some variations to our methodology of training the Gaussian model. All of these variations are evaluated on the full BLiMP dataset. In each experiment, unless otherwise noted, the language model is RoBERTa-base, using the second-to-last layer, and the Gaussian model has a full covariance matrix trained with 1000 sentences from the BNC corpus.

Covariance matrix. We vary the type of covariance matrix (Table 5.1). Diagonal and spherical covariance matrices perform worse than the full covariance matrix; this may be expected, as the full matrix has the most trainable parameters.
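The three covariance types, and the difference in their number of free parameters, can be illustrated as follows. The function names are invented for this sketch, and the estimators are the usual maximum-likelihood ones (per-dimension variances for the diagonal case, their mean for the spherical case), which is an assumption about how the ablation was carried out.

```python
import numpy as np


def constrained_covariance(X, kind="full"):
    """Maximum-likelihood covariance estimate under a structural constraint."""
    X = np.asarray(X, dtype=float)
    full = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    if kind == "full":
        return full
    if kind == "diagonal":
        # Keep per-dimension variances, drop all cross-dimension terms.
        return np.diag(np.diag(full))
    if kind == "spherical":
        # One shared variance for every dimension.
        return np.eye(full.shape[0]) * np.diag(full).mean()
    raise ValueError(f"unknown covariance type: {kind}")


def free_parameters(dim, kind):
    """Number of free covariance parameters for a given constraint."""
    return {"full": dim * (dim + 1) // 2, "diagonal": dim, "spherical": 1}[kind]
```

For 768-dimensional embeddings, the full matrix has 295,296 free parameters, against 768 for the diagonal case and 1 for the spherical case.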

Gaussian mixture models. We try GMMs with up to 16 mixture components (Table 5.2). We observe a small increase in accuracy compared to a single Gaussian, but the difference is too small to justify the increased training time.

Genre of training text. We sample from genres of BNC (each time with 1000 sentences) to train the Gaussian model (Table 5.3). The model performed worse when trained with the academic and spoken genres, and about the same with the fiction and news genres, perhaps because their vocabularies and grammars are more similar to those in the BLiMP sentences.

One-class SVM. We try replacing the Gaussian model with a one-class SVM (Schölkopf et al., 2000), another popular model for anomaly detection. We use the default settings from scikit-learn with three kernels (Table 5.4), but it performs worse than the Gaussian model on all settings.

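Sentence aggregation. Instead of Equation 5.5, we try defining sentence-level surprisal as the maximum surprisal among all tokens (Table 5.5):

$$\mathrm{surprisal}_L(t_1, \dots, t_n) = \max_{1 \le i \le n} G_i$$

where $G_i$ is the token-level surprisal of Equation 5.4; however, this performs worse than using the sum of token surprisals.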

5.3.4 Lower layers are sensitive to frequency

We notice that surprisal scores in the lower layers are sensitive to token frequency: higher frequency tokens produce embeddings close to the center of the Gaussian distribution, while lower frequency tokens are at the periphery. The effect gradually diminishes towards the upper layers.

Figure 5.3: Pearson correlation between token-level surprisal scores (Equation 5.4) and log frequency. The correlation is highest in the lower layers, and decreases in the upper layers.

To quantify the sensitivity to frequency, we compute token-level surprisal scores for 5000 sentences from BNC that were not used in training. We then compute the Pearson correlation between the surprisal score and log frequency for each token (Figure 5.3). In all three models, there is a high correlation between the surprisal score and log frequency at the lower layers, which diminishes at the upper layers. A small positive correlation persists until the last layer, except for XLNet, in which the correlation eventually disappears.
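The correlation analysis can be sketched as below, assuming one surprisal score and one corpus count per token occurrence. The function name and the use of raw corpus counts are assumptions. The sign of the result depends on whether frequency enters as log frequency or negative log frequency, which the text does not specify; this sketch uses the plain log count, so tokens that are rarer and more surprising yield a negative coefficient.

```python
import numpy as np


def surprisal_frequency_correlation(surprisals, counts):
    """Pearson correlation between token surprisal and log corpus count."""
    s = np.asarray(surprisals, dtype=float)
    log_freq = np.log(np.asarray(counts, dtype=float))
    return float(np.corrcoef(s, log_freq)[0, 1])
```

Running this once per layer gives the kind of layerwise curve shown in Figure 5.3.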

There do not appear to be any reports of this phenomenon in previous work. For static word vectors, Gong et al. (2018) found that embeddings for low-frequency words lie in a different region of the embedding space than high-frequency words. To visualize this phenomenon, we feed a random selection of BNC sentences into RoBERTa and use PCA to visualize the distribution of rare and frequent tokens at different layers (Figure 5.4). In all cases, we find that infrequent tokens occupy a different region of the embedding space from frequent tokens, similar to what Gong et al. (2018) observed for static word vectors. The Gaussian model fits the high-frequency region and assigns lower likelihoods to the low-frequency region, explaining the positive correlation at all layers, although it is still unclear why the correlation diminishes at upper layers.

Figure 5.4: PCA plot of randomly sampled RoBERTa embeddings at layers 1, 4, 7, and 10. Points are colored by token frequency: “Rare” means the 20% least frequent tokens, and “Frequent” is the other 80%.

5.4 Levels of linguistic anomalies

We turn to the question of whether LMs exhibit different behaviour when given inputs with different types of linguistic anomalies. The task of partitioning linguistic anomalies into several distinct classes can be challenging. Syntax and semantics have a high degree of overlap – there is no widely accepted criterion for distinguishing between ungrammaticality and semantic anomaly (e.g., Abrusán (2019) gives a survey of current proposals), and Poulsen (2012) challenges this dichotomy entirely. Similarly, Warren et al. (2015) noted that semantic anomalies depend somewhat on world knowledge.

Within a class, the anomalies are also heterogeneous (e.g., ungrammaticality may be due to violations of agreement, wh-movement, negative polarity item licensing, etc.), which might each affect the LMs differently. Thus, we define three classes of anomalies that do not attempt to cover all possible linguistic phenomena, but capture different levels of language processing while retaining internal uniformity:

  1. Morphosyntactic anomaly: an error in the inflected form of a word, for example, subject-verb agreement (*the boy eat the sandwich), or incorrect verb tense or aspect inflection (*the boy eaten the sandwich). In each case, the sentence can be corrected by changing the inflectional form of one word.

  2. Semantic anomaly: a violation of a selectional restriction, such as animacy (#the house eats the sandwich). In these cases, the sentence can be corrected by replacing one of the verb’s arguments with another one in the same word class that satisfies the verb’s selectional restrictions.

  3. Commonsense anomaly: the sentence describes a situation that is atypical or implausible in the real world but is otherwise acceptable (#the customer served the waitress).

| Type | Task | Correct Example | Incorrect Example |
|---|---|---|---|
| Morphosyntax | BLiMP (Subject-Verb) | These casseroles disgust Kayla. | These casseroles disgusts Kayla. |
| | BLiMP (Det-Noun) | Craig explored that grocery store. | Craig explored that grocery stores. |
| | Osterhout and Nicol (1999) | The cats won’t eat the food that Mary gives them. | The cats won’t eating the food that Mary gives them. |
| Semantic | BLiMP (Animacy) | Amanda was respected by some | Amanda was respected by some |
| | Pylkkänen and McElree (2007) | The pilot flew the airplane after the intense class. | The pilot amazed the airplane after the intense class. |
| | Warren et al. (2015) | Corey’s hamster explored a nearby backpack and filled it with sawdust. | Corey’s hamster entertained a nearby backpack and filled it with sawdust. |
| | Osterhout and Nicol (1999) | The cats won’t eat the food that Mary gives them. | The cats won’t bake the food that Mary gives them. |
| | Osterhout and Mobley (1995) | The plane sailed through the air and landed on the runway. | The plane sailed through the air and laughed on the runway. |
| Commonsense | Warren et al. (2015) | Corey’s hamster explored a nearby backpack and filled it with sawdust. | Corey’s hamster lifted a nearby backpack and filled it with sawdust. |
| | Federmeier and Kutas (1999) | “Checkmate,” Rosalie announced with glee. She was getting to be really good at chess. | “Checkmate,” Rosalie announced with glee. She was getting to be really good at monopoly. |
| | Chow et al. (2016) | The restaurant owner forgot which customer the waitress had served. | The restaurant owner forgot which waitress the customer had served. |
| | Urbach and Kutas (2010) | Prosecutors accuse defendants of committing a crime. | Prosecutors accuse sheriffs of committing a crime. |

Table 5.6: Example sentence pair for each of the 12 tasks. The 3 BLiMP tasks are generated from templates; the others are stimuli materials taken from psycholinguistic studies.

5.4.1 Summary of anomaly datasets

We use two sources of data for experiments on linguistic anomalies: synthetic sentences generated from templates, and materials from psycholinguistic studies. Both have advantages and disadvantages – synthetic data can be easily generated in large quantities, but the resulting sentences may be odd in unintended ways. Psycholinguistic stimuli are designed to control for confounding factors (e.g., word frequency) and human-validated for acceptability, but are smaller (typically fewer than 100 sentence pairs).

We curate a set of 12 tasks from BLiMP and 7 psycholinguistic studies [1]. Each sentence pair consists of a control and an anomalous sentence, so that all sentences within a task differ in a consistent manner. Table 5.6 shows an example sentence pair from each task.

[1] Several of these stimuli have been used in natural language processing research. Chersoni et al. (2018) used the data from Pylkkänen and McElree (2007) and Warren et al. (2015) to probe word vectors for knowledge of selectional restrictions. Ettinger (2020) used data from Federmeier and Kutas (1999) and Chow et al. (2016), which were referred to as CPRAG-102 and ROLE-88 respectively.

We summarize each dataset:

  1. BLiMP (Warstadt et al., 2020a): we use subject-verb and determiner-noun agreement tests as morphosyntactic anomaly tasks. For simplicity, we only use the basic regular sentences, and exclude sentences involving irregular words or distractor items. We also use the two argument structure tests involving animacy as a semantic anomaly task. All three BLiMP tasks therefore have 2000 sentence pairs.

  2. Osterhout and Nicol (1999): contains 90 sentence triplets containing a control, syntactic, and semantic anomaly. Syntactic anomalies involve a modal verb followed by a verb in -ing form; semantic anomalies have a selectional restriction violation between the subject and verb. There are also double anomalies (simultaneously syntactic and semantic) which we do not use.

  3. Pylkkänen and McElree (2007): contains 70 sentence pairs where the verb is replaced in the anomalous sentence with one that requires an animate object, thus violating the selectional restriction. In half the sentences, the verb is contained in an embedded clause.

  4. Warren et al. (2015): contains 30 sentence triplets with a possible condition, a selectional restriction violation between the subject and verb, and an impossible condition where the subject cannot carry out the action, i.e., a commonsense anomaly.

  5. Osterhout and Mobley (1995): we use data from experiment 2, containing 90 sentence pairs where the verb in the anomalous sentence is semantically inappropriate. The experiment also tested gender agreement errors, but we do not include these stimuli.

  6. Federmeier and Kutas (1999): contains 34 sentence pairs, where the final noun in each anomalous sentence is an inappropriate completion, but in the same semantic category as the expected completion.

  7. Chow et al. (2016): contains 44 sentence pairs, where two of the nouns in the anomalous sentence are swapped to reverse their roles. This is the only task in which the sentence pair differs by more than one token.

  8. Urbach and Kutas (2010): contains 120 sentence pairs, where the anomalous sentence replaces a patient of the verb with an atypical one.

5.4.2 Quantifying layerwise surprisal

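Let $D = \{(s_1, s'_1), \dots, (s_n, s'_n)\}$ be a dataset of sentence pairs, where $s_i$ is a control sentence and $s'_i$ is an anomalous sentence. For each layer $L$, we define the surprisal gap as the mean difference of surprisal scores between the control and anomalous sentences, scaled by the standard deviation:

$$\mathrm{surprisal\ gap}_L(D) = \frac{\mathbb{E}\left\{ \mathrm{surprisal}_L(s'_i) - \mathrm{surprisal}_L(s_i) \right\}_{i=1}^{n}}{\sigma\left\{ \mathrm{surprisal}_L(s'_i) - \mathrm{surprisal}_L(s_i) \right\}_{i=1}^{n}}$$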

The surprisal gap is a scale-invariant measure of sensitivity to anomaly, similar to a signal-to-noise ratio. While surprisal scores are unitless, the surprisal gap may be viewed as the number of standard deviations by which anomalous sentences trigger surprisal above control sentences. This is advantageous over accuracy scores, which treat the sentence pair as correct when the anomalous sentence has higher surprisal by any margin; this hard cutoff masks differences in the magnitude of surprisal. The metric also allows for fair comparison of surprisal scores across datasets of vastly different sizes. We plot the surprisal gap for all 12 tasks, using RoBERTa (Figure 5.5), BERT (Figure 5.6), and XLNet (Figure 5.7).
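A minimal sketch of the surprisal gap for one layer follows, taking two equal-length lists of sentence-level surprisals. The function name is illustrative, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not say which is meant.

```python
from statistics import mean, pstdev


def surprisal_gap(control, anomalous):
    """Mean of (anomalous - control) surprisal differences divided by their
    standard deviation (population version; an assumption)."""
    diffs = [a - c for c, a in zip(control, anomalous)]
    return mean(diffs) / pstdev(diffs)
```

Multiplying every surprisal score by the same positive constant leaves the result unchanged, which is the scale invariance noted above.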

Next, we compare the performance of the Gaussian model (GM) with the masked language model (MLM). We score each instance as correct if the masked probability of the correct word is higher than that of the anomalous word. One limitation of the MLM approach is that it requires the sentence pair to be identical in all places except for one token, since the LMs do not support modeling joint probabilities over multiple tokens. To ensure fair comparison between GM and MLM, we exclude instances where the differing token is out-of-vocabulary in any of the LMs (this excludes approximately 30% of instances). For the Gaussian model, we compute accuracy using the best-performing layer for each model (Section 5.3.2). The results are listed in Table 5.7.
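The filtering step for the MLM comparison can be sketched as below. Tokenization by whitespace, the function names, and the `vocab` argument (standing in for the set of words that are single tokens in every LM's vocabulary) are all assumptions made for illustration.

```python
def single_token_difference(control_tokens, anomalous_tokens):
    """Return the index of the only differing token, or None if the pair
    differs in length or in more than one position."""
    if len(control_tokens) != len(anomalous_tokens):
        return None
    diff = [i for i, (a, b) in enumerate(zip(control_tokens, anomalous_tokens))
            if a != b]
    return diff[0] if len(diff) == 1 else None


def usable_for_mlm(control_tokens, anomalous_tokens, vocab):
    """A pair is usable if exactly one token differs and both variants of
    that token are in the (hypothetical) shared vocabulary."""
    idx = single_token_difference(control_tokens, anomalous_tokens)
    if idx is None:
        return False
    return control_tokens[idx] in vocab and anomalous_tokens[idx] in vocab
```

Applied to the examples in Table 5.6, the Pylkkänen and McElree (2007) pair differs only at the verb, whereas the Chow et al. (2016) pair differs at two positions and is therefore rejected.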

| Type | Task | Size | BERT GM | BERT MLM | RoBERTa GM | RoBERTa MLM | XLNet GM | XLNet MLM |
|---|---|---|---|---|---|---|---|---|
| Morphosyntax | BLiMP (Subject-Verb) | 2000 | 0.953 | 0.955 | 0.971 | 0.957 | 0.827 | 0.584 |
| | BLiMP (Det-Noun) | 2000 | 0.970 | 0.999 | 0.983 | 0.999 | 0.894 | 0.591 |
| | Osterhout and Nicol (1999) | 90 | 1.000 | 1.000 | 1.000 | 1.000 | 0.901 | 0.718 |
| Semantic | BLiMP (Animacy) | 2000 | 0.644 | 0.787 | 0.767 | 0.754 | 0.675 | 0.657 |
| | Pylkkänen and McElree (2007) | 70 | 0.727 | 0.955 | 0.932 | 0.955 | 0.636 | 0.727 |
| | Warren et al. (2015) | 30 | 0.556 | 1.000 | 0.944 | 1.000 | 0.667 | 0.556 |
| | Osterhout and Nicol (1999) | 90 | 0.681 | 0.957 | 0.841 | 1.000 | 0.507 | 0.783 |
| | Osterhout and Mobley (1995) | 90 | 0.528 | 1.000 | 0.906 | 0.981 | 0.302 | 0.774 |
| Commonsense | Warren et al. (2015) | 30 | 0.600 | 0.550 | 0.750 | 0.450 | 0.300 | 0.600 |
| | Federmeier and Kutas (1999) | 34 | 0.458 | 0.708 | 0.583 | 0.875 | 0.625 | 0.667 |
| | Chow et al. (2016) | 44 | 0.591 | n/a | 0.432 | n/a | 0.568 | n/a |
| | Urbach and Kutas (2010) | 120 | 0.470 | 0.924 | 0.485 | 0.939 | 0.500 | 0.712 |

Table 5.7: Comparing accuracy scores between Gaussian anomaly model (GM) and masked language model (MLM) for all models and tasks. Asterisks indicate that the accuracy is not better than random (0.5), using a binomial test with threshold of for significance. The MLM results for Chow et al. (2016) are excluded because the control and anomalous sentences differ by more than one token. The best layers for each model (Section 5.3.2) are used for GM, and the last layer is used for MLM. Generally, MLM outperforms GM, and the difference is greater for semantic and commonsense tasks.
Figure 5.5: Layerwise surprisal gaps for all tasks using the RoBERTa model. Generally, a positive surprisal gap appears in earlier layers for morphosyntactic tasks than for semantic tasks; no surprisal gap appears at any layer for commonsense tasks.
Figure 5.6: Layerwise surprisal gaps for all tasks using the BERT model.
Figure 5.7: Layerwise surprisal gaps for all tasks using the XLNet model.
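The binomial test mentioned in the caption of Table 5.7 can be sketched with the standard library. This version computes an exact one-sided p-value for the null hypothesis of chance-level accuracy; whether the original analysis was one-sided or two-sided is not stated, so the one-sided form is an assumption, as is the function name.

```python
from math import comb


def binomial_p_value(correct, total, chance=0.5):
    """Exact one-sided p-value P(X >= correct) for X ~ Binomial(total, chance)."""
    return sum(
        comb(total, k) * chance ** k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )
```

With small stimulus sets, even accuracies well above 0.5 may fail to reach significance: for example, 5 correct out of 10 gives a p-value of about 0.62, and only 10 out of 10 gives a value below 0.001.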

5.5 Discussion

5.5.1 Anomaly type and surprisal

We first discuss the results from RoBERTa (Figure 5.5), where morphosyntactic anomalies generally appear earlier than semantic anomalies. The surprisal gap plot exhibits different patterns depending on the type of linguistic anomaly: morphosyntactic anomalies produce high surprisal relatively early (layers 3-4), while semantic anomalies produce low surprisals until later (layers 9 and above). Commonsense anomalies do not result in surprisals at any layer: the surprisal gap is near zero for all of the commonsense tasks. The observed difference between morphosyntactic and semantic anomalies is consistent with previous work (Tenney et al., 2019a), which found that syntactic information appeared earlier in BERT than semantic information.

One should be careful to avoid drawing conclusions from only a few experiments. A similar situation occurred in psycholinguistics research (Kutas et al., 2006): early results suggested that the N400 was triggered by semantic anomalies, while syntactic anomalies triggered the P600, a different type of ERP. However, subsequent experiments found exceptions to this rule, and now it is believed that the N400 cannot be categorized by any standard dichotomy, like syntax versus semantics (Kutas and Federmeier, 2011). In our case, Pylkkänen and McElree (2007) is an exception: the task is a semantic anomaly, but produces surprisals in early layers, similar to the morphosyntactic tasks. Hence it is possible that the dichotomy is something other than syntax versus semantics; we leave it to future work to determine more precisely what conditions trigger high surprisals in lower versus upper layers of LMs.

5.5.2 Comparing anomaly model with MLM

The masked language model (MLM) usually outperforms the Gaussian anomaly model (GM), but the difference is uneven. MLM performs much better than GM on commonsense tasks, slightly better on semantic tasks, and about the same or slightly worse on morphosyntactic tasks. It is not obvious why MLM should perform better than GM, but we note two subtle differences between the MLM and GM setups that may be contributing factors. First, the GM method adds up the surprisal scores for the whole sequence, while MLM only considers the softmax distribution at one token. Second, the input sequence for MLM always contains a [MASK] token, whereas GM takes the original unmasked sequences as input, so the representations are never identical between the two setups.

MLM generally outperforms GM, but it does not solve every task: all three LMs fail to perform above chance on the data from Warren et al. (2015). This set of stimuli was designed so that both the control and impossible completions are not very likely or expected, which may have caused the difficulty for the LMs. We excluded the task of Chow et al. (2016) for MLM because the control and anomalous sentences differed by more than one token.[2]

[2] Sentence pairs with multiple differing tokens are inconvenient for MLM to handle, but this is not a fundamental limitation. For example, Salazar et al. (2020) proposed a modification to MLM to handle such cases: they compute a pseudo-log-likelihood score for a sequence by replacing one token at a time with a [MASK] token, applying MLM to each masked sequence, and summing up the log likelihood scores.
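The pseudo-log-likelihood idea attributed above to Salazar et al. (2020) can be sketched in a few lines. This is an illustration only: `masked_log_prob` is a hypothetical callable standing in for a real masked language model, and the toy scorer below ignores context entirely.

```python
import math

MASK = "[MASK]"

def pseudo_log_likelihood(tokens, masked_log_prob):
    """Sum of log P(token_i | sequence with position i masked) over all positions.

    masked_log_prob(masked_tokens, position, target) is a hypothetical interface
    that returns the log probability of `target` at the masked position.
    """
    total = 0.0
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        total += masked_log_prob(masked, i, target)
    return total

def toy_masked_log_prob(masked_tokens, position, target):
    # Stand-in for an MLM: a fixed unigram table that ignores the context.
    unigram = {"the": 0.5, "cat": 0.25, "sat": 0.25}
    return math.log(unigram.get(target, 1e-6))

score = pseudo_log_likelihood(["the", "cat", "sat"], toy_masked_log_prob)
```

Because each sequence receives a single score, two sentences that differ in several tokens can be compared directly, at the cost of one forward pass per token.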

5.5.3 Differences between LMs

RoBERTa is the best-performing of the three LMs in both the GM and MLM settings: this is expected since it is trained with the most data and performs well on many natural language benchmarks. Surprisingly, XLNet is ill-suited for this task and performs worse than BERT, despite having a similar model capacity and training data.

The surprisal gap plots for BERT (Figure 5.6) and XLNet (Figure 5.7) show some differences from RoBERTa: only morphosyntactic tasks produce out-of-domain embeddings in these two models, and not semantic or commonsense tasks. Evidently, how LMs behave when presented with anomalous inputs is dependent on model architecture and training data size; we leave exploration of this phenomenon to future work.

5.6 Conclusion

We used Gaussian models to characterize out-of-domain embeddings at intermediate layers of Transformer language models. The model requires a relatively small amount of in-domain data. Our experiments revealed that out-of-domain points in lower layers correspond to low-frequency tokens, while grammatically anomalous inputs are out-of-domain in higher layers. Furthermore, morphosyntactic anomalies are recognized as out-of-domain starting from lower layers than semantic anomalies. Commonsense anomalies do not generate out-of-domain embeddings at any layer, even when the LM has a preference for the correct cloze completion. These results show that depending on the type of linguistic anomaly, LMs use different mechanisms to produce the output softmax distribution.

6.1 Introduction

This chapter continues our theme of probing language models using methods derived from linguistic theory. While the previous chapter focused on linguistic anomalies, here we shift our attention to examining argument structure in construction grammar theories and their representations in language models. Most probing work so far has investigated the linguistic knowledge of LMs on phenomena such as agreement, binding, licensing, and movement (Warstadt et al., 2020a; Hu et al., 2020) with a particular focus on determining whether a sentence is linguistically acceptable (Schütze, 1996). Relatively little work has attempted to determine whether the linguistic knowledge induced by LMs is more similar to a formal grammar of the sort postulated by mainstream generative linguistics (Chomsky, 1965, 1981, 1995), or to a network of form-meaning pairs as advocated by construction grammar (Goldberg, 1995, 2006).

Figure 6.1: Four argument structure constructions (ASCs) used by Bencini and Goldberg (2000), with example sentences (top right). Constructions are mappings between form (bottom left) and meaning (bottom right).

One area where construction grammar disagrees with many generative theories of language is in the analysis of the argument structure of verbs, that is, the specification of the number of arguments that a verb takes, their semantic relation to the verb, and their syntactic form (Levin and Rappaport Hovav, 2005). Lexicalist theories were long dominant in generative grammar (Chomsky, 1981; Kaplan and Bresnan, 1982; Pollard and Sag, 1987). In lexicalist theories, argument structure is assumed to be encoded in the lexical entry of the verb: for example, the verb visit is lexically specified as being transitive and as requiring a noun phrase object (Chomsky, 1986). In contrast, construction grammar suggests that argument structure is encoded in form-meaning pairs known as argument structure constructions (ASCs, Figure 6.1), which are distinct from verbs. The argument structure of a verb is determined by pairing it with an ASC (Goldberg, 1995). To date, a substantial body of psycholinguistic work has provided evidence for the psychological reality of ASCs in sentence sorting (Bencini and Goldberg, 2000; Gries and Wulff, 2005), priming (Ziegler et al., 2019), and novel verb experiments (Kaschak and Glenberg, 2000; Johnson and Goldberg, 2013).

Here we connect basic research in ASCs with neural probing by adapting several psycholinguistic studies to Transformer-based LMs and show evidence for the neural reality of ASCs. Our first case study is based on sentence sorting (Bencini and Goldberg, 2000); we discover that in English, German, Italian, and Spanish, LMs consider sentences that share the same construction to be more semantically similar than sentences sharing the main verb. Furthermore, this preference for constructional meaning only manifests in larger LMs (trained with more data), whereas smaller LMs rely on the main verb, an easily accessible surface feature. Human experiments with non-native speakers found a similarly increased preference for constructional meaning in more proficient speakers (Liang, 2002; Baicchi and Della Putta, 2019), suggesting commonalities in language acquisition between LMs and humans.

Our second case study is based on nonsense “Jabberwocky” sentences that nevertheless convey meaning when they are arranged in constructional templates (Johnson and Goldberg, 2013). We adapt the original priming experiment to LMs and show that RoBERTa is able to derive meaning from ASCs, even without any lexical cues. This finding offers counter-evidence to earlier claims that LMs are relatively insensitive to word order when constructing sentence meaning (Yu and Ettinger, 2020; Sinha et al., 2021). Our source code and data are available at:

6.2 Linguistic background

6.2.1 Construction grammar and ASCs

Construction grammar is a family of linguistic theories proposing that all linguistic knowledge consists of constructions: pairings between form and meaning where some aspects of form or meaning are not predictable from their parts (Fillmore et al., 1988; Kay and Fillmore, 1999; Goldberg, 1995, 2006). Common examples include idiomatic expressions such as under the weather (meaning “to feel unwell”), but many linguistic patterns are constructions, including morphemes (e.g., -ify), words (e.g., apple), and abstract patterns like the ditransitive and passive. In contrast to lexicalist theories of argument structure, construction grammar rejects the dichotomy between syntax and lexicon. In contrast to transformational grammar, it rejects any distinction between surface and underlying structure.

We focus on a specific family of constructions for which there is an ample body of psycholinguistic evidence: argument structure constructions (ASCs). ASCs are constructions that specify the argument structure of a verb (Goldberg, 1995). In the lexicalist, verb-centered view, argument structure is a lexical property of the verb, and the main verb of a sentence determines the form and meaning of the sentence (Chomsky, 1981; Kaplan and Bresnan, 1982; Pollard and Sag, 1987; Levin and Rappaport Hovav, 1995). For example, sneeze is intransitive (allowing no direct object) and hit is transitive (requiring one direct object). However, lexicalist theories encounter difficulties with sentences like “he sneezed the napkin off the table” since intransitive verbs are not permitted to have object arguments.

Rather than assuming multiple implausible senses for the verb “sneeze” with different argument structures, Goldberg (1995) proposed that ASCs operate on an arbitrary verb, altering its argument structure while at the same time modifying its meaning. For example, the caused-motion ASC adds a direct object and a path argument to the verb sneeze, with the semantics of causing the object to move along the path. Other ASCs include the transitive, ditransitive, and resultative (Figure 6.1), which specify the argument structure of a verb and interact with its meaning in different ways.

6.2.2 Psycholinguistic evidence for ASCs

| Verb | Transitive | Ditransitive | Caused-motion | Resultative |
| --- | --- | --- | --- | --- |
| Throw | Anita threw the hammer. | Chris threw Linda the pencil. | Pat threw the keys onto the roof. | Lyn threw the box apart. |
| Get | Michelle got the book. | Beth got Liz an invitation. | Laura got the ball into the net. | Dana got the mattress inflated. |
| Slice | Barbara sliced the bread. | Jennifer sliced Terry an apple. | Meg sliced the ham onto the plate. | Nancy sliced the tire open. |
| Take | Audrey took the watch. | Paula took Sue a message. | Kim took the rose into the house. | Rachel took the wall down. |
Table 6.1: Stimuli from Bencini and Goldberg (2000), consisting of a 4x4 design, with 4 different verbs and 4 different argument structure constructions.

Sentence sorting. Several psycholinguistic studies have found evidence for argument structure constructions using experimental methods. Among these, Bencini and Goldberg (2000) used a sentence sorting task to determine whether the verb or construction in a sentence was the main determinant of sentence meaning. Seventeen participants were given 16 index cards with sentences containing 4 verbs (throw, get, slice, and take) and 4 constructions (transitive, ditransitive, caused-motion, and resultative) and were instructed to sort them into 4 piles by overall sentence meaning (Table 6.1). The experimenters measured the deviation from a purely verb-based or construction-based sort, and found that on average, the piles were closer to a construction sort.

Non-native sentence sorting. The same set of experimental stimuli was used with L2 (non-native) English speakers. Gries and Wulff (2005) ran the experiment with 22 German native speakers, who preferred the construction-based sort over the verb-based sort, showing that constructional knowledge is not limited to native speakers. Liang (2002) ran the experiment on Chinese native speakers of 3 different English levels (46 beginner, 31 intermediate, and 33 advanced), and found that beginners preferred a verb-based sort, while advanced learners produced construction-based sorts similar to native speakers (Figure 6.2). Likewise, Baicchi and Della Putta (2019) found the same result in Italian native speakers with B1 and B2 English proficiency levels. Overall, these studies show evidence for ASCs in the mental representations of native and L2 English speakers alike, and furthermore, the preference for constructional over verb sorting increases with English proficiency.

Multilingual sentence sorting. Similar sentence sorting experiments have been conducted in other languages, with varying results. Kirsch (2019) ran a sentence sorting experiment in German with 40 participants and found that they mainly sorted by verb and rarely by construction. Baicchi and Della Putta (2019) ran an experiment with non-native learners of Italian (15 participants of B1 level and 10 participants of B2 level): both groups preferred the constructional sort, and similar to Liang (2002), the B2 learners sorted more by construction than the B1 learners. Vázquez (2004) ran an experiment in Spanish with 16 participants, and found approximately equal proportions of construction and verb sorts. In Italian and Spanish, some different constructions were substituted, as not all of the English constructions had an equivalent in these languages.

Priming. Another line of psycholinguistic evidence comes from priming studies. Priming refers to the condition where exposure to a (prior) stimulus influences the response to a later stimulus (Pickering and Ferreira, 2008). Bock and Loebell (1990) found that participants were more likely to produce sentences of a given syntactic structure when primed with a sentence of the same structure; Ziegler et al. (2019) argued that Bock and Loebell (1990) did not adequately control for lexical overlap, and instead, they showed that the construction must be shared for the priming effect to occur, not just shared abstract syntax.

Novel verbs. Even with unfamiliar words, there is evidence that constructions are associated with meaning. Kaschak and Glenberg (2000) constructed sentences with novel denominal verbs and found that participants were more likely to interpret a transfer event when the denominal verb was used in a ditransitive sentence (Tom crutched Lyn an apple) than a transitive one (Tom crutched an apple).

Johnson and Goldberg (2013) used a “Jabberwocky” priming task to show that abstract constructional templates are associated with meaning. Participants were primed with a nonsense sentence of a given construction (e.g., He daxed her the norp for the ditransitive construction), followed by a lexical decision task of quickly deciding if a string of characters was a real English word or a non-word. The word in the decision task was semantically congruent with the construction (gave) or incongruent (made); furthermore, they experimented with target words that were high-frequency (gave), low-frequency (handed), or semantically related but not associated with the construction (transferred). They found priming effects (faster lexical decision times) in all three conditions, with the strongest effect for the high-frequency condition, followed by the low-frequency and the semantically nonassociate conditions.

6.2.3 Related work in NLP

Chapter 3 of this thesis surveyed recent work in language model probing based on linguistic theories, although relatively few of these studies have approached probing from a construction grammar perspective. Madabushi et al. (2020) probed for BERT’s knowledge of constructions via a sentence pair classification task of predicting whether two sentences share the same construction. Their probe was based on data from Dunn (2017), who used an unsupervised algorithm to extract plausible constructions from corpora based on association strength. However, the linguistic validity of these automatically induced constructions is uncertain, and there is currently no human-labelled wide-coverage construction grammar dataset in any language suitable for probing. Other computational work focused on a few specific constructions, such as identifying caused-motion constructions in corpora (Hwang and Palmer, 2015) and annotating constructions related to causal language (Dunietz et al., 2015). Lebani and Lenci (2016) is the most similar to our work: they probed distributional vector space models for ASCs based on the Jabberwocky priming experiment by Johnson and Goldberg (2013).

In this work, we adapt several of the previously mentioned psycholinguistic studies to LMs: the sentence sorting experiments in Case study 1, and the Jabberwocky priming experiment in Case study 2. We choose these studies because their designs allow thousands of stimulus sentences to be generated automatically using templates, avoiding issues caused by small sample sizes from manually constructed sentences.

6.3 Case study 1: Sentence sorting

This section describes our adaptation of the sentence sorting experiments to Transformer LMs.

6.3.1 Methodology

Models. To simulate varying non-native English proficiency levels, we use MiniBERTa models (Warstadt et al., 2020b), trained with 1M, 10M, 100M, and 1B tokens. We also use the base RoBERTa model (Liu et al., 2019b), trained with 30B tokens. In other languages, there are no available pretrained checkpoints with varying amounts of pretraining data, so we use the mBERT model (Devlin et al., 2019) and a monolingual Transformer LM in each language.[1] We obtain sentence embeddings for our models by taking the average of their contextual token embeddings at the second-to-last layer (i.e., layer 11 for base RoBERTa). We use the second-to-last layer because the last layer is more specialized for the LM pretraining objective and less suitable for sentence embeddings (Liu et al., 2019a).

[1] We use monolingual German and Italian models from, and the monolingual Spanish model from Cañete et al. (2020).
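The pooling step described above can be sketched as follows. This is a minimal illustration on nested Python lists rather than real model outputs. The function name and toy values are hypothetical, and whether special tokens are included in the average is not stated in the text, so the sketch simply averages every token vector it is given.

```python
def sentence_embedding(hidden_states, layer=-2):
    """Mean-pool the token vectors of one layer into a sentence vector.

    hidden_states: list over layers; each layer is a list of token vectors.
    layer=-2 selects the second-to-last layer.
    """
    token_vectors = hidden_states[layer]
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# Toy input: 3 layers, 2 tokens, 2 dimensions (values are made up).
hidden_states = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0]],   # second-to-last layer
    [[9.0, 9.0], [9.0, 9.0]],   # last layer, ignored here
]
emb = sentence_embedding(hidden_states)
```

With Hugging Face Transformers, per-layer hidden states of this kind are returned when `output_hidden_states=True` is set on the model call.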

| Verb | Transitive | Ditransitive | Caused-motion | Resultative |
| --- | --- | --- | --- | --- |
| Slice | Harry sliced the bread. | Henry sliced Eric the box. | Sam sliced the ball onto the bed. | John sliced the book apart. |
| Kick | Thomas kicked the box. | Mike kicked Frank the ball. | Michael kicked the wall into the house. | James kicked the door open. |
| Cut | George cut the ball. | Adam cut Paul the tree. | Bill cut the box into the water. | Bob cut the bread apart. |
| Get | Tom got the book. | Andrew got Steve the door. | Jack got the fridge onto the elevator. | David got the ball stuck. |

Table 6.2: Example of our 4x4 sentence sorting stimuli, similar to those by Bencini and Goldberg (2000) in Table 6.1, but generated automatically using templates.

| Verb | Transitive | Ditransitive | Caused-motion | Resultative |
| --- | --- | --- | --- | --- |
| Werfen | Anita warf den Hammer. | Berta warf Linda den Bleistift. | Erika warf den Schlüsselbund auf das Dach. | Laura warf die Kisten auseinander. |
| Bringen | Michelle brachte das Buch. | Simone brachte Lydia eine Einladung. | Emma brachte den Ball ins Netz. | Leonie brachte die Stühle zusammen. |
| Schneiden | Karolin schnitt das Brot. | Luisa schnitt Paula einen Apfel. | Jennifer schnitt die Wurst auf den Teller. | Doris schnitt den Reifen auf. |
| Nehmen | Maria nahm die Uhr. | Sophia nahm Jasmin das Geld. | Helena nahm die Rosen in das Haus. | Theresa nahm das Plakat herunter. |

Table 6.3: German sentence sorting stimuli, obtained from Kirsch (2019).

| Verb | Transitive | Prepositional Dative | Caused-motion | Resultative |
| --- | --- | --- | --- | --- |
| Dare | Lauda dà un esame. | Carlo dà una mela a Maria. | Luca dà una spinta a Franco. | Paolo dà una verniciata di verde alla porta. |
| Fare | Mario fa una torta. | Luigi fa un piacere a Giovanna. | Fabio fa entrare la macchina in garage. | Stefano fa bruciare il sugo. |
| Mettere | Annalisa mette la giacca. | Riccardo mette il cappello al bambino. | Silvia mette la penna nel cassetto. | Filippo mette la casa in ordine. |
| Portare | Linda porta lo zaino. | Laura porta la pizza a Francesco. | Michele porta il libro in biblioteca. | Irene porta l’esercizio a termine. |

Table 6.4: Italian sentence sorting stimuli, obtained from Baicchi and Della Putta (2019).

| Verb | Transitive | Ditransitive | Unplanned Reflexive | Middle |
| --- | --- | --- | --- | --- |
| Romper | Carlos rompió el cristal. | Alfonso le rompió las gafas a Pepe. | A Juan se le rompieron los pantalones. | La porcelana se rompe con facilidad. |
| Doblar | Felipe dobló el periódico. | Pablo le dobló el brazo a Lucas. | A Pedro se le dobló el tobillo. | El aluminio se dobla bien. |
| Acabar | Leonardo acabó su tesis. | Tomás le acabó la pasta de dientes a Santi. | A Luis se le acabaron los cigarrillos. | Las carreras de 10 km se acaban sin problemas. |
| Cortar | Isidro cortó el pan. | Jorge le cortó el paso a Yago. | A Ignacio se le cortó la conexión. | Esta tela se corta muy bien. |

Table 6.5: Spanish sentence sorting stimuli, obtained from Vázquez (2004).

Template generation. We use templates to generate stimuli similar to the 4x4 design in the Bencini and Goldberg (2000) experiment. To ensure an adequate sample size, we run multiple empirical trials. In each trial, we sample 4 random distinct verbs from a pool of 10 verbs that are compatible with all 4 constructions (cut, hit, get, kick, pull, punch, push, slice, tear, throw). We then randomly fill in the slots for proper names, objects, and complements for each sentence according to its verb, such that the sentence is semantically coherent, and there is no lexical overlap among the sentences of any construction. Table 6.2 shows a set of template-generated sentences. In English, we generate 1000 sets of stimuli using this procedure.
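One trial of this procedure might look like the following sketch. The verb pool is taken from the text; the name, object, and complement pools are small hypothetical lists loosely based on the examples in Table 6.2, and the sketch does not enforce the verb-specific semantic coherence that the actual templates do.

```python
import random

# Verb pool from the text, mapped to past-tense forms.
PAST = {"cut": "cut", "hit": "hit", "get": "got", "kick": "kicked", "pull": "pulled",
        "punch": "punched", "push": "pushed", "slice": "sliced", "tear": "tore",
        "throw": "threw"}
CONSTRUCTIONS = ("transitive", "ditransitive", "caused-motion", "resultative")

# Hypothetical filler pools (illustrative only).
NAMES = ["Harry", "Henry", "Sam", "John", "Thomas", "Mike", "Frank", "Eric", "Bill"]
OBJECTS = ["bread", "box", "ball", "book", "door", "wall"]
PATHS = ["onto the bed", "into the house", "into the water"]
RESULTS = ["apart", "open"]

def generate_trial(rng):
    """Return 16 (verb, construction, sentence) triples: 4 verbs x 4 constructions."""
    verbs = rng.sample(sorted(PAST), 4)
    stimuli = []
    for construction in CONSTRUCTIONS:
        # Sample without replacement so names and objects do not repeat
        # among the four sentences of one construction.
        names = rng.sample(NAMES, 8)
        objects = rng.sample(OBJECTS, 4)
        for i, verb in enumerate(verbs):
            subj, recipient, obj, v = names[i], names[i + 4], objects[i], PAST[verb]
            if construction == "transitive":
                sentence = f"{subj} {v} the {obj}."
            elif construction == "ditransitive":
                sentence = f"{subj} {v} {recipient} the {obj}."
            elif construction == "caused-motion":
                sentence = f"{subj} {v} the {obj} {rng.choice(PATHS)}."
            else:
                sentence = f"{subj} {v} the {obj} {rng.choice(RESULTS)}."
            stimuli.append((verb, construction, sentence))
    return stimuli

trial = generate_trial(random.Random(0))
```

Repeating `generate_trial` with different seeds would yield the many independent stimulus sets that the clustering evaluation is run on.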

For other languages, we use the original stimuli from their respective publications. We present the sentence sorting stimuli for German (Table 6.3), Italian (Table 6.4), and Spanish (Table 6.5). German uses the same four constructions as English. Italian does not have the ditransitive construction but instead uses the prepositional dative construction to express transfer semantics. Spanish has no equivalents for the caused-motion and resultative constructions, so the authors in that experiment instead used the unplanned reflexive (expressing accidental or unplanned events), and the middle construction (expressing states pertaining to the subject).

Figure 6.2: English sentence sorting results for humans and LMs, measured by deviation from pure construction and verb sort (CDev and VDev). Non-native human results are from Liang (2002); native human results are from Bencini and Goldberg (2000).[3] LM results are obtained using MiniBERTas (Warstadt et al., 2020b) and RoBERTa (Liu et al., 2019b) on templated stimuli. The MiniBERTa models use between 1M and 1B tokens for pretraining, while RoBERTa uses 30B tokens. Error bars indicate 95% confidence intervals.

[3] Bencini and Goldberg (2000) ran the sentence sorting experiment twice, so we take the average of the two runs.

Figure 6.3: PCA plots of Bencini and Goldberg (2000) sentence sorting using the 1M and 100M MiniBERTa models and RoBERTa-base (30B). Figure best viewed in color.
Figure 6.4: Multilingual sentence sorting results for German (Kirsch, 2019), Italian (Baicchi and Della Putta, 2019), and Spanish (Vázquez, 2004). LM results are obtained using the same stimuli; we use both mBERT and a monolingual LM for each language.

Evaluation. Similar to the human experiments, we group the sentence embeddings into 4 clusters (not necessarily of the same size) using agglomerative clustering by Euclidean distance (Pedregosa et al., 2011). We then compute the deviation to a pure construction and pure verb sort using the Hungarian algorithm for optimal bipartite matching. This measures the minimal number of cluster assignment changes necessary to reach a pure construction or verb sort, ranging from 0 to 12. Thus, lower construction deviation indicates that constructional information is more salient in the LM’s embeddings.
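The deviation metric can be sketched as follows, assuming the cluster labels have already been produced by a clustering step (not shown). The text uses the Hungarian algorithm for the matching. Because there are only 4 clusters, this sketch swaps in a brute-force search over all 24 cluster-to-category assignments, which finds the same optimum. The function name is hypothetical, and the sketch assumes as many clusters as gold categories.

```python
from itertools import permutations

def sort_deviation(cluster_labels, gold_labels):
    """Minimal number of items that must change cluster to reach a pure sort.

    Brute-force stand-in for Hungarian matching; assumes the number of
    clusters equals the number of gold categories (4 in the text).
    """
    clusters = sorted(set(cluster_labels))
    categories = sorted(set(gold_labels))
    overlap = {(c, g): 0 for c in clusters for g in categories}
    for c, g in zip(cluster_labels, gold_labels):
        overlap[(c, g)] += 1
    # Best one-to-one assignment of clusters to categories.
    best = max(
        sum(overlap[(c, g)] for c, g in zip(clusters, perm))
        for perm in permutations(categories)
    )
    return len(cluster_labels) - best

# 16 sentences: 4 verbs x 4 constructions.
verbs = [v for v in range(4) for _ in range(4)]          # 0,0,0,0,1,1,1,1,...
constructions = [c for _ in range(4) for c in range(4)]  # 0,1,2,3,0,1,2,3,...

clusters = list(verbs)                          # a pure verb sort
vdev = sort_deviation(clusters, verbs)          # 0: already a verb sort
cdev = sort_deviation(clusters, constructions)  # 12: maximally far from a construction sort
```

The two extremes in the usage example correspond to the stated range of 0 to 12.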

6.3.2 Results and interpretation

Figure 6.2 shows the LM sentence sorting results for English. All differences are statistically significant. The smallest 1M MiniBERTa model is the only LM to prefer verb over construction sorting, and as the amount of pretraining data grows, the LMs increasingly prefer sorting by construction instead of by verb. This closely mirrors the trend observed in the human experiments. To visualize this effect, we apply principal components analysis (PCA) on sentence embeddings for the 1M and 100M token MiniBERTa models and RoBERTa-base (Figure 6.3). In RoBERTa, there is strong evidence of clustering based on constructions; the effect is unclear in the 100M model and nonexistent in the 1M model, visually confirming our quantitative evaluation based on the construction and verb deviation metrics.

The results for multilingual sorting are shown in Figure 6.4. Both mBERT and the monolingual LMs consistently prefer constructional sorting over verb sorting in all three languages, whereas the results from the human experiments are less consistent.

Our results show that RoBERTa can generalize meaning from abstract constructions without lexical overlap. Only larger LMs and English speakers of more advanced proficiency are able to make this generalization, while smaller LMs and less proficient speakers derive meaning more from surface features like lexical content. This finding agrees with Warstadt et al. (2020b), who found that larger LMs have an inductive bias towards linguistic generalizations, while smaller LMs have an inductive bias towards surface generalizations; this may explain the success of large LMs on downstream tasks. A small quantity of data (10M tokens) is sufficient for LMs to prefer the constructional sort, indicating that ASCs are relatively easy to learn: roughly on par with other types of linguistic knowledge, and requiring less data than commonsense knowledge (Zhang et al., 2021; Liu et al., 2021).

We note some limitations in these results, and reasons to avoid drawing unreasonably strong conclusions from them. Human sentence sorting experiments can be influenced by minor differences in the experimental setup: Bencini and Goldberg (2000) obtained significantly different results in two runs that only differed on the precise wording of instructions. In the German experiment (Kirsch, 2019), the author hypothesized that the participants were influenced by a different experiment that they had completed before the sentence sorting one. Given this experimental variation, we cannot attribute differences across languages to differences in their linguistic typology. Although LMs do not suffer from the same experimental variation, we cannot conclude statistical significance from the multilingual experiments, where only one set of stimuli is available in each language.

6.4 Case study 2: Jabberwocky constructions

We next adapt the “Jabberwocky” priming experiment from Johnson and Goldberg (2013) to LMs, and make several changes to the original setup to better assess the capabilities of LMs. Priming is a standard experimental paradigm in psycholinguistic research, but it is not directly applicable to LMs: existing methods simulate priming either by applying additional fine-tuning (Prasad et al., 2019), or by concatenating sentences that typically do not co-occur in natural text (Misra et al., 2020). Therefore, we instead propose a method to probe LMs for the same linguistic information using only distance measurements on their contextual embeddings.

6.4.1 Methodology

Template generation. We generate sentences for the four constructions randomly using the templates in Table 6.6. Instead of filling nonce words like norp into the templates as in the original study, we take an approach similar to Gulordava et al. (2018) and generate 5000 sentences for each construction by randomly filling real words of the appropriate part-of-speech into construction templates (Table 6.6). This gives nonsense sentences like “She traded her the epicenter”; we refer to these random words as Jabberwocky words. By using real words, we avoid any potential instability from feeding tokens into the model that it has never seen during pretraining. We obtain a set of singular nouns, past tense verbs, and adjectives from the Penn Treebank (Marcus et al., 1993), excluding words with fewer than 10 occurrences.
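The random filling step can be sketched as below. The word lists are tiny hypothetical stand-ins built from the example sentences; the text draws its words from the Penn Treebank with a frequency cutoff, which is not reproduced here. The templates follow the four constructions of this experiment.

```python
import random

CONSTRUCTIONS = ("ditransitive", "resultative", "caused-motion", "removal")

# Hypothetical stand-ins for the Penn Treebank word lists.
NOUNS = ["epicenter", "donut", "diamond", "corn"]
VERBS = ["traded", "flew", "cut", "surged", "registered", "awarded", "declined", "drove"]
ADJECTIVES = ["seasonal", "civil"]

def jabberwocky_sentence(construction, rng):
    """Fill one construction template with random real words."""
    subj = rng.choice(["He", "She"])
    pron = rng.choice(["him", "her"])
    verb = rng.choice(VERBS)
    if construction == "ditransitive":
        return f"{subj} {verb} {pron} the {rng.choice(NOUNS)}."
    if construction == "resultative":
        return f"{subj} {verb} it {rng.choice(ADJECTIVES)}."
    if construction == "caused-motion":
        return f"{subj} {verb} it on the {rng.choice(NOUNS)}."
    if construction == "removal":
        return f"{subj} {verb} it from {pron}."
    raise ValueError(f"unknown construction: {construction}")

rng = random.Random(0)
sentences = {c: [jabberwocky_sentence(c, rng) for _ in range(3)] for c in CONSTRUCTIONS}
```

Since the verb is drawn independently of the construction, any systematic effect of the construction on the verb's embedding cannot come from the verb's own lexical bias.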

| Construction | Template | Examples |
| --- | --- | --- |
| Ditransitive | S/he V-ed him/her the N. | She traded her the epicenter. He flew her the donut. |
| Resultative | S/he V-ed it Adj. | He cut it seasonal. She surged it civil. |
| Caused-motion | S/he V-ed it on the N. | He registered it on the diamond. She awarded it on the corn. |
| Removal | S/he V-ed it from him/her. | He declined it from her. She drove it from him. |
Table 6.6: Templates and example sentences for the Jabberwocky construction experiments. The templates are identical to the ones used in Johnson and Goldberg (2013), except that we use random real words instead of nonce words.
Figure 6.5: In our adapted Jabberwocky experiment, we measure the Euclidean distance from the Jabberwocky verb (traded) to the 4 prototype verbs, of which 1 is congruent with the construction of the sentence, and 3 are incongruent.

Verb embeddings. Our probing strategy is based on the assumption that the contextual embedding for a verb captures its meaning in context. Therefore, if LMs associate ASCs with meaning, we should expect the contextual embedding for the Jabberwocky verb to contain the meaning of the construction. Specifically, we measure the Euclidean distance to a prototype verb for each construction (Figure 6.5). These are verbs that Johnson and Goldberg (2013) selected whose meaning closely resembles the construction’s meaning: gave, made, put, and took for the ditransitive, resultative, caused-motion, and removal constructions, respectively.[4] We also run the same setup using lower frequency prototype verbs from the same study: handed, turned, placed, and removed.[5] As a control, we measure the Euclidean distance to the prototype verbs of the other three unrelated constructions.

[4] The reader may notice that the four constructions here are slightly different from Bencini and Goldberg (2000): the transitive construction is replaced with the removal construction in Johnson and Goldberg (2013).

[5] Johnson and Goldberg (2013) also included a third experimental condition using four verbs that are semantically related but not associated with the construction, but one of the verbs is very low-frequency (ousted), so we exclude this condition in our experiment.

The prototype verb embeddings are generated by averaging their contextual embeddings over a 4M-word subset of the British National Corpus (BNC; Leech, 1992). We use the second-to-last layer of RoBERTa-base, and in cases where a verb is split into multiple subwords, we take the embedding of the first subword token as the verb embedding.
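The distance comparison can be illustrated with a toy sketch. The two-dimensional vectors below are made up for readability and stand in for real contextual embeddings; the helper names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors (e.g., one per corpus occurrence)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def rank_prototypes(verb_embedding, prototypes):
    """Sort prototype verbs by Euclidean distance to the given verb embedding."""
    distances = {v: math.dist(verb_embedding, p) for v, p in prototypes.items()}
    return sorted(distances.items(), key=lambda item: item[1])

# Toy prototype embeddings; "gave" is averaged over two made-up occurrences.
prototypes = {
    "gave": mean_vector([[1.0, 0.0], [3.0, 0.0]]),  # ditransitive prototype
    "made": [0.0, 5.0],                             # resultative prototype
    "put": [-4.0, 0.0],                             # caused-motion prototype
    "took": [0.0, -6.0],                            # removal prototype
}

# Toy embedding of a Jabberwocky verb in a ditransitive sentence.
jabberwocky_verb = [2.0, 1.0]
ranking = rank_prototypes(jabberwocky_verb, prototypes)
```

In this toy case the congruent prototype ("gave") is the nearest one; the experiment asks whether the same holds on average for real embeddings across thousands of generated sentences.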

6.4.2 Results and interpretation

Figure 6.6: Euclidean distance between Jabberwocky and prototype verbs for congruent and incongruent conditions. Error bars indicate 95% confidence intervals.
Figure 6.7: Mean Euclidean distance between Jabberwocky and prototype verbs in each verb-construction pair. Diagonal entries (gray border) are the congruent conditions; off-diagonal entries are incongruent.

We find that the Euclidean distance between the prototype and Jabberwocky verb embeddings is significantly lower () when the verb is congruent with the construction than when they are incongruent, and this is observed for both high and low-frequency prototype verbs (Figure 6.6). Examining the individual constructions and verbs (Figure 6.7), we note that in the high-frequency scenario, the lowest distance prototype verb is always the congruent one, for all four constructions. In the low-frequency scenario, the result is less consistent: the congruent verb is not always the lowest distance one, although it is always still at most the second-lowest distance out of the four.
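
The per-construction comparison can be expressed as a rank of the congruent prototype within each row of the distance matrix. The sketch below assumes rows index constructions and columns index prototype verbs, with congruent pairs on the diagonal; the function name and the numbers are made-up placeholders for illustration and are not the values reported in Figure 6.7.

```python
def congruent_ranks(distances, constructions):
    """Rank of the congruent prototype verb in each row (1 = lowest distance).

    distances[i][j] is the mean distance between Jabberwocky verbs in
    construction i and the prototype verb of construction j.
    """
    ranks = {}
    for i, name in enumerate(constructions):
        row = distances[i]
        # Count how many prototypes are strictly closer than the congruent one.
        ranks[name] = 1 + sum(1 for d in row if d < row[i])
    return ranks

constructions = ["ditransitive", "resultative", "caused-motion", "removal"]

# Hypothetical placeholder distances, NOT results from the thesis.
toy_distances = [
    [5.0, 6.0, 7.0, 6.5],
    [6.2, 5.1, 6.8, 6.9],
    [6.4, 6.6, 5.3, 6.0],
    [6.1, 5.8, 6.3, 5.9],
]

ranks = congruent_ranks(toy_distances, constructions)
```

In this toy matrix the congruent verb ranks first for three constructions and second for the last one, which mirrors the kind of pattern described for the low-frequency scenario.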

The main result holds for both high and low-frequency scenarios, but the correct prototype verb is associated more consistently in the high-frequency case. This agrees with Wei et al. (2021), who found that LMs have greater difficulty learning the linguistic properties of less frequent words. We also note that the Euclidean distances are higher overall in the low-frequency scenario, which is consistent with previous work that found lower frequency words to occupy a peripheral region of the embedding space (Li et al., 2021).

6.4.3 Potential confounds

In any experiment, one must be careful to ensure that the observed patterns are due to the phenomenon under investigation rather than confounding factors. We discuss potential confounds arising from lexical overlap, anisotropy of contextual embeddings, and neighboring words.

Lexical overlap. The randomized experiment design ensures that the Jabberwocky words cannot be lexically biased towards any construction, since each verb is equally likely to occur in every construction. Technically, the lexical content of the four constructions is not identical: words like “from” (occurring only in the removal construction) or “on” (in the caused-motion construction) may provide hints to the sentence meaning. However, the ditransitive and resultative constructions do not contain any such informative words, yet RoBERTa still associates the correct prototype verb for these constructions, so we consider it unlikely to be relying solely on lexical overlap. There is substantial evidence that RoBERTa is able to associate abstract constructional templates with their meaning without lexical cues. This result is perhaps surprising, given that previous work found that LMs are relatively insensitive to word order in compositional phrases (Yu and Ettinger, 2020) and downstream inference tasks (Sinha et al., 2021; Pham et al., 2021), where their performance can be largely attributed to lexical overlap.


Anisotropy. Recent probing work has found that contextual embeddings suffer from anisotropy, where embeddings lie in a narrow cone and have much higher cosine similarity than expected if they were directionally uniform (Ethayarajh, 2019). Furthermore, a small number of dimensions dominate geometric measures such as Euclidean and cosine distance, resulting in a degradation of representation quality (Kovaleva et al., 2021; Timkey and van Schijndel, 2021). Since our experiments rely heavily on Euclidean distance, anisotropy is a significant concern. Following Timkey and van Schijndel (2021), we perform standardization by subtracting the mean vector and dividing each dimension by its standard deviation, where the mean and standard deviation for each dimension are computed from a sample of the BNC. We observe little difference after standardization: in both the high and low frequency scenarios, the Euclidean distances are lower for the congruent than the incongruent conditions, by a similar margin compared to the original experiment without standardization. We also run standardization on the first case study, and find that the results remain essentially unchanged: smaller LMs still prefer verb sorting while larger LMs prefer construction sorting. Thus, neither of our experiments appears to be affected by anisotropy.
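
The standardization step can be sketched as a per-dimension z-score. The code below is an illustrative assumption rather than our experimental code: the function names are hypothetical, the toy sample stands in for embeddings drawn from the BNC, the population standard deviation is used, and every dimension is assumed to have non-zero variance.

```python
import math

def dimension_stats(sample):
    """Per-dimension mean and (population) standard deviation of a sample."""
    n = len(sample)
    columns = list(zip(*sample))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(columns, means)
    ]
    return means, stds

def standardize(vector, means, stds):
    """Subtract the mean vector, then divide each dimension by its std."""
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy sample in which the second dimension has a much larger scale.
sample = [[0.0, 100.0], [2.0, 300.0]]
means, stds = dimension_stats(sample)

a = standardize(sample[0], means, stds)
b = standardize(sample[1], means, stds)

raw_distance = euclidean_distance(sample[0], sample[1])
standardized_distance = euclidean_distance(a, b)
```

Before standardization, the raw distance in this toy example is driven almost entirely by the large-scale second dimension; afterwards, both dimensions contribute equally.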

Neighboring words. A final confounding factor is our assumption that RoBERTa’s contextual embeddings represent word meaning, when in reality, they contain a mixture of syntactic and semantic information. Contextual embeddings are known to contain syntax trees (Hewitt and Manning, 2019) and linguistic information about neighboring words in a sentence (Klafka and Ettinger, 2020); although previous work did not consider ASCs, it is plausible that our verb embeddings leak information about the sentence’s construction in a similar manner. If this were the case, the prototype verb embedding for gave would contain not only the semantics of transfer that we intended, but also information about its usual syntactic form of “S gave NP1 NP2” (Footnote 6: Bresnan and Nikitina (2003) estimated that 87% of usages of the word “give” occur in the ditransitive construction), and both would be captured by our Euclidean distance measurement. Controlling for this syntactic confound is difficult – one could alternatively probe for transfer semantics without syntactic confounds using a natural language inference setup (e.g., whether the sentence entails the statement “NP1 received NP2”), but we leave further exploration of this idea to future work.

6.5 Conclusion

We found evidence for argument structure constructions in Transformer language models from two separate angles: sentence sorting and Jabberwocky construction experiments. Our work extended the existing body of literature on LM probing by taking a constructionist rather than a generative approach to linguistic probing. Our sentence sorting experiments identified a striking resemblance between humans’ and LMs’ internal language representations as LMs are exposed to increasing quantities of data, despite the differences between neural language models and the human brain. Our two studies suggest that LMs are able to derive meaning from abstract constructional templates with minimal lexical overlap. Both sets of experiments were inspired by psycholinguistic studies, which we adapted to fit the capabilities of LMs – this illustrates the potential for future work on grounding LM probing methodologies in psycholinguistic research.

7.1 Synopsis

In this dissertation, I explored ways in which Transformer-based language models can provide evidence to support theories in linguistics, and how linguistic theory can provide probing frameworks for interpreting language models. My research has built connections between natural language processing and linguistics (drawing on research from both the theoretical and experimental psycholinguistic sides of linguistics). The two fields have much to contribute to each other, so it is worthwhile for researchers of both disciplines to be familiar with the tools and theories of the other, and look for opportunities to apply cross-disciplinary ideas in their own work.

Chapter 1 introduced the problem of interpreting neural language models and the motivations for probing over other evaluation methods. In Chapter 2, I surveyed the models to be probed, beginning with word vector models from the onset of the deep learning revolution, and culminating with the highly engineered Transformer-based models that rank at the top of leaderboards today. In Chapter 3, I reviewed recent linguistic probing research that tested the outputs of language models via behavioural probes and the internals of models via representational probes. Here, a wide range of linguistic phenomena combined with a diverse assortment of probing methods led to many novel results about what linguistic knowledge our models are capable of, and in which areas they remain deficient.

The next three chapters of my thesis contained my own contributions to the field. In Chapter 4, I tackled word class flexibility, a problem that is controversial in linguistic typology because linguists disagree about how it should be analyzed and how it should be compared across languages. My approach used contextual embeddings to argue that word class flexibility should be treated as a directional phenomenon, based on semantic evidence automatically computed from corpora across multiple languages. Chapter 5 explored how different types of linguistic anomalies are represented differently in language models. Inspired by human language processing studies of event-related potentials that trigger depending on the type of anomaly, I devised a method to probe for similar patterns in language models. The experiments revealed a notable difference in how various types of anomalies are represented. Chapter 6 drew on the psycholinguistic literature more directly, adapting several influential experiments to probe language models. The original studies presented evidence for the psychological reality of argument structure constructions in humans, while my results demonstrated their existence in language models, via a parallel methodology.

Sadly, my thesis has come to an end, yet my discoveries leave many questions unanswered and opportunities for further exploration. I will next discuss some promising avenues for future work in the linguistic probing direction. I hope that my work will inspire future collaboration between natural language processing researchers and linguists.

7.2 Future directions

7.2.1 Which models to probe?

When engaging in probing research, a decision must be made at some point about which models to probe. BERT is the most popular choice, but many newer models have surpassed it in performance so that it is no longer state-of-the-art; it can be misleading to present its deficiencies as representative of language models in general, given that newer models may have improved in these aspects (Bowman, 2021). In my work, I used BERT, ELMo, RoBERTa, and XLNet in English experiments, and mBERT, XLM-R, and various monolingual models for non-English experiments; these are more or less the most popular models in the community at this time.

One may wonder how relevant this body of work will be in the future, when BERT and RoBERTa are surpassed by newer models. Indeed, many architectures such as ELECTRA (Clark et al., 2020) and DeBERTa (He et al., 2021) claim improvements over BERT and RoBERTa, but these newer models are rarely the subject of probing research. When probing sentence representations, models dedicated to the task such as Sentence-BERT (Reimers and Gurevych, 2019) are rarely used, despite their superior performance over average pooling over token vectors or taking the [CLS] vector, methods commonly used in probing setups.

Inevitably, newer models will exhibit similar linguistic patterns as current models in some cases, while differing in other cases. In my view, probing work will remain relevant despite newer models that behave differently, because the primary contributions are the novel methodologies to probe for various linguistic phenomena in continuous representations, and not the results of the probing experiments themselves. As long as newer models continue to use similar layers of continuous vectors, it is straightforward to adapt existing linguistic probing tests and obtain an assessment of the capabilities of the new model, using far less effort than inventing probing procedures from scratch.

The trouble is that when a probing procedure gives different results when applied to different models, it is often not possible to explain why, in a satisfactory manner. This limitation applies to Chapter 4 of this dissertation (where XLM-R performed worse than mBERT on judging similarity between noun-verb pairs), as well as Chapter 5 (where XLNet did not exhibit the same difference between anomaly types as RoBERTa). Explaining these differences is problematic because architectural differences are generally far removed from concepts in linguistic theory. For example, Sentence-BERT uses a siamese architecture with triplet loss to learn sentence embeddings; ELECTRA uses a pretraining task of predicting corrupted tokens instead of masked language modelling. Any attempts to find connections to linguistic theory would likely only be speculative.

As language model pretraining becomes more accessible, exploring these differences in a systematic manner will become more feasible. Some recent work investigated the effects of structural versus sequential model architecture (Hu et al., 2020), and genre of training data (Huebner et al., 2021) on probing performance. These experiments require training many variants of models to isolate the effects of each architectural parameter, and should become easier to perform in future work as language model tools and frameworks continue to improve.

7.2.2 Evidence from learnability

A common criticism of neural network probing research is its lack of relevance to linguistic theory (Baroni, 2021). Even as we analyze the linguistic abilities of BERT and other models in increasing detail, this type of work does not lead to an improved understanding of human language processing, so its impact outside of natural language processing will likely be limited. One promising direction is using language models to study learnability (e.g., Wilcox et al. (2021a)), an approach that is currently underexplored. The idea is that when neural networks are able to learn some linguistic feature from corpora alone, that constitutes evidence that no other mechanisms (such as innate grammar or interactions grounded in the real world) are necessary to learn the feature.

Learnability has been featured in studies of argument structure constructions as well. Goldberg et al. (2004) proposed that learning of ASCs is facilitated when the distribution of verbs in a construction is skewed towards a frequent prototypical verb (for example, “give” for the ditransitive construction), compared to a balanced distribution of several verbs. Their evidence came from studies of a child language corpus and an artificial language learning experiment in which subjects learned novel verbs and constructions. We are limited to indirect studies because it is impractical to manipulate the language input to children over their lifetime for an experiment. Unlike humans, language models can be trained from scratch on artificial data to serve as tools to test learnability hypotheses. In this example, we may train one model using a balanced distribution of verbs and train another model using a skewed distribution of verbs, and compare which one is more successful at learning ASCs by probing them. If such an experiment finds that skewed verb distributions are helpful for language models learning ASCs, the theory of construction learning in humans would be strengthened.
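
As a rough sketch of how the two training conditions might be constructed, the following code samples verbs for a construction under a balanced and a skewed distribution. Everything here is a hypothetical illustration: the verb list other than “give”, the 7:1:1:1 weighting, the sample size, and the function name are assumptions, and the subsequent steps of generating sentences, training models, and probing them are not shown.

```python
import random
from collections import Counter

def sample_verbs(verbs, weights, n, seed=0):
    """Draw n verbs with replacement according to the given weights."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.choices(verbs, weights=weights, k=n)

# Hypothetical verb inventory for the ditransitive construction.
verbs = ["give", "hand", "send", "pass"]

# Balanced condition: every verb is equally likely.
balanced = sample_verbs(verbs, [1, 1, 1, 1], 1000)

# Skewed condition: the prototypical verb dominates (assumed 7:1:1:1).
skewed = sample_verbs(verbs, [7, 1, 1, 1], 1000)

balanced_counts = Counter(balanced)
skewed_counts = Counter(skewed)
```

Each sampled verb would then be slotted into a construction template to produce the training corpus for one of the two models, keeping the total amount of data equal across conditions so that only the verb distribution differs.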

7.2.3 Psycholinguistic-based probing

Language model probing has a lot in common with psycholinguistics: the goal of both fields is to probe the internals of a language processing entity through indirect experimental methods. Psycholinguistics benefits from being a more mature field that is more closely aligned with linguistic theory, since many psycholinguistic studies are designed to support or refute theories of how language is processed cognitively.

Currently, only a tiny fraction of the numerous psycholinguistic publications in the last few decades have been considered for adaptation to neural network probing. Given this choice, one can either select studies that examine a specific linguistic phenomenon (using different methodologies), or select studies that employ the same methodology (studying different phenomena). In this thesis, I have mostly used the former strategy: Chapter 5 took data from multiple sources that contain linguistic anomalies, and Chapter 6 adapted studies on argument structure constructions. Other authors aggregated psycholinguistic studies using the same methodology, such as Michaelov and Bergen (2020), who focused on the N400 effect, and Prasad et al. (2019), who examined syntactic priming. In either case, adapting psycholinguistic work to neural network probing is an effective way of bridging the gap between theoretical linguistics and natural language processing, thereby improving our understanding of language models through the lens of linguistic theory.


  • M. Abrusán (2019) Semantic anomaly, pragmatic infelicity, and ungrammaticality. Annual Review of Linguistics 5, pp. 329–351. Cited by: §5.4.
  • V. Adams (1973) An introduction to modern English word formation. Longman, London. Cited by: §4.2.1.
  • A. Baicchi and P. Della Putta (2019) Constructions at work in foreign language learners’ mind: a comparison between two sentence-sorting experiments with English and Italian learners. Review of Cognitive Linguistics. Published under the auspices of the Spanish Cognitive Linguistics Association 17 (1), pp. 219–242. Cited by: Figure 6.4, §6.1, §6.2.2, §6.2.2, Table 6.5.
  • I. Balteiro (2007) The directionality of conversion in English: a dia-synchronic study. Linguistics Insights, Vol. 59, Peter Lang. Cited by: §4.2.3.
  • D. Barner and A. Bale (2002) No nouns, no verbs: psycholinguistic arguments in favor of lexical underspecification. Lingua 112, pp. 771–791. Cited by: §4.2.1.
  • M. Baroni (2021) On the proper role of linguistically-oriented deep net analysis in linguistic theorizing. arXiv preprint arXiv:2106.08694. Cited by: §1.2, §3.5, §7.2.2.
  • L. Bauer and S. Valera (Eds.) (2005a) Approaches to conversion/zero-derivation. Waxmann, Münster. Cited by: §4.1.
  • L. Bauer and S. Valera (2005b) Conversion or zero-derivation: an introduction. In Approaches to Conversion/Zero-derivation, L. Bauer and S. Valera (Eds.), pp. 7–18. Cited by: §4.3.3.
  • L. Bauer (2005) Conversion and the notion of lexical category. In Approaches to Conversion/Zero-derivation, L. Bauer and S. Valera (Eds.), pp. 19–30. Cited by: §4.2.4, item 1, §4.6.2.
  • G. M. Bencini and A. E. Goldberg (2000) The contribution of argument structure constructions to sentence meaning. Journal of Memory and Language 43 (4), pp. 640–651. Cited by: Figure 6.1, Figure 6.2, Figure 6.3, §6.1, §6.1, §6.2.2, §6.3.1, §6.3.2, Table 6.1, Table 6.5, footnote 3, footnote 4.
  • K. Bock and H. Loebell (1990) Framing sentences. Cognition 35 (1), pp. 1–39. Cited by: §3.4.2, §6.2.2.
  • K. Bock and C. A. Miller (1991) Broken agreement. Cognitive psychology 23 (1), pp. 45–93. Cited by: §3.2.1.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: §2.2.
  • G. Boleda (2020) Distributional semantics and linguistic theory. Annual Review of Linguistics 6, pp. 213–234. Cited by: Figure 2.1.
  • S. Bowman and G. Dahl (2021) What will it take to fix benchmarking in natural language understanding?. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4843–4855. Cited by: §1.1.
  • S. R. Bowman (2021) When combating hype, proceed with caution. arXiv preprint arXiv:2110.08300. Cited by: §7.2.1.
  • B. Bram (2011) Major total conversion in English: the question of directionality. Ph.D. Thesis, Victoria University of Wellington. Cited by: §4.2.3.
  • J. Bresnan and T. Nikitina (2003) The gradience of the dative alternation. Unpublished manuscript, Stanford University. Cited by: footnote 6.
  • J. Cañete, G. Chaperon, R. Fuentes, J. Ho, H. Kang, and J. Pérez (2020) Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020, Cited by: footnote 1.
  • T. Cao, C. Huang, D. Y. Hui, and J. P. Cohen (2020) A benchmark of medical out of distribution detection. arXiv preprint arXiv:2007.04250. Cited by: §5.3.1.
  • A. Carnie (2013) Syntax: a generative introduction. John Wiley & Sons. Cited by: §3.2.2, §3.2.2.
  • B. Cetnarowska (1993) The syntax, semantics and derivation of bare nominalizations. Uniwersytet Śla̧ski, Katowice. Cited by: §4.2.4.
  • E. Chersoni, A. T. Urrutia, P. Blache, and A. Lenci (2018) Modeling violations of selectional restrictions with distributional semantics. In Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing, pp. 20–29. Cited by: §5.2, footnote 1.
  • E. A. Chi, J. Hewitt, and C. D. Manning (2020) Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5564–5577. Cited by: §3.3.2.
  • N. Chomsky (1957) Syntactic structures. Mouton and Co.. Cited by: §1.3, §3.2.3, §5.1.
  • N. Chomsky (1965) Aspects of the theory of syntax. MIT Press. Cited by: §3.2.3, §6.1.
  • N. Chomsky (1981) Lectures on government and binding. Foris Publications, Dordrecht. Cited by: §3.3.3, §6.1, §6.1, §6.2.1.
  • N. Chomsky (1986) Knowledge of language. Praeger, New York. Cited by: §6.1.
  • N. Chomsky (1995) The minimalist program. The MIT Press, Cambridge, Massachusetts. Cited by: §6.1.
  • W. Chow, C. Smith, E. Lau, and C. Phillips (2016) A “bag-of-arguments” mechanism for initial verb predictions. Language, Cognition and Neuroscience 31 (5), pp. 577–596. Cited by: item 7, §5.5.2, Table 5.6, Table 5.7, footnote 1.
  • E. V. Clark and H. H. Clark (1979) When nouns surface as verbs. Language 55 (4), pp. 767–811. Cited by: §4.2.4.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, Cited by: §2.4, §7.2.1.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020) Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451. Cited by: §2.4.
  • A. Conneau and D. Kiela (2018) SentEval: an evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Cited by: §3.3.1.
  • A. Conneau and G. Lample (2019) Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pp. 7059–7069. Cited by: §2.4.
  • W. Croft (2003) Typology and universals. 2 edition, Cambridge University Press, Cambridge. Cited by: §4.1.
  • D. A. Cruse (1986) Lexical semantics. Cambridge University Press, Cambridge. Cited by: §4.2.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: Chapter 1, Figure 2.3, §2.4, §4.1, §5.1, §5.3.2, §6.3.1.
  • J. Don (2003) A note on conversion in Dutch and German. Linguistics in the Netherlands, pp. 33–44. Cited by: §4.1.
  • J. Don (2005) On conversion, relisting and zero-derivation: a comment on Rochelle Lieber: English word-formation processes. SKASE Journal of Theoretical Linguistics 2 (2), pp. 2–16. Cited by: §4.2.3.
  • J. Dunietz, L. Levin, and J. G. Carbonell (2015) Annotating causal language using corpus lexicography of constructions. In Proceedings of The 9th Linguistic Annotation Workshop, pp. 188–196. Cited by: §6.2.3.
  • J. Dunn (2017) Computational learning of construction grammars. Language and Cognition 9 (2), pp. 254–292. Cited by: §6.2.3.
  • J. L. Elman (1990) Finding structure in time. Cognitive science 14 (2), pp. 179–211. Cited by: §2.3.
  • K. Ethayarajh (2019) How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 55–65. Cited by: §6.4.3.
  • A. Ettinger (2020) What BERT is not: lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics 8, pp. 34–48. Cited by: §1.1, §3.4.1, §5.1, §5.2, footnote 1.
  • N. Evans and S. C. Levinson (2009) The myth of language universals: language diversity and its importance for cognitive science. Behavioral and Brain Sciences 32, pp. 429–492. Cited by: §4.1.
  • N. Evans and T. Osada (2005) Mundari: the myth of a language without word classes. Linguistic Typology 9 (3), pp. 351–390. Cited by: §4.1.
  • N. Evans (2000) Word classes in the world’s languages. In Morphologie/morphology: An international handbook on inflection and word-formation, G. Booij, C. Lehmann, and J. Mugdan (Eds.), pp. 708–732. Cited by: §4.1.
  • P. Farrell (2001) Functional shift as category underspecification. English Language and Linguistics 5 (1), pp. 109–130. Cited by: §4.2.1, §4.2.3.
  • K. D. Federmeier and M. Kutas (1999) A rose by any other name: long-term memory structure and sentence processing. Journal of memory and Language 41 (4), pp. 469–495. Cited by: item 6, Table 5.6, Table 5.7, footnote 1.
  • G. Fenk-Oczlon, A. Fenk, and P. Faber (2010) Frequency effects on the emergence of polysemy and homophony. International Journal of Information Technologies and Knowledge 4 (2), pp. 103–109. Cited by: §4.2.3.
  • C. J. Fillmore, P. Kay, and M. C. O’Connor (1988) Regularity and idiomaticity in grammatical constructions: the case of let alone. Language 64 (3), pp. 501–538. Cited by: §6.2.1.
  • S. L. Frank, L. J. Otten, G. Galli, and G. Vigliocco (2015) The ERP response to the amount of information conveyed by words in sentences. Brain and language 140, pp. 1–11. Cited by: §3.4.1.
  • D. Gentner (1982) Why nouns are learned before verbs: linguistic relativity versus natural partitioning. Center for the Study of Reading Technical Report; no. 257. Cited by: §4.6.2.
  • K. Gerdes, B. Guillaume, S. Kahane, and G. Perrier (2018) SUD or Surface-Syntactic Universal Dependencies: an annotation scheme near-isomorphic to UD. In Universal Dependencies Workshop 2018, Cited by: §3.3.2.
  • A. E. Goldberg, D. M. Casenhiser, and N. Sethuraman (2004) Learning argument structure generalizations. Cognitive Linguistics 15 (3), pp. 289–316. Cited by: §7.2.2.
  • A. E. Goldberg (2006) Constructions at work: the nature of generalization in language. Oxford University Press, Oxford. Cited by: §6.1, §6.2.1.
  • A. E. Goldberg (1995) Constructions: a construction grammar approach to argument structure. University of Chicago Press. Cited by: §1.3, §6.1, §6.1, §6.2.1, §6.2.1, §6.2.1.
  • C. Gong, D. He, X. Tan, T. Qin, L. Wang, and T. Liu (2018) FRAGE: frequency-agnostic word representation. In Advances in neural information processing systems, pp. 1334–1345. Cited by: §5.3.4.
  • S. T. Gries and S. Wulff (2005) Do foreign language learners also have constructions?. Annual Review of Cognitive Linguistics 3 (1), pp. 182–200. Cited by: §6.1, §6.2.2.
  • K. Gulordava, P. Bojanowski, É. Grave, T. Linzen, and M. Baroni (2018) Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1195–1205. Cited by: §3.2.1, §3.2.2, §5.1, §5.2, §6.4.1.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 107–112. Cited by: §1.1.
  • Z. S. Harris (1954) Distributional structure. Word 10 (2-3), pp. 146–162. Cited by: §2.2.
  • B. Hart and T. R. Risley (2003) The early catastrophe: the 30 million word gap by age 3. American educator 27 (1), pp. 4–9. Cited by: §3.3.3.
  • P. He, X. Liu, J. Gao, and W. Chen (2021) DeBERTa: decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations, Cited by: §7.2.1.
  • K. Hengeveld (1992) Non-verbal predication: theory, typology, diachrony. Mouton de Gruyter, Berlin, New York. Cited by: §4.1, §4.1.
  • K. Hengeveld (2013) Parts-of-speech systems as a basic typological determinant. In Flexible word classes: a typological study of underspecified parts-of-speech, E. Van Lier and J. Rijkhoff (Eds.), pp. 31–55. Cited by: §4.2.1.
  • L. T. Hennigen, A. Williams, and R. Cotterell (2020) Intrinsic probing through dimension selection. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 197–216. Cited by: §5.3.1.
  • J. Hewitt and P. Liang (2019) Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2733–2743. Cited by: §2.5.
  • J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4129–4138. Cited by: §1.2, §3.3.2, §5.2, §6.4.3.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.3.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339. Cited by: §2.3.
  • J. Hu, J. Gauthier, P. Qian, E. Wilcox, and R. Levy (2020) A systematic assessment of syntactic generalization in neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1725–1744. Cited by: §3.2.2, §5.2, §6.1, §7.2.1.
  • P. A. Huebner, E. Sulem, C. Fisher, and D. Roth (2021) BabyBERTa: learning more grammar with small-scale child-directed language. In Proceedings of the 25th Conference on Computational Natural Language Learning, pp. 624–646. Cited by: §3.3.3, §7.2.1.
  • J. D. Hwang and M. Palmer (2015) Identification of caused motion construction. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pp. 51–60. Cited by: §6.2.3.
  • C. Iacobini (2000) Base and direction of derivation. In Morphology: An International Handbook on Inflection and Word-Formation, G. Booij, C. Lehmann, and J. Mugdan (Eds.), pp. 865–876. Cited by: §4.1, §4.2.3.
  • M. Imai, L. Li, E. Haryu, H. Okada, K. Hirsh-Pasek, R. M. Golinkoff, and J. Shigematsu (2008) Novel noun and verb learning in Chinese-, English-, and Japanese-speaking children. Child development 79 (4), pp. 979–1000. Cited by: §4.6.2.
  • G. Jawahar, B. Sagot, and D. Seddah (2019) What does BERT learn about the structure of language?. In ACL 2019-57th Annual Meeting of the Association for Computational Linguistics, Cited by: §3.3.1.
  • O. Jespersen (1924) The philosophy of grammar. Allen & Unwin., London. Cited by: §4.2.1.
  • O. Jespersen (1942) A modern English grammar on historical principles. Part VI: morphology. Allen & Unwin., London. Cited by: §4.2.4.
  • M. A. Johnson and A. E. Goldberg (2013) Evidence for automatic accessing of constructional meaning: Jabberwocky sentences prime associated verbs. Language and Cognitive Processes 28 (10), pp. 1439–1452. Cited by: §6.1, §6.1, §6.2.2, §6.2.3, §6.4.1, §6.4, Table 6.6, footnote 4, footnote 5.
  • R. M. Kaplan and J. Bresnan (1982) Lexical functional grammar: A formal system for grammatical representation. In The Mental Representation of Grammatical Relations, J. Bresnan (Ed.), pp. 173–282. Cited by: §6.1, §6.2.1.
  • M. P. Kaschak and A. M. Glenberg (2000) Constructing meaning: the role of affordances and grammatical constructions in sentence comprehension. Journal of memory and language 43 (3), pp. 508–529. Cited by: §6.1, §6.2.2.
  • D. Kastovsky (2006) Typological changes in derivational morphology. In The Handbook of the History of English, A. van Kemenade and B. Los (Eds.), pp. 151–176. Cited by: §4.3.3.
  • P. Kay and C. J. Fillmore (1999) Grammatical constructions and linguistic generalizations: the What’s X Doing Y? construction. Language 75 (1), pp. 1–33. Cited by: §6.2.1.
  • M. Kelly, Y. Xu, J. Calvillo, and D. Reitter (2020) Which sentence embeddings and which layers encode syntactic structure? In Cognitive Science, pp. 2375–2381. Cited by: §3.3.1, §5.2.
  • S. Kirsch (2019) The psychological reality of argument structure constructions: a visual world eye tracking study. Unpublished MSc thesis, University of Freiburg. Cited by: Figure 6.4, §6.2.2, §6.3.2, Table 6.5.
  • M. Kisselew, L. Rimell, A. Palmer, and S. Padó (2016) Predicting the direction of derivation in English conversion. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 93–98. Cited by: §4.2.3.
  • J. Klafka and A. Ettinger (2020) Spying on your neighbors: fine-grained probing of contextual embeddings for information about surrounding words. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4801–4811. Cited by: §6.4.3.
  • O. Kovaleva, S. Kulshreshtha, A. Rogers, and A. Rumshisky (2021) BERT busters: outlier dimensions that disrupt Transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3392–3405. Cited by: §6.4.3.
  • A. Kulmizev, V. Ravishankar, M. Abdou, and J. Nivre (2020) Do neural language models show preferences for syntactic formalisms? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4077–4091. Cited by: §3.3.2.
  • M. Kutas and K. D. Federmeier (2011) Thirty years and counting: finding meaning in the N400 component of the event-related brain potential (ERP). Annual review of psychology 62, pp. 621–647. Cited by: §3.4.1, §5.5.1.
  • M. Kutas, C. K. Van Petten, and R. Kluender (2006) Psycholinguistics electrified II (1994–2005). In Handbook of psycholinguistics, pp. 659–724. Cited by: §1.3, §3.4.1, §5.1, §5.5.1.
  • I. Kuznetsov and I. Gurevych (2020) A matter of framing: The impact of linguistic formalism on probing results. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 171–182. Cited by: §3.3.2.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations. Cited by: §2.4.
  • S. Lappin (2021) Deep learning and linguistic representation. CRC Press. Cited by: §1.2.
  • J. H. Lau, A. Clark, and S. Lappin (2017) Grammaticality, acceptability, and probability: a probabilistic view of linguistic knowledge. Cognitive science 41 (5), pp. 1202–1241. Cited by: §3.2.1, §3.2.3.
  • G. E. Lebani and A. Lenci (2016) “Beware the Jabberwock, dear reader!” Testing the distributional reality of construction semantics. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), pp. 8–18. Cited by: §6.2.3.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems 31, pp. 7167–7177. Cited by: §5.3.1.
  • G. Leech, R. Garside, and M. Bryant (1994) CLAWS4: the tagging of the British National Corpus. In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics. Cited by: §4.3.2.
  • G. N. Leech (1992) 100 million words of English: the British National Corpus (BNC). Language Research 28, pp. 1–13. Cited by: §4.3.2, §5.3.2, §6.4.1.
  • A. Lenci (2018) Distributional models of word meaning. Annual review of Linguistics 4, pp. 151–171. Cited by: §2.2.
  • B. Levin and M. Rappaport Hovav (1995) Unaccusativity in the syntax-lexical semantics interface. MIT Press, Cambridge, MA. Cited by: §6.2.1.
  • B. Levin and M. Rappaport Hovav (2005) Argument realization. Cambridge University Press, Cambridge, UK. Cited by: §6.1.
  • B. Li, G. Thomas, Y. Xu, and F. Rudzicz (2020) Word class flexibility: a deep contextualized approach. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 983–994. Cited by: Chapter 4.
  • B. Li, Z. Zhu, G. Thomas, F. Rudzicz, and Y. Xu (2022) Neural reality of argument structure constructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Online. Cited by: Chapter 6.
  • B. Li, Z. Zhu, G. Thomas, Y. Xu, and F. Rudzicz (2021) How is BERT surprised? Layerwise detection of linguistic anomalies. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 4215–4228. Cited by: Chapter 5, §6.4.2.
  • J. Liang (2002) Sentence comprehension by Chinese learners of English: Verb-centered or construction-based? Unpublished MA thesis, Guangdong University of Foreign Studies. Cited by: Figure 6.2, §6.1, §6.2.2, §6.2.2.
  • T. Linzen, E. Dupoux, and Y. Goldberg (2016) Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4, pp. 521–535. Cited by: §3.2.1, §5.2.
  • L. Z. Liu, Y. Wang, J. Kasai, H. Hajishirzi, and N. A. Smith (2021) Probing across time: What does RoBERTa know and when? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: Findings. Cited by: §3.3.3, §6.3.2.
  • N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019a) Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1073–1094. Cited by: §3.3.1, §6.3.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019b) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: Chapter 1, §2.4, §5.1, §5.3.2, Figure 6.2, §6.3.1.
  • H. T. Madabushi, L. Romain, D. Divjak, and P. Milin (2020) CxGBERT: BERT meets construction grammar. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 4020–4032. Cited by: §6.2.3.
  • A. Madsen, S. Reddy, and S. Chandar (2021) Post-hoc interpretability for neural NLP: a survey. arXiv preprint arXiv:2108.04840. Cited by: §1.1.
  • S. Manova (2011) Understanding morphological rules. Springer. Cited by: §4.1.
  • H. Marchand (1964) A set of criteria of derivational relationship between words unmarked by derivational morphemes. Indogermanische Forschungen 69, pp. 10–19. Cited by: §4.1, §4.2.3.
  • H. Marchand (1969) The categories and types of present-day English word-formation. Beck, München. Cited by: §4.2.1, §4.2.4.
  • M. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19 (2), pp. 313–330. Cited by: §6.4.1.
  • R. Marvin and T. Linzen (2018) Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1192–1202. Cited by: §3.2.2.
  • E. Metheniti, T. Van de Cruys, and N. Hathout (2020) How relevant are selectional preferences for Transformer-based language models? In Proceedings of the 28th International Conference on Computational Linguistics, pp. 1266–1278. Cited by: §5.2.
  • A. Miaschi, D. Brunato, F. Dell’Orletta, and G. Venturi (2020) Linguistic profiling of a neural language model. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 745–756. Cited by: §1.2, §3.3.1.
  • J. Michaelov and B. Bergen (2020) How well does surprisal explain N400 amplitude under different experimental conditions? In Proceedings of the 24th Conference on Computational Natural Language Learning, pp. 652–663. Cited by: §3.4.1, §7.2.3.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2.2.
  • K. Misra, A. Ettinger, and J. Rayz (2020) Exploring BERT’s sensitivity to lexical cues using tests from semantic priming. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4625–4635. Cited by: §3.4.2, §6.4.
  • J. Niu and G. Penn (2020) Grammaticality and language modelling. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pp. 110–119. Cited by: §3.2.3.
  • L. Osterhout and L. A. Mobley (1995) Event-related brain potentials elicited by failure to agree. Journal of Memory and language 34 (6), pp. 739–773. Cited by: item 5, Table 5.6, Table 5.7.
  • L. Osterhout and J. Nicol (1999) On the distinctiveness, independence, and time course of the brain responses to syntactic and semantic anomalies. Language and cognitive processes 14 (3), pp. 283–317. Cited by: item 2, Table 5.6, Table 5.7.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: Machine learning in Python. Journal of machine Learning research 12, pp. 2825–2830. Cited by: §6.3.1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §2.2, §4.4.1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of NAACL-HLT, pp. 2227–2237. Cited by: §4.1.
  • T. Pham, T. Bui, L. Mai, and A. Nguyen (2021) Out of order: how important is the sequential order of words in a sentence in natural language understanding tasks? In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, pp. 1145–1160. Cited by: §6.4.3.
  • M. J. Pickering and V. S. Ferreira (2008) Structural priming: a critical review. Psychological bulletin 134 (3), pp. 427. Cited by: §3.4.2, §6.2.2.
  • A. Podolskiy, D. Lipin, A. Bout, E. Artemova, and I. Piontkovskaya (2021) Revisiting Mahalanobis distance for Transformer-based out-of-domain detection. In 35th AAAI Conference on Artificial Intelligence (AAAI 2021). Cited by: §5.3.1.
  • M. Polinsky (2012) Headedness, again. Theories of Everything. In Honor of Ed Keenan. Los Angeles: UCLA Department of Linguistics, pp. 348–359. Cited by: §4.6.1.
  • C. Pollard and I. A. Sag (1987) Information-based syntax and semantics. Vol. 13, CSLI, Stanford. Cited by: §6.1, §6.2.1.
  • M. Poulsen (2012) The usefulness of the grammaticality–acceptability distinction in functional approaches to language. Acta Linguistica Hafniensia 44 (1), pp. 4–21. Cited by: §5.4.
  • G. Prasad, M. van Schijndel, and T. Linzen (2019) Using priming to uncover the organization of syntactic representations in neural language models. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 66–76. Cited by: §3.4.2, §6.4, §7.2.3.
  • L. Pylkkänen and B. McElree (2007) An MEG study of silent meaning. Journal of cognitive neuroscience 19 (11), pp. 1905–1921. Cited by: item 3, §5.5.1, Table 5.6, Table 5.7, footnote 1.
  • R. Quirk, S. Greenbaum, G. Leech, and J. Svartvik (1985) A comprehensive grammar of the English language. Longman, London. Cited by: §4.2.1.
  • E. Rabinovich, J. Watson, B. Beekhuizen, and S. Stevenson (2019) Say anything: automatic semantic infelicity detection in L2 English indefinite pronouns. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 77–86. Cited by: §5.2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. OpenAI Technical Report. Cited by: Chapter 1, §2.4.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789. Cited by: §1.1.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Cited by: §7.2.1.
  • A. Rogers, O. Kovaleva, and A. Rumshisky (2021) A primer in BERTology: what we know about how BERT works. Transactions of the Association for Computational Linguistics 8, pp. 842–866. Cited by: §2.5.
  • J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff (2020) Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2699–2712. Cited by: §2.4, §5.3.2, footnote 2.
  • R. Sasano and A. Korhonen (2020) Investigating word-class distributions in word vector spaces. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3657–3666. Cited by: §5.2.
  • B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt (2000) Support vector method for novelty detection. In Advances in neural information processing systems, pp. 582–588. Cited by: §5.3.3.
  • C. T. Schütze (1996) The empirical base of linguistics: grammaticality judgments and linguistic methodology. University of Chicago Press. Cited by: §6.1.
  • K. Sinha, P. Parthasarathi, J. Pineau, and A. Williams (2021) UnNatural Language Inference. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 7329–7346. Cited by: §6.1, §6.4.3.
  • P. Ştekauer, S. Valera, and L. Körtvélyessy (2012) Word-formation in the world’s languages. Cambridge University Press. Cited by: §4.1.
  • M. Straka and J. Straková (2017) Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99. Cited by: §4.3.2.
  • I. Tenney, D. Das, and E. Pavlick (2019a) BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593–4601. Cited by: §2.5, §3.3.1, §4.1, §5.2, §5.5.1.
  • I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. Van Durme, S. Bowman, D. Das, et al. (2019b) What do you learn from context? probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations, ICLR 2019. Cited by: Figure 2.4, §2.5, §2.5.
  • W. Timkey and M. van Schijndel (2021) All bark and no bite: rogue dimensions in transformer language models obscure representational quality. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 4527–4546. Cited by: §6.4.3.
  • D. Tuggy (1993) Ambiguity, polysemy, and vagueness. Cognitive Linguistics 4 (3), pp. 273–290. Cited by: §4.2.2.
  • T. P. Urbach and M. Kutas (2010) Quantifiers more or less quantify on-line: ERP evidence for partial incremental interpretation. Journal of Memory and Language 63 (2), pp. 158–179. Cited by: item 8, Table 5.6, Table 5.7.
  • S. Valera and A. E. Ruz (2020) Conversion in English: homonymy, polysemy and paronymy. English Language and Linguistics, pp. 1–24. Cited by: §4.2.2.
  • E. Van Lier and J. Rijkhoff (Eds.) (2013) Flexible word classes: a typological study of underspecified parts-of-speech. Oxford University Press, Oxford. Cited by: §1.3, §4.1.
  • M. van Schijndel and T. Linzen (2021) Single-stage prediction models do not explain the magnitude of syntactic disambiguation difficulty. Cognitive Science 45 (6), pp. e12988. Cited by: §3.5.
  • M. van Schijndel, A. Mueller, and T. Linzen (2019) Quantity doesn’t buy quality syntax with neural language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5831–5837. Cited by: §1.1, §3.3.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Figure 2.3, §2.4.
  • M. M. Vázquez (2004) Learning argument structure generalizations in a foreign language. Vigo International Journal of Applied Linguistics (1), pp. 151–165. Cited by: Figure 6.4, §6.2.2, Table 6.5.
  • P. Vogel and B. Comrie (Eds.) (2000) Approaches to the typology of word classes. Mouton de Gruyter, Berlin and New York. Cited by: §4.1.
  • E. Voita and I. Titov (2020) Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 183–196. Cited by: §2.5.
  • A. M. Vonen (1994) Multifunctionality and morphology in Tokelau and English. Nordic Journal of Linguistics 17, pp. 155–178. Cited by: §4.1.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019a) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3266–3280. Cited by: §1.1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019b) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019. Cited by: §3.2.2.
  • T. Warren, E. Milburn, N. D. Patson, and M. W. Dickey (2015) Comprehending the impossible: what role do selectional restriction violations play? Language, cognition and neuroscience 30 (8), pp. 932–939. Cited by: item 4, §5.4, §5.5.2, Table 5.6, Table 5.7, footnote 1.
  • A. Warstadt and S. R. Bowman (2020) Can neural networks acquire a structural bias from raw linguistic data? In Proceedings of the Annual Meeting of the Cognitive Science Society. Cited by: §3.3.3.
  • A. Warstadt, A. Parrish, H. Liu, A. Mohananey, W. Peng, S. Wang, and S. R. Bowman (2020a) BLiMP: the benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics 8, pp. 377–392. Cited by: §1.1, §3.2.2, Table 3.1, §5.1, §5.1, §5.2, §5.3.2, item 1, §6.1.
  • A. Warstadt, A. Singh, and S. R. Bowman (2019) Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7, pp. 625–641. Cited by: §3.2.2.
  • A. Warstadt, Y. Zhang, X. Li, H. Liu, and S. R. Bowman (2020b) Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Cited by: §3.3.3, §3.3.3, Figure 6.2, §6.3.1, §6.3.2.
  • J. Wei, D. Garrette, T. Linzen, and E. Pavlick (2021) Frequency effects on syntactic rule learning in transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). Cited by: §6.4.2.
  • J. C. White, T. Pimentel, N. Saphra, and R. Cotterell (2021) A non-linear structural probe. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 132–138. Cited by: §3.3.2.
  • E. Wilcox, R. Futrell, and R. Levy (2021a) Using computational models to test syntactic learnability. Lingbuzz Preprint: lingbuzz/006327. Cited by: §3.5, §7.2.2.
  • E. Wilcox, P. Vani, and R. Levy (2021b) A targeted assessment of incremental processing in neural language models and humans. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 939–952. Cited by: §3.2.3.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: footnote 2, §5.3.2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763. Cited by: §2.4, §5.1, §5.3.2.
  • L. Yu and A. Ettinger (2020) Assessing phrasal representation and composition in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4896–4907. Cited by: §1.1, §3.3.1, §6.1, §6.4.3.
  • D. Zeman, J. Nivre, M. Abrams, et al. (2019) Universal dependencies 2.5. Note: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. Cited by: §3.3.2, §4.1, §4.3.2.
  • Y. Zhang, A. Warstadt, H. Li, and S. R. Bowman (2021) When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics. Cited by: §3.3.3, §6.3.2.
  • J. Ziegler, G. Bencini, A. Goldberg, and J. Snedeker (2019) How abstract is syntax? Evidence from structural priming. Cognition 193, pp. 104045. Cited by: §6.1, §6.2.2.
  • G. K. Zipf (1949) Human behavior and the principle of least effort: an introduction to human ecology. Addison-Wesley Press. Cited by: §4.2.3.