
Word meaning in minds and machines

by Brenden M. Lake et al.

Machines show an increasingly broad set of linguistic competencies, thanks to recent progress in Natural Language Processing (NLP). Many algorithms stem from past computational work in psychology, raising the question of whether they understand words as people do. In this paper, we compare how humans and machines represent the meaning of words. We argue that contemporary NLP systems are promising models of human word similarity, but they fall short in many other respects. Current models are too strongly linked to the text-based patterns in large corpora, and too weakly linked to the desires, goals, and beliefs that people use words in order to express. Word meanings must also be grounded in vision and action, and capable of flexible combinations, in ways that current systems are not. We pose concrete challenges for developing machines with a more human-like, conceptual basis for word meaning. We also discuss implications for cognitive science and NLP.



1 Introduction

Psychological semantics is the study of how people represent the meanings of words and then build sentence meaning out of those representations. People use language dozens of times a day—to have conversations and give instructions, to read and write, to label objects and teach. A theory of psychological semantics must provide the basis for how people do all those things, choosing which words to use and understanding the words they read or hear. In this article we focus on the mental representation of word meaning.

Human language is still the gold standard for a communication system, but artificial intelligence (AI) systems have made important progress in language use. Research on Natural Language Processing (NLP) develops systems that understand language to the degree that computers can carry out useful tasks. As described below, such systems use vast text corpora to learn about words, using neural networks and other statistical models. The recent explosion of research in NLP, driven largely by advances in neural networks (also called deep learning), has resulted in continuously improving performance on various benchmarks that require interpreting words and sentences. Systems are now used in interfaces with customers to make sales or solve problems. Some systems even perform tasks that were historically assumed to be solely within the purview of humans, such as translation, summarization, question answering, and natural language inference.

One way to think about such progress is merely in terms of engineering: There is a job to be done, and if the system does it well enough, it is successful. Engineering is important, and it can result in better and faster performance and relieve humans of dull labor such as keying in answers or making airline itineraries or buying socks. Ultimately, tasks such as machine translation, automatic summarization, and human-machine communication may change our world for the better. Doing these things can certainly be described as semantic processing, but they may not be the same semantic processing that human speakers and listeners engage in.

The continuous progress of these systems has led some researchers to suggest that they are potential models of psychological semantics (Section 4.3). That is, the representations that they derive for words are functionally similar to those that people derive through language learning. A stronger (and more implausible) claim is that the way people learn word meanings is similar to the way models do. We will focus on the first claim in this paper, which is perhaps surprisingly held by a number of psychologists, or at least considered as a reasonable hypothesis. Many AI researchers do not dwell on whether their models are human-like. If someone could develop a highly accurate machine translation system, few would complain that it doesn’t do things the way human translators do. We will argue that contemporary NLP techniques may indeed do many things well, but models will need to push beyond current trends in order to provide a theory of psychological semantics.

We will not suggest that NLP should redirect its efforts to building models of psychological semantics. As we discuss below, NLP technologies have been tremendously successful at many language tasks, without worrying about the plausibility of their semantic representations. For many applications, this engineering-driven approach will be sufficient. In other cases, we see strong potential for improving NLP systems by taking a more psychological approach to word meaning. That is, although NLP models keep getting more successful, there may be limits on their performance in comprehending and producing language that could be overcome with representations that are more like those people have.

Coverage of this article

We begin by briefly reviewing theories of word meaning from the psychological literature (Section 2) and introducing desiderata for models of psychological semantics (Section 3). We then cover NLP approaches to learning word representations, starting with those initially constructed by psychologists and then moving to more recent NLP models in machine learning that learn both word and sentence representations (Section 4). Our main question is whether NLP systems are likely to be successful in representing human semantic knowledge. We argue that current means of representing words are useful for modeling word similarity, although the details don't always align with human semantic similarity (Section 6). These word representations, however, are not adequate to support the flexible behaviors for which people rely on their semantic representations. We discuss five such classes of human behavior and the challenges they present for models of word meaning (Section 7). We end by discussing the implications of building more sophisticated models of psychological semantics, both for understanding the mind and for advancing NLP (Section 8).

2 Semantics in Cognitive Science

Before discussing word meaning in NLP, we will first take a brief tour of the approaches to semantics in linguistics and psychology in order to understand what it is that such theories must explain. Within most of philosophy and linguistics, semantics is referential. That is, linguistic meaning is analyzed as a relationship between words and the world, and sentence meaning describes a state of affairs that can be mapped to situations in the world (e.g., Chierchia & McConnell-Ginet, 1990). For example, a word like dog has a meaning that allows you to pick out all the dogs in the world. To a first approximation, the meaning is the set of all such dogs, and if you use the word to refer to a member of that set, you are using it correctly (and literally). People who fully understand the meaning of dog would name all and only dogs with the word (excepting uninteresting cases such as not fully seeing the object, being incapacitated in some way, etc.). A problem with this view, however, is that dogs are coming into and going out of existence at a rapid rate. Many thousands of dogs are born and die every day. Thus, the set of dogs is constantly changing from moment to moment. That is not a very stable basis for a word meaning, which doesn't intuitively seem to be changing at all. Indeed, an implication of a simple referential theory would be that the meaning of dog is today completely disjoint from what it was 30 years ago, assuming that no member of that set of dogs is still with us today. That does not seem correct.

For this reason, formal linguists have developed more complicated analyses of meaning, such as claiming that dog picks out sets of objects in an infinite number of possible worlds (see Dowty et al., 1981). These worlds are keyed to time and context, such that the extension of dog varies depending on the circumstances. Such a conception also allows us to refer to dogs within possible worlds that do not actually exist, such as hypothetical ones ("If there were no cats, dogs would still find something to chase.") or fictional ones. Hypothetical and fictional situations are simply more possible worlds, which speakers may refer to.

Why do (some) linguists insist on these kinds of analyses, which often result in seemingly unilluminating statements such as "The meaning of dog is the denotation [[dog]]"? The reason is an extremely powerful one, namely that the utility of language is in its ability to provide information about actual things in the world—to draw our attention to those things, learn about them, and then to take action on them (see Chierchia & McConnell-Ginet, 1990, Ch. 1). Language is not a parlor game in which we only utter formulas of statement and response. Language presumably evolved because of situations in which people can say things like, "Look for blueberries on the other side of the hill," or "Watch out for that car!" or "I love you and want to spend my life with you." The significance of such statements lies in their ability to communicate life-saving and life-improving information. Information must relate to the world if it is to be helpful. Talking about blueberries is only useful if doing so actually directs us to a particular kind of edible fruit; warning about a car is only helpful if the hearer then looks out for a car, rather than for blueberries. If word meanings did not relate to our world, they would not be helpful.

The fact that language refers to the world seems indisputable, but exactly how to capture that relation is not as clear. For psychologists, the problem with the referential approach to meaning is that possible-world semantics cannot be something that humans do in their heads. Even if a speaker is accurate in her use of the word dog, she cannot keep the entire set of the current world's dogs in her head, much less the sets of dogs in all of the infinite number of possible worlds. Philosophers since Frege have talked about another aspect of meaning besides the denotative aspect: intensions. The intension of a word is its "mode of presentation," as Frege (1892/1960) referred to it, which can be understood as a way of thinking about the word. For many psychologists, this has been interpreted as a kind of mental description. One cannot keep a representation of every dog in the world in her head, but she can keep a description of what dogs are like, which then enables her to apply the word to them. One knows that dogs have four legs, teeth, and fur, weigh 5 to 75 pounds, like to chase smaller animals and cars, bark, eat scraps that fall on the floor, and so on. One has detailed knowledge of the faces, colors, proportions, and sounds of dogs, along with memories of individual dogs. Although one might argue that some of this knowledge is not strictly part of the meaning of dog (e.g., that dogs are likely to snap up food that falls on the floor), such knowledge allows one to pick out many (though likely not all) of the dogs in the world under variable circumstances. Thus, this mental description of the word allows people to refer to things in the world and to communicate with other people who have similar descriptions associated with the word.

Language directly connects to our knowledge of the world, as attempts to make realistic language processing systems discovered early on (Schank & Abelson, 1977). For example, imagine the following conversation.

Marjorie: I can’t come to the reception after the talk, because of Fred.
Todd: Fred?
Marjorie: Fred is my dog.

The mere statement that Fred is Marjorie's dog explains many things. We realize that dogs require frequent maintenance. They must be fed regularly, and they usually must be let outside multiple times a day. Furthermore, dogs are social creatures and do not take well to being left alone for very long stretches of time. None of this is actually said in that conversation, yet Marjorie expects that Todd will understand at least some of these things and therefore infer that she can't go to the reception because of her need to let Fred out, feed him, and so on. That is, merely by saying the word dog, Marjorie allows Todd to access his knowledge of dogs and their properties, which he can then use to make necessary inferences to understand her explanation. If language did not have this property, then conversations would be inordinately long and laborious. Although these facts about dogs are probably not all part of the meaning of dog, word meaning must connect to these facts and, more generally, to our knowledge of the world, which then allows us to draw necessary inferences and learn new facts.

In psychology, the predominant approach to word meaning is that it is a mapping of words onto conceptual structure (see Murphy, 2002, ch. 11, for a review). That is, people have concepts that are the building blocks of their world knowledge, and the meaning of a word is essentially a pointer to some subpart of that knowledge. Concepts make this connection between language and the world, as argued for in philosophical and linguistic approaches to semantics, but in a psychologically plausible way. That is, when someone tells you, "Watch out for the car," the word car activates a concept, which contains information about what cars look like, enabling you to identify the object you're supposed to watch out for. The concept also contains information about why cars might be dangerous, what they do, and where they go. So, one plausible response to this warning would be to jump back onto the sidewalk, since you know that cars almost always drive on the road.

In summary, the conceptual approach to word meaning has two advantages as a psychological explanation over purely referential approaches. First, we know that people have concepts and knowledge of their world, so this is a plausible psychological representation, unlike infinite sets of objects in possible worlds. Second, in this account, words are connected to the world, because words are connected to concepts, which are in turn connected to the world through perceptual and motor mechanisms. We use concepts to classify and think about objects in the world even when we are not talking about them. When we see a dog, that activates various perceptual representations that eventually activate our concept of dogs, which could then result in a verbal remark, like "There's the dog," or "I didn't know you have a dog," or an action, such as petting the animal. That is, use of the word dog is causally connected to the presence of an actual dog. (The exact nature of this causal connection is a matter of debate among philosophers.) Of course, speakers and writers can discuss objects and situations that are not currently present, but the words used gain their meaning in part through their connections to concepts that are in fact linked to the world. When a speaker gives you new information through language, that changes your representation of the world and will potentially be useful to you later on.

There is considerable agreement regarding the conceptual approach among researchers of word learning in particular, starting with seminal proposals of word learning by Eve Clark (1983) and Susan Carey (1978). Indeed, publications often alternate between talking about word learning and concept learning as if they were the same thing. And, of course, they often are. Perhaps your child can correctly label cows when you go to visit a farm. However, you might also correct the child in some cases by saying, "No, that's not a cow, it's a goat." By introducing a new word, goat, you are encouraging your child to note the differences between that referent and the cows: smaller size, different proportions, longer head, possibly a beard, and so on, thereby forming a new concept. If you had never mentioned the word goat, your child might not have formed the concept of goats and might have continued to include them in a broader category of cows. The requirement to use words the way adults do can act as a stimulus for children to distinguish objects in the world a certain way (Mervis, 1987). That is, word learning can influence one's concepts, because word learning involves learning what kinds of things there are in the world.

3 Desiderata for a model of psychological semantics

We have discussed the theoretical basis for this view of meaning in some detail, because it is exactly this connection to world knowledge, the world, and language use that we argue is largely missing in current NLP systems. In order to be an adequate theory of psychological semantics, a proposed representation must provide the basis for carrying out a number of flexible behaviors—physical, verbal, and mental—that rely on conceptual representation as summarized in the list of five desiderata in Table 1. We provide short examples of the entries here; the remainder of our article goes into further detail. We should emphasize that these desiderata are by no means exhaustive. Most of them relate to the most basic functions of language use that have been studied extensively in cognitive science. A model that can accomplish all those things might still be a long way from talking and understanding as people do, but a model that cannot do one or more of these things would not understand words as people do.

Imagine you are at the dinner table with family. You might look at the table and say, “That knife is dirty.” To produce this description (Table 1; #1), you clearly must recognize some key objects—place settings, silverware, residual food crust—along with their relations. You then drew attention to a property of one of them, in particular, that one knife is unsuitable for use. Someone listening might replace the offending knife with a new knife, thereby carrying out your (implicit) instruction (#5). That person had to figure out what kind of thing you had in mind by saying “knife” and get that kind of thing instead of a spoon, plate, or other handy object.

If you don’t care for the kind of knife someone hands you, you might say, “I was hoping for a butter knife.” When you said that sentence, you had formed a goal of what kind of thing you wanted and then translated that idea into the English phrase “butter knife” (#2) so that people would know what to get. So, even with this simple interchange, we can see that words can be activated by perceptual input and by mental representations, and that they in turn can cause people to form a particular representation and to take action in the world. Your listener must understand the conceptual combination “butter knife” as indicating a shorter, blunt knife specifically made for butter or cheese (#3). Such phrases are constructed on the fly in English, and speakers make up and understand novel ones in everyday use. Finally, if you choose to, you might tell the children at the table, “Forks go on the left, knives on the right,” which is not a description of that particular table but rather information about knives and forks that you hope (vainly) the children will add to their store of knowledge about the world (#4). If you are successful, this could some day result in the children placing forks on the left (i.e., changing the future world) when they set the table.

Table 1: Semantic representations support these basic functions of language use. Behaviors to be explained, with examples:

1. Describe a perceptually present scenario, or understand such a description. ("That knife is in the wrong place." "The orangutan is using a makeshift umbrella.")
2. Choose words on the basis of internal desires, goals, or plans. ("I am looking for a knife to cut the butter." "Book a flight from New York to Miami.")
3. Produce and understand novel conceptual combinations. ("That's a real apartment dog." "The apple train left the orchard.")
4. Change one's beliefs about the world based on linguistic input. ("Sharks are fish but dolphins are mammals." "Umbrellas fail in winds over 18 knots.")
5. Respond to instructions appropriately. ("Pick up the knife carefully." "Find an object that is not the small ball.")

4 Computational Approaches to Word Meaning

This now leads us to consider some typical examples of computational approaches to meaning, which will contrast greatly with what we have outlined above. The words semantics and meaning do not belong to anyone; there is no law saying that researchers in one field must use the words in the way another field dictates. Thus, when we point out these differences, we are not criticizing one or the other field for not conforming to psychological or linguistic usage. However, it is important to see what those differences are, so that there will not be confusion about which problems in “semantics” have been solved when the term is used differently by different researchers. If one proposes a theory of psychological semantics, it must be the kind of theory that can do at least the things we described above: Explain which words people choose when speaking and how they understand words when listening, and provide a basis for language changing our knowledge about the world. Here we focus on approaches to word meaning that derive meanings from text relations, which have been described as theories of semantics by some of their proponents.

4.1 Word Representations

A classic approach to word meaning is distributional semantics, the idea that words have similar meanings if they have similar patterns of usage and co-occurrence with other words (Harris, 1954; Firth, 1957). An influential text-based model based on these principles was the early work of Thomas Landauer and his colleagues in forming Latent Semantic Analysis (LSA) (Landauer & Dumais, 1997) and the related system HAL (Hyperspace Analogue to Language; Lund & Burgess, 1996). We will focus here on LSA, which was developed much more extensively. Although such models are no longer state-of-the-art, LSA is still commonly used in psycholinguistic research as a measure of relatedness of words or texts. (A Google Scholar search shows that the original Landauer & Dumais (1997) paper was cited about 428 times in 2019. The citations seem to be both discussions of computational theories of meaning and papers that use LSA for evaluating experimental materials. Thus, although LSA is quite old by NLP standards, it is still influential and used in practice.) For example, in a priming experiment, you might try to ensure that the relatedness of primes and targets in different conditions is the same by showing that their similarities in LSA are about equal. LSA is also of interest because some of its proponents specifically argued that it is a model of human knowledge or word meaning (Landauer, 2007), and models using similar techniques continue to be tested against human psychological data (e.g., Mandera et al., 2017). Such tests seem to imply that LSA and similar techniques might be a possible model of semantics.

The original LSA model was trained on a corpus of 4.6 million words (Landauer & Dumais, 1997, p. 218). To fit LSA, the corpus is divided into sections, called "documents." Those sections might be articles (e.g., encyclopedia entries, newspaper stories) or simply paragraphs of a longer work. The system tags whether each word occurs in each document, forming a large matrix of words by documents. This matrix is reduced by singular value decomposition (SVD) to produce a low-dimensional vector for each word, called a "word embedding" (see Martin & Berry, 2007, for a detailed explanation). This process has the effect of giving related words similar vectors, because their patterns of co-occurrence were similar in the original matrix. In LSA, it is not merely each word's co-occurrence with other words that is important, but second-order co-occurrences, such that if two words both co-occur with the same other words, their LSA meanings (vectors) will be similar.
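The core pipeline (count matrix, truncated SVD, cosine comparison) can be sketched in a few lines. The word-by-document counts below are invented for illustration; real LSA uses millions of words and typically applies term weighting before the decomposition.

```python
import numpy as np

# Toy word-by-document count matrix (rows: words, columns: documents).
# Hypothetical data: two "pet" documents and two "vehicle" documents.
words = ["dog", "puppy", "car", "engine"]
X = np.array([
    [4, 3, 0, 0],   # "dog" occurs mostly in pet documents
    [2, 5, 0, 1],   # "puppy" has a similar profile to "dog"
    [0, 0, 6, 2],   # "car" occurs in vehicle documents
    [0, 1, 3, 4],   # "engine" has a similar profile to "car"
], dtype=float)

# Singular value decomposition, truncated to k dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
embeddings = U[:, :k] * s[:k]   # one k-dimensional vector per word

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words with similar co-occurrence patterns receive similar vectors.
print(cosine(embeddings[0], embeddings[1]))  # dog vs. puppy: high
print(cosine(embeddings[0], embeddings[2]))  # dog vs. car: low
```

Because "dog" and "puppy" occur in the same documents, their reduced vectors point in nearly the same direction even though the two words never need to co-occur directly, which is the second-order effect described above.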

Thus, according to LSA, word meaning is represented through its word embedding. The vector is not interpretable in and of itself but in terms of its relation to other words. Word similarity is measured by calculating the cosine or dot product of the two vectors. Crude sentence representations can be constructed by adding together the vectors of a sentence's component words, or through other operations analogous to predication (Kintsch, 2001). The system can be applied to various tests, such as choosing synonyms, completing a sentence, or evaluating whether a sentence is a good summary of a passage. Related probabilistic models have been applied to identifying "topics" in documents, such that words are represented by the topics they participate in (Blei et al., 2003; Griffiths et al., 2007). Finally, Landauer & Dumais (1997) showed that LSA could even score a "passing" grade on part of the TOEFL test of English for non-native speakers.

Models based on text co-occurrence were limited in their ability to identify semantic relations beyond similarity (see below). However, a research program arose to augment such models so that they could serve as the basis for identifying superordinates, synonyms, part-whole relations, and the like. One technique was to start with labeled corpora, in which the part of speech of each word was identified, allowing better identification of relational terms vs. substantive terms. Another technique was to look for specific kinds of patterns that linguistic analysis suggests would indicate a given relation, for example, noun-verb-noun phrases, adjective-noun pairings, possessives, prepositional phrases, and so on (Baroni et al., 2010; Baroni, Bernardi, & Zamparelli, 2014). Such approaches were fairly successful in identifying specific lexical relations, but they required linguistic sophistication and specific analyses to identify each relation of interest, and so the emphasis in the field seems to have returned to less directed models that rely on much more intensive computation, as we describe next.

After the development of LSA and related models, a different approach arose for deriving meaning from text sources (Mikolov et al., 2013; Pennington et al., 2014). To distinguish these classes of methods, Baroni, Dinu, & Kruszewski (2014) called the earlier approach count models (as they rely on co-occurrence counts of words) and the newer approach predictive models. As the name suggests, these models attempt to learn word representations by trying to predict the probability of a missing word given its context (alternatively, skip-gram models predict the surrounding context given a word, which we don't discuss here). For example, say the prompt is, "Chris bit into the juicy ___ and placed it on the kitchen counter." A plausible guess would be some kind of food, possibly a fruit like plum or orange.

Figure 1: Model architectures for CBOW (A), RNN (B), and BERT (C). Models (A) and (C) predict a missing word ("toward") given its context ("She swims [MASK] the bank"), while (B) predicts each word in the sentence given the previous words. Light blue boxes indicate word embeddings (vectors), and dark blue boxes indicate hidden embeddings (also vectors) after incorporating context. The hollow arrows in (C) are residual connections.

A popular predictive model is Continuous Bag-of-Words (CBOW; Mikolov et al., 2013), which is illustrated in Figure 1A. CBOW has been trained on tremendous corpora; for instance, in this article, we analyze a large-scale CBOW model trained on the Common Crawl corpus of 630 billion words. CBOW learns a word embedding for each word in the corpus (Figure 1A; light blue boxes), which are the analogs of the LSA word embeddings. CBOW takes a context window and computes the average embedding, and then compares this average vector to possible output words using the equations

\bar{h} = \frac{1}{2m} \sum_{-m \le j \le m,\; j \neq 0} v_{w_{t+j}}, \qquad P(w_t = i \mid \text{context}) = \frac{\exp(u_i \cdot \bar{h})}{\sum_{k} \exp(u_k \cdot \bar{h})}.

The d-dimensional embedding for each candidate output word i is u_i, and the embedding for each word in the context window is v_{w_{t+j}}. In essence, the contextual window is summarized by the average of the word embeddings (Figure 1A; dark blue box). Then, the similarities are computed between each of the candidate words and the contextual summary (via dot product), and a softmax function normalizes these similarities into the probabilities used for prediction. The word embeddings are the main trainable CBOW model parameters (light blue boxes), and they are learned via gradient ascent to maximize the (approximate) log-probability of the masked word.
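A minimal numpy sketch of this forward pass follows. The vocabulary, dimensionality, and randomly initialized (untrained) embeddings are illustrative only; a real CBOW model learns these parameters from a corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["she", "swims", "toward", "the", "bank"]
d = 8  # embedding dimensionality (small for illustration)

# Context embeddings v and output embeddings u; trainable in practice.
V = rng.normal(size=(len(vocab), d))   # context word embeddings
U = rng.normal(size=(len(vocab), d))   # candidate output embeddings

def cbow_predict(context_ids):
    """Probability of each vocabulary word filling the masked slot."""
    h = V[context_ids].mean(axis=0)        # average context embedding
    scores = U @ h                         # dot-product similarity per word
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()

# Predict the masked word in "she swims [MASK] the bank".
context = [0, 1, 3, 4]
probs = cbow_predict(context)
```

Training would adjust V and U by gradient ascent so that the probability assigned to the true missing word ("toward") increases.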

Averaging (or summing) word embeddings has also been studied as a means of composition. At a phrasal level, Mikolov et al. (2013) added together word embeddings to construct phrase representations. For example, "French" + "actress" resulted in a vector most similar to "Juliette Binoche" (a prominent French actress); "Vietnam" + "capital" resulted in "Hanoi." In other work, Baroni & Zamparelli (2010) studied adjective-noun compounds using LSA embeddings ("bad luck" or "important route"), finding that matrix multiplication was better than additive models at reconstructing the representation of adjective-noun phrases (with the adjective as a matrix and the noun as a vector). It is remarkable that useful phrase representations can be constructed in such simple ways, but building sentence representations is more complicated, as we will see next.

4.2 Sentence representations

Sentences specify particular relations among the entities and actions they describe, and comprehension requires recovery of those relations. The angry dog bit a sleeping snake does not mean the same as A sleeping dog bit the angry snake. Jumbled words do not result in any semantic representation, though speakers may be able to figure out what meaning might have possibly been intended: sleeping angry bit snake dog the a. However, if one derives sentence meaning by adding together the vectors of the words in a sentence, one will arrive at the identical representation for all of the above examples, sensible and nonsensical. A model aiming to build sentence representations from word representations would need to have a model of syntax and sentential semantics in order to combine the words to form propositions—an extraordinarily difficult problem. Computing sentence representations by summing word embeddings, such as the word embeddings learned by LSA or CBOW, is a non-starter for capturing the full richness of sentence meaning.
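This order-invariance is easy to demonstrate: summing hypothetical embeddings for the three word orders above produces exactly the same vector.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical embeddings; any fixed vectors make the point.
emb = {w: rng.normal(size=4) for w in
       ["the", "angry", "dog", "bit", "a", "sleeping", "snake"]}

def bow(sentence):
    """Bag-of-words sentence vector: sum of the word embeddings."""
    return sum(emb[w] for w in sentence.split())

s1 = bow("the angry dog bit a sleeping snake")
s2 = bow("a sleeping dog bit the angry snake")
s3 = bow("sleeping angry bit snake dog the a")

# The three representations are identical: addition is commutative,
# so who bit whom (or whether the string is grammatical) is lost.
```

Because addition is commutative, any model built on summed embeddings cannot distinguish these readings, which is exactly the failure described in the paragraph above.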

More sophisticated predictive models—known as language models—build sentence representations using neural networks (e.g., Elman, 1990; Devlin et al., 2019; Radford et al., 2019). As with CBOW, language models learn representations that are useful for predicting missing words given their surrounding context (Figure 1B and C). Although basic CBOW discards word order (see Mikolov et al., 2018, for an extension that uses it), language models use word order to learn meaningful syntactic and semantic structure, to some degree.

Language models are more computationally intensive to train than mere word representations. To jumpstart learning, language models can be initialized with the pre-trained word embeddings from a simpler model (CBOW), which make up the first layer of the language model (Figure 1B and C; light blue boxes). During training, these word embeddings are fine-tuned along with all of the other downstream parameters. With enough data and a sufficient network capacity, the hope is that a model trained to predict missing words will learn syntactic and semantic knowledge about language—at least enough to solve practical NLP problems.

In pioneering work, Jeffrey Elman (1990) showed that Recurrent Neural Networks (RNNs) can learn meaningful linguistic structure when trained to predict the next word in a sequence (known as autoregressive modeling). As shown in Figure 1B, RNNs achieve recurrence by using the previous hidden vectors as additional input when predicting the next word. Through this mechanism, the hidden representation of each word is influenced by the representations of previous words. (The hidden representations are the dark blue boxes above each word in Figure 1B. An RNN with two layers is shown.) Elman showed that RNNs trained on simple artificial sentences can show emergent lexical classes—implicit in how the word embeddings cluster—such as nouns, transitive verbs, and intransitive verbs. Subsequent work introduced RNNs with more sophisticated gating and memory mechanisms, such as Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) or Gated Recurrent Units (Cho et al., 2014), allowing networks to store and retrieve information over longer time scales. For RNN-based language models, sentences can be summarized and passed to downstream processing through a variety of methods: extracting the last time step’s hidden vector, computing a simple average over hidden vectors across all time steps, or computing a weighted average over hidden vectors using weights determined on-the-fly by a downstream process (known as attention; Bahdanau et al., 2015).
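The recurrence and the three summarization strategies can be sketched in a few lines. This is a pure-Python toy with made-up weights, not Elman's exact model: a real RNN learns its weight matrices, and modern variants use the gated units mentioned above.

```python
import math

def rnn_step(x, h_prev, W_x, W_h):
    """One Elman-style step: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    return [math.tanh(sum(w * xi for w, xi in zip(row_x, x)) +
                      sum(w * hi for w, hi in zip(row_h, h_prev)))
            for row_x, row_h in zip(W_x, W_h)]

def summarize(hiddens, mode="last", weights=None):
    """Collapse per-word hidden vectors into one sentence vector."""
    if mode == "last":            # last time step's hidden vector
        return hiddens[-1]
    if mode == "mean":            # simple average over all time steps
        return [sum(dim) / len(hiddens) for dim in zip(*hiddens)]
    if mode == "attention":       # weighted average; in a real model the
        total = sum(weights)      # weights come from a learned process
        return [sum(w * d for w, d in zip(weights, dim)) / total
                for dim in zip(*hiddens)]

# Run a 3-word "sentence" of 2-d embeddings through a 2-unit RNN.
W_x = [[0.5, -0.3], [0.8, 0.2]]   # made-up weights for illustration
W_h = [[0.1, 0.4], [-0.2, 0.3]]
words = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h, hiddens = [0.0, 0.0], []
for x in words:
    h = rnn_step(x, h, W_x, W_h)  # each h depends on all earlier words
    hiddens.append(h)

sentence_vec = summarize(hiddens, mode="attention", weights=[0.2, 0.3, 0.5])
assert len(sentence_vec) == 2
```

Note how context enters only through the single recurrent pathway: information about the first word reaches the last hidden state only by surviving every intermediate step.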

A new architecture, the Transformer, has started to dominate the leaderboards for language modeling and other NLP tasks (Vaswani et al., 2017). A Transformer architecture is shown in Figure 1C. Transformers are neural networks that operate on sets: a transformer layer takes a set of isolated embeddings as input (Figure 1C; light blue) and produces a set of contextually-informed hidden embeddings (dark blue). Residual connections help preserve the identity of these word representations as they flow through each layer, meaning that each input element (word embedding) corresponds more strongly to one element in the output set (the transformed word embedding). As with RNNs, transformer layers can be stacked for deeper contextualization (Figure 1C shows two layers). Unlike RNNs, Transformers use a “self-attention” mechanism that facilitates direct interaction between each element in the set and all other elements, without relying on an indirect recurrent pathway.
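The core computation can be sketched as a toy scaled dot-product self-attention with a residual connection. A real Transformer layer additionally applies learned query/key/value projections, multiple attention heads, layer normalization, and feed-forward sublayers; this minimal version only shows how every element directly mixes with every other element.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_layer(embeddings):
    """Each output mixes every input embedding, weighted by dot-product
    similarity; the residual term keeps each word tied to its own input."""
    dim = len(embeddings[0])
    outputs = []
    for query in embeddings:
        # Direct interaction with ALL elements of the set, no recurrence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in embeddings]
        weights = softmax(scores)
        mixed = [sum(w * vec[d] for w, vec in zip(weights, embeddings))
                 for d in range(dim)]
        outputs.append([q + m for q, m in zip(query, mixed)])  # residual
    return outputs

# Two stacked layers, as in Figure 1C (toy 2-d embeddings).
layer1 = self_attention_layer([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
layer2 = self_attention_layer(layer1)
assert len(layer2) == 3 and len(layer2[0]) == 2
```

The residual addition is what makes each output embedding correspond most strongly to one input embedding, as described above.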

The isolated input embeddings aim, as with CBOW, to represent the meaning of words in isolation (Figure 1C; light blue).² Note that while our discussion uses the term “word embeddings” for simplicity, more complex tokenizations based on pieces of words are typically used in large-scale systems (e.g., Sennrich et al., 2016). As a first step, these input word embeddings are concatenated with positional information to mark word order. A homonym like “bank” must, in some sense, have an isolated word embedding that captures multiple meanings of the word (Figure 1C; light blue box above “bank”). Through each transformer layer, the word embeddings are then updated based on the other words in the sentence: when presented in context, such as “She swims toward the bank”, the hidden representation should resolve to mean river bank rather than financial institution (Figure 1C; dark blue boxes above “bank”). We examine this type of contextual resolution in more detail later in the paper.
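For concreteness, one standard way to inject order information is the sinusoidal position code of Vaswani et al. (2017), sketched below. In that architecture the code is added to each word embedding; other systems, such as BERT, learn position embeddings instead, so treat this as one illustrative option rather than the universal mechanism.

```python
import math

def positional_code(pos, dim):
    """Sinusoidal position code: even dims use sin, odd dims use cos,
    with wavelengths that grow geometrically across dimensions."""
    code = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        code.append(math.sin(angle))
        if i + 1 < dim:
            code.append(math.cos(angle))
    return code

def add_position(embedding, pos):
    """Combine a word embedding with its position in the sentence."""
    return [e + p
            for e, p in zip(embedding, positional_code(pos, len(embedding)))]

# The same word embedding at different positions becomes distinguishable.
bank = [0.2, 0.7, 0.1, 0.4]   # made-up isolated embedding of "bank"
assert add_position(bank, 1) != add_position(bank, 5)
```

Without such a code, self-attention treats the sentence as an unordered set and could not distinguish the two dog-and-snake sentences discussed earlier.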

As with RNN language models, Transformers can be trained as autoregressive models that predict the next word in a sequence, leading to networks that can seamlessly generate text but incorporate context in only a unidirectional manner (left-to-right in English). We examine GPT-2 in this paper, a massive autoregressive Transformer with 1.5 billion parameters (Radford et al., 2019), trained on a corpus of 45 million web pages linked from Reddit. (The new, much larger GPT-3 from Brown et al., 2020 was not available for evaluation at the time of writing. We consider the implications of GPT-3 in the General Discussion.) Alternatively, Transformer-based language models can be trained on the Cloze task of predicting randomly masked words given their bidirectional context (as shown in Figure 1C). We examine BERT, a popular language model with 340 million parameters, trained on a corpus of 3.3 billion words that combines Wikipedia and a corpus of books (Devlin et al., 2019).

4.3 NLP as a theory of semantics

These large-scale neural networks have been remarkably successful in NLP. They certainly do things that could only have been dreamed of 25 years ago, and they provide help in many tasks such as translation, summarization, question answering, and natural language inference. They have limitations (see below), but they are works in progress in a dynamically changing field and will continue to improve. However, what is their status as a theory of psychological semantics?

The driving force in NLP is the development of more powerful models that accomplish specific tasks rather than hypotheses about semantics. In most cases, NLP papers do not make claims about the relation of their models to psychology or linguistics (there are exceptions, e.g., Baroni et al., 2010). Rather, most NLP papers are motivated by applications. Interestingly, the people who have been most likely to claim that these models could be theories of psychological semantics seem to be psychologists. For example, Landauer (2007) explicitly argues that LSA provides a theory of semantics (see Kintsch, 2007, for a more nuanced approach).³ Early papers on HAL were more conservative, focusing more on issues such as “capturing information about word meanings” (Lund & Burgess, 1996, p. 206). Others have tested the ability of such models to explain human data, which seems to give credence to the idea that they are psychological models (e.g., Baroni, Dinu, & Kruszewski, 2014; Mandera et al., 2017; Louwerse, 2007; Marelli et al., 2017). That is, the models are tested in the same way one would test a theory of lexical representation. Lewis et al. (2019) argued that some semantic knowledge derives from associative learning based on the statistical structure of language rather than instruction and inference, using NLP data to bolster their claim (see Kim et al., 2019, for a response). Whether or not modelers intend their models as psychological accounts, we believe it is important to explicitly outline the challenges of interpreting all these models as psychological theories. (Earlier critiques of this approach within psychology are discussed in Section 7.6.)

To work up to our argument, we first discuss older computational theories that have also been proposed as representations of psychological semantics and whose shortcomings are well known.

5 Early Theories of Psychological Semantics

Perhaps the first computational theory of meaning in psychology was provided by Charles Osgood and his colleagues (Osgood et al., 1957). Working within a behaviorist framework, Osgood did not have a vocabulary to talk about mental representations as later researchers would. Therefore, he attempted to operationalize semantics in terms of behavioral measurements, in particular, rating words on adjectival dimensions like fast-slow and happy-sad. Osgood did this for 50 different scales and then submitted the results to a factor analysis, which reduced the data to three orthogonal scales, which he called evaluative (good-bad), potency (strong-weak), and activity (tense-relaxed). Words with similar values on these scales behaved similarly in certain tests.

Other approaches followed some years later, when new techniques of psychological scaling were invented. The creation of multi-dimensional scaling and clustering algorithms allowed researchers to represent the similarity of stimuli in comprehensible terms (Shepard, 1974). Within semantic memory research, Rips et al. (1973) famously scaled the names of mammals and birds (separately) and showed that these scaling solutions helped to predict categorization difficulty. First, people rated the similarity of all the pairs of stimuli. These data were then combined and reduced into a low-dimensional spatial representation that simultaneously represented the similarities of all the items at once. The distance between an item and its category name (e.g., bear-mammal, penguin-bird) predicted how long it took subjects to classify items in a sentence evaluation task (e.g., true or false: “All bears are mammals”; “All birds are penguins”). Such scaling solutions can be seen as semantic representations.

Scaling solutions are useful for predicting behaviors that require comparing items to one another (e.g., classification, memory confusions), because these models represent the overall similarity of the scaled items. However, scaling solutions and Osgood’s proposal suffer from the same problem, namely that they do not include the critical information that people must know in order to actually use those words in normal language activities. For example, in the scaling solution for mammals, Rips et al. noted that their solution seemed to indicate two main dimensions: size of the animal and its predacity (i.e., was it a predator or prey?). However, those dimensions do not specifically pick out particular mammals well enough to identify them. That is, which mammal is a mid-sized moderately predacious one? Which is large but very much prey? People know hundreds of animals. In order to know when to use names like sheep, goat, or cow, language users must know what they look like, what they eat, where they live, how they move around, and many other facts. If you see a drawing of a lone sheep, you can immediately label it “sheep,” without the drawing indicating the animal’s size and without seeing it being preyed on. Osgood’s three dimensions also simply don’t tell us what the word means. You could know a word’s evaluation, potency, and activity to three decimal places, but you still wouldn’t know whether the word is a noun or a verb, concrete or abstract, or what semantic domain it was in.

Scaling solutions of this sort are good at representing the similarity of various concepts or words to one another, but they simply do not contain the body of knowledge people have about those things that controls the use of those words. Furthermore, the dimensions discovered in the scaling solutions are typically ones that help to distinguish the stimuli as a whole but often do not include information that is essential to understanding a specific item. Distinguishing sheep from goats might require knowledge of the specific bodily shapes, proportions, and parts of the two creatures, most of which is missing from the low-dimensional space. Scaling solutions do not address the primary desiderata of a theory of semantics (Table 1): describing a perceptually present scenario, explaining what listeners understand when they hear that description, or choosing words based on desires and goals. If you are thinking that you hope to see goats at the farm, perhaps generating a mental image of what you hope to see, that would not match the information in the multidimensional scaling or in Osgood’s dimensions sufficiently to pick out the word “goat” instead of the names of other farm animals.

The low-dimensional scaling solutions are precursors to LSA embeddings and other NLP techniques developed since the 1990s, facilitated by the availability of large text corpora and more powerful computers. Scaling solutions based on human similarity judgments can be laborious to produce, especially for a large number of items; the similarity matrix for N items requires N*(N-1)/2 entries, each of which is an average of human ratings. LSA skips the tedious step of collecting lexical judgments and instead assumes that much of this information can be acquired from the co-occurrence patterns of words in text corpora. We do not question this assumption; as discussed above, LSA has been successful in many ways and has opened the door to more powerful NLP techniques. A vector of 400 values (or larger in recent models) can certainly contain much information. Our central question, however, is whether modern NLP models provide an account of a language’s semantics, as the question is understood in either linguistics or psychology. Do these modern approaches go far enough in closing the gap between early scaling methods and our desiderata for a theory of semantics?
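The bookkeeping alone shows why human scaling is laborious: the number of required judgments grows quadratically with the number of items. A quick check, using only Python's standard library:

```python
from itertools import combinations

def n_pairwise_judgments(n_items):
    """Number of unordered pairs that must each be rated: N*(N-1)/2."""
    return len(list(combinations(range(n_items), 2)))

assert n_pairwise_judgments(10) == 45        # 10*9/2: manageable by hand
assert n_pairwise_judgments(1000) == 499500  # 1000*999/2: impractical to collect
```

At a thousand items, each of the half-million entries would itself be an average over multiple raters, which is exactly the cost that corpus-based methods avoid.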

6 Semantic similarity

A basic method for analyzing word embeddings is to look at their nearest neighbors, as reported in a number of articles by model proponents (see below). Word embeddings are the semantic representations that control word use and understanding in these models. Therefore, it is important that similar words have similar vectors, or else the model does not accurately understand the words. It is essential to understand here that “similar words” refers to words that share semantic properties and not some other kind of semantic relation (Hill et al., 2015). That is, the words should be from the same semantic domain and hopefully from the same category, with overlapping features. Words that are merely associated or that have some other kind of relation, like part-whole or object-attribute, are generally not semantically similar and so are not used in the same way. For example, one of us has trouble distinguishing SUVs from mini-vans and is apt to apply one name to the other kind of vehicle. But he still has a general idea of what these words refer to and can usually be understood even when he makes a mistake with one of them. However, if he instead confused “SUV” with “wheel,” one would have to say that he was very confused about at least one of those words—even though wheels are part of SUVs and have an obvious semantic relationship. SUVs have wheels but are not similar to wheels. A number of models have problems with just this kind of confusion, although we will show that more recent NLP models seem to do much better.

Consider the nearest neighbors of dog reported by Dennis (2007) in the LSA Handbook: barked, dogs, wagging, collie, leash, barking, lassie, kennel, and wag. (Readers may easily test the model with their own words at the LSA website.)⁴ Of the nearest neighbors, one is an inflected form of dog, four are actions, two are associated things, and two are subordinates. We find this list problematic. The subordinates (collie and lassie) are clearly similar in meaning to dog. However, the actions are not. Actions are from an entirely different semantic domain with different semantic properties, such as whether they are extended in time or punctate, which do not apply to objects. Barking and wagging are certainly actions that dogs do, but these actions should not have highly similar semantic representations to dogs. Similarly, the associated objects like leashes and kennels are obviously related to dogs, but a dog is not similar to a kennel. Dogs are animals that live and breathe and reproduce; kennels are human-made structures made of metal and wood. The properties of dogs are not properties of kennels, and vice versa. The words semantically similar to dog should have been names for other mid-sized, domesticated mammals, like cat, and other canines, like wolf and coyote. Superordinates of dog, like pet and mammal, are similar in meaning but are not in the list. See Lund & Burgess (1996) for similar issues with HAL’s neighbors.

LSA, like most NLP models, keeps inflectional and morphologically modified versions of words separate; that is, dog and dogs are two separate words, a consequence of analyzing text rather than linguistic entities such as morphemes and stems. The output doesn’t always correctly identify such words as being highly similar, however. Computed has a similarity value of only .35 to compute (cosine similarity; retrieved from the LSA website). The word saddle is more similar to horse than horses is (.91 vs. .83). This could be a matter of insufficient data to fully identify the representations of the less frequent form, but it could be that in fact compute and computed occur in slightly different contexts, and the model is correctly identifying that. The problem is that the two words are synonymous except for tense. Surely they are more similar in meaning than compute is to valuation and inventory, which are rated as more similar. Thus, if the LSA representation is “correct” in distinguishing these two word forms because they do not occur in the same contexts, then it is incorrect in claiming that its vectors represent semantic similarity.⁵ One way to resolve this issue is to recognize that these embeddings reflect many types of relations simultaneously, and that classifiers (or other downstream processing) are needed to distinguish how two words—which are similar via cosine—actually relate to one another (e.g., Roller et al., 2014). Still, this implies that LSA alone, and as typically used in psycholinguistics, is insufficient for representing semantic similarity.

More sophisticated models may organize their semantic representations differently, and thus we examined the nearest neighbors of more recent NLP models. We tested a CBOW system trained on a much larger corpus of 630 billion words (fastText implementation; Mikolov et al., 2018). The nine nearest neighbors of “dog” according to CBOW are as follows: dogs (0.85 cosine similarity), puppy (0.79), pup (0.77), canine (0.74), pet (0.73), doggie (0.73), beagle (0.72), dachshund (0.72), and cat (0.71).⁶ We used the CBOW implementation trained on Common Crawl provided by the fastText library (Mikolov et al., 2018). The nearest neighbors were also lightly filtered to exclude tokens with punctuation like “dog—”. The results seem to depend on the size of the corpus and model details. Mandera et al. (2017, p. 75) report nearest neighbors of the word elephant for a CBOW model trained on movie subtitles, and the results are much like the LSA results—a mixture of words related in various ways—unlike the CBOW results we report here. While LSA included actions and objects as close associates, CBOW does not; it only includes domesticated mammals and doesn’t stray into other semantic domains. The nearest neighbors strictly include inflectional and morphological variants (dogs and doggie), subordinates (beagle and dachshund), a superordinate (canine), and other close semantic associates (puppy, pup, pet, and cat). This CBOW model appears to be much more successful than LSA in computing semantic similarity—a result that aligns with past work (Baroni, Dinu, & Kruszewski, 2014)—although corpus size and selection may be a critical factor.

CBOW may even outperform more sophisticated language models on semantic similarity, since it focuses solely on learning word representations rather than sentence representations. Nevertheless, we examined the word embeddings of two large-scale language models, BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), as implemented in the Huggingface Transformers library.⁷ We used the largest available models, bert-large-uncased and gpt2-xl, from the Huggingface Transformers library. The nearest neighbors were lightly filtered to exclude repeated instances due to spacing differences, and tokens with punctuation and numbers. As with CBOW, we found semantically coherent neighbors. The word embeddings were extracted from the first layer of both models (the “embedding layer”; Figure 1C; light blue), before the self-attention layers that mix the word representations together. The nine nearest neighbors of “dog” according to BERT are dogs (0.67 cosine similarity), cat (0.44), horse (0.42), animal (0.38), canine (0.37), pig (0.37), puppy (0.37), bulldog (0.37), and hound (0.35). As with CBOW, all of these neighbors are from the same semantic domain. The details are not always exactly correct; surely canine and puppy are more semantically similar to dog than horse is; pig should not be as similar as bulldog is; and other canine animals are missing. However, the list manages to include only inflectional and morphological variants, superordinates, subordinates, and other animals. Similarly, the eleven nearest neighbors for “dog” according to GPT-2 are dogs (0.7), Dog (0.65), canine (0.54), Dogs (0.50), puppy (0.46), cat (0.38), animal (0.37), pet (0.37), horse (0.35), pup (0.35), and puppies (0.35). The list is similar to BERT’s except that GPT-2 is case sensitive. Again, the model may not be picking up some details, as horse appears before other canines.
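All of these neighbor lists rest on one operation: cosine similarity between embedding vectors, ranked over the vocabulary. A self-contained sketch is below; the four 3-d vectors are invented to mimic the qualitative pattern (similar animals close, associated objects far) and are not the models' actual embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product of u and v over their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbors(word, emb, k=3):
    """Rank every other vocabulary item by cosine similarity to `word`."""
    sims = [(other, cosine(emb[word], vec))
            for other, vec in emb.items() if other != word]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:k]

EMB = {                        # invented vectors for illustration only
    "dog":    [0.9, 0.8, 0.1],
    "puppy":  [0.85, 0.9, 0.15],
    "cat":    [0.7, 0.6, 0.3],
    "kennel": [0.2, 0.3, 0.9],  # related to dogs, but not similar
}
neighbors = nearest_neighbors("dog", EMB)
assert [w for w, _ in neighbors] == ["puppy", "cat", "kennel"]
```

The ranking, not the raw similarity values, is what the nearest-neighbor analyses in this section evaluate.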

The more powerful NLP approaches also seem to better capture inflectional and morphological variants. Unlike LSA, the stronger NLP models find that the most similar word to horse is its plural form horses (this pattern is also found for dog, dolphin, knife, bank, etc.). Similarly compute and computed are close neighbors in these models but not in LSA. Thus, these more recent models seem to have escaped some of the shortcomings of the earliest count models, although it still should be mentioned that saddle is more similar to horse than many other mammals are (e.g., Thoroughbreds, mule, donkeys, and greyhound for CBOW; colt, thoroughbred, zebra, and bull for BERT). Overall, the models are now in the right semantic ballpark but have not yet gotten the details right.

A more difficult test of semantic representation involves homonyms, like bank, that have multiple meanings, e.g., a river bank or a financial institution. Homonyms are a challenge for word embeddings: These words have multiple meanings but have only one embedding to represent them. (See Kintsch, 2007, for a good discussion of approaches to ambiguity in statistical models with a single meaning for each word.) Transformers, however, are not restricted to isolated word embeddings. They can incorporate context from much larger chunks of text, with the hope that the hidden embeddings for bank can be refined appropriately (Figure 1C; dark blue). Others have suggested that Transformers are well-suited to resolve the meaning of homonyms through context (McClelland et al., 2019), as any model of psychological semantics must be capable of doing. Here, we test some simple cases of homonym resolution with a Transformer.

To evaluate homonym resolution, the word “bank” is first presented in an ambiguous sentence that is consistent with either meaning, “She sees the bank.” The top-layer hidden embedding of “bank” is extracted (Figure 1C; top dark blue box above “bank”) and compared with the top-layer embeddings of related words (treasury, ATM, shore, and beach) given the same framing, e.g., “She sees the treasury” or “She sees the shore.” We make only targeted comparisons between particular embeddings rather than evaluating all of their nearest neighbors, since evaluating sentences is computationally expensive. The results of these targeted tests are summarized in Table 2. As with the isolated embeddings, BERT sees bank in the ambiguous context as more similar to the financial words treasury and ATM than to the words related to bodies of water, shore and beach. However, if the framing for bank instead suggests an aquatic meaning, “She swims toward the bank” (Table 2; row 2), the contextualized embedding for bank is now more similar to the embeddings for shore and beach than to those for treasury and ATM. Finally, if the framing for bank more clearly suggests the financial meaning—“She deposits it in the bank”—the words treasury and ATM are again the most similar to bank (Table 2; row 3). Thus, BERT seems promising as an architecture for resolving meaning given context, at least as measured through embedding similarity. It should also be noted that replacing the pronoun “She” in Table 2 with either “He” or “I” gave similar results, as did using the average top-level embedding (representing the full sentence) instead of extracting just one targeted embedding (for a specific word). For this test of bank disambiguation, we also found that GPT-2 behaved similarly to BERT.

Fillers for “She sees the       .”

                                  treasury   ATM    shore   beach
“She sees the bank.”                0.62     0.73   0.61    0.60
“She swims toward the bank.”        0.48     0.51   0.74    0.63
“She deposits it at the bank.”      0.55     0.66   0.51    0.53

Note: The two highest rated word completions in each row are in boldface.

Table 2: Cosine similarities between the top-level embedding of “bank” in context (rows) and underlined words in a neutral frame, “She sees the       ” (columns), in the BERT model.
Fillers for “She has the       .”

                                  advantage   win    iron   metal
“She has the lead.”                  0.55     0.59   0.54   0.54
“She lifted the lead.”               0.53     0.54   0.58   0.59
“She scored and took the lead.”      0.58     0.67   0.47   0.46

Note: The two highest rated word completions in each row are in boldface.

Table 3: Cosine similarities between the top-level embedding of “lead” in context (rows) and words in a neutral frame, “She has the       ” (columns), in BERT.

To further examine BERT’s abilities, we examined another homonym, “lead,” meaning an advantage versus a type of metal.⁸ Actually, this word is highly polysemous, covering six large pages in the OED. We focus on these two common senses of the word, testing it in sentence contexts that make clear it is a noun. When presented in an ambiguous context (“She has the lead.”), the embedding for lead defaults to a meaning more like advantage or win (Table 3; row 1). However, when presented in a context that evokes interacting with a physical object or substance (“She lifted the lead.”), the embedding for lead is now more similar to iron and metal than to advantage and win (Table 3; row 2). In a context that evokes sports (“She scored and took the lead.”), the embedding for lead is now more similar to advantage and win than it is to the other words (Table 3; row 3). Unlike BERT, we found that GPT-2 preferred the advantage interpretation in all cases. Taken together, these results suggest that BERT can resolve the meaning of homonyms given context, at least in the cases we examined.

We must note that this test is not extremely difficult, as it shows that the word’s representation in context shifts towards the correct meaning rather than actually representing the correct meaning in detail.⁹ A complete model would have to arrive at a specific representation of the word meaning in detail, e.g., a heavy, flexible metal, poisonous to consume, able to block radiation, etc., while rejecting incompatible properties from other noun and verb senses. Moreover, simpler models can also disambiguate some polysemous words through their surrounding words, either by computing the average embedding of the neighbors (Erk, 2012) or by inferring the latent topic (Griffiths et al., 2007). It’s notable, though, how natural the Transformer architecture is for disambiguating meanings, without using any auxiliary machinery or training.
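The simpler neighbor-averaging strategy can be sketched as follows, in the spirit of Erk (2012). All vectors here are hypothetical 2-d embeddings where, for illustration, the first dimension loosely tracks "finance" and the second "water"; no trained model is involved.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings: dimension 0 ~ "finance", dimension 1 ~ "water".
SENSES = {"bank/financial": [0.9, 0.1], "bank/river": [0.1, 0.9]}
EMB = {"she": [0.5, 0.5], "swims": [0.05, 0.95], "toward": [0.4, 0.6],
       "deposits": [0.95, 0.1], "it": [0.5, 0.5], "in": [0.5, 0.5]}

def disambiguate(context_words):
    """Average the context-word embeddings, then pick the closest sense."""
    avg = [sum(EMB[w][d] for w in context_words) / len(context_words)
           for d in range(2)]
    return max(SENSES, key=lambda sense: cosine(avg, SENSES[sense]))

assert disambiguate(["she", "swims", "toward"]) == "bank/river"
assert disambiguate(["she", "deposits", "it", "in"]) == "bank/financial"
```

Unlike a Transformer, this scheme needs an explicit inventory of sense vectors and ignores word order entirely, which is why it handles only some cases of ambiguity.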

Their imperfections notwithstanding, large-scale predictive models appear to be much improved compared to LSA. The NLP models can capture important aspects of semantic similarity, including that nearest neighbors should come from the same semantic domain and that word meanings can be resolved with context. LSA is still widely used in psychological studies, but these examinations and the work of others (Baroni, Dinu, & Kruszewski, 2014) argue for replacing LSA with stronger word embeddings from techniques such as CBOW, which are readily available in pre-trained form (see the fastText software; Mikolov et al., 2018). Next, we discuss more quantitative tests than the nearest neighbor analyses popular in this literature, as well as more direct comparisons with human judgments.

6.1 Human Behavioral Judgments and Ratings

Researchers have sometimes compared the judgments of NLP models to those of humans who rated pairs of items on various dimensions. Pennington et al. (2014) compared GloVe similarity ratings to human similarity and relatedness ratings, for example, finding that correlations ranged from 0.48 to 0.84 across a variety of datasets. More recently, Mandera et al. (2017) tested various models on their abilities to predict word similarity and relatedness ratings. The authors note that a major test of lexical access is semantic priming, and so they also employed the models to predict the size of priming effects of word pairs from a large database of priming studies. The results are difficult to summarize, but one important generalization (p. 75) is that the large text corpora were best for vocabulary-type tests (like TOEFL), but smaller corpora based on film scripts and subtitles were quite good for association and priming. In general, predictive models did better than count models, consistent with findings from Baroni, Dinu, & Kruszewski (2014). Given that some of these comparisons were performed by psychologists and appeared in psychological journals, this seems to imply that such models provide accounts of psychological semantics. For example, Mandera et al. (2017, p. 57) describe their project as investigating “the relevance of these models for psycholinguistic theories,” and at the end conclude that “we can unequivocally assert that distributional semantics can successfully explain semantic priming data” (p. 75), although they do not draw specific psychological conclusions.

This leads to an important point: not all of these tests provide windows onto word meaning per se. That’s not to say that there is no influence of semantics, but some tests are primarily (or equally) subject to other variables. For example, consider word priming and association. Both of these are strongly affected by frequency and co-occurrence of words (and their referents). Indeed, for a long time it was a controversial question as to whether semantic similarity caused lexical priming at all when the words were not actually associated. In a meta-analysis, Lucas (2000) found semantic similarity does cause priming, but less than association does (about half the effect of association priming). The problem with association is that it is often reliant on contiguity, and although similar things may certainly co-occur in life or in language—e.g., cats and dogs—so do things that are not similar, like cat and meow, cat and tuna, cat and whiskers, or cat and bowl. Knowing that “cat” is related to “meow” and so on does not tell you what a cat is, what it looks like, or what things in the world should be called cat. Thus, capturing word associations in a model does not necessarily mean that one has captured word meaning.

7 Desiderata

So far, we have examined NLP models in terms of semantic similarity and found the results to be promising. However, semantic similarity is only one component that a model of psychological semantics would need to explain. At the beginning of this article, we asked what human behaviors a theory of lexical semantics should explain. We suggested five desiderata (Table 1). Next, we review these desiderata in the context of recent achievements in NLP and AI more generally. We conclude that despite the expanding capabilities of modern systems, they are still very far from a plausible account of psychological semantics.

It is important to emphasize that we are not requiring a model of semantics to actually interact with the world in all the ways we discuss below. However, a potential theory of semantics needs representations of objects, properties, relations, and categories and so on that could describe the world if appropriate interfaces were provided. A model can be a potential theory of human semantics, even if it doesn’t have all the elements necessary to actually perform the task under discussion.

7.1 Describe a perceptually present scenario, or understand such a description

We understand that none of the NLP models discussed so far has the interface to actually interact with the world: no cameras, microphones, mechanical hands, etc. More relevantly, their word representations are only meaningful in relation to other words; perceptual features and actions associated with the word’s referent are not represented. For example, the word embedding of knife does not contain information about the shape, parts, colors, and functions of knives. It may be related to words that describe some of those things, say blade and sharp, but those words are not perceptual features and are represented in terms of their relations to still other words, and so on.

Part of our argument about whether such models can be understood as accounts of semantics has to do with whether the word representations can be connected up to actual things in the world (Harnad, 1990). Pure text-based models are fundamentally limited in this way and as a result can never be psychologically adequate models of semantics. Fortunately, there is a research area that combines computer vision with NLP models in ways that have greater promise for developing a more realistic model (see Baroni, 2016, for a review). Kiela et al. (2016) propose embodiment in virtual environments as a long-term research strategy for AI. Reviewing evidence from cognitive neuroscience, McClelland et al. (2019) argue that language representations are deeply integrated with multi-sensory perceptual representations, as well as representations of situations and events. They propose that language models should be placed into the context of other modules for perception, action, memory, etc. Bisk et al. (2020) describe a roadmap for NLP that incorporates multi-modal, embodied, and social factors. In this section, we follow previous work in highlighting the multi-modal nature of psychological semantics, while also discussing how its conceptual basis goes beyond current work in multi-modal machine learning.

Figure 2: A neural architecture for caption generation (Xu et al., 2015). (A) An input image is processed with a ConvNet encoder, producing a visual embedding for each spatial location (red). The encoder passes these messages to the recurrent decoder (blue), which produces a caption word-by-word. Each decoder step attends to different spatial locations in the input image. (B) Where the decoder is attending when producing the words umbrella and ground as outputs.

Psychological semantics supports far more than word similarity. As we have noted, for human speakers the word meaning of knife contains information about the shape, parts, functions, and likely locations of knives, such that when they see or think about something having these properties, they can produce the word knife. A complete model would also understand the uses and implications of the concept (e.g., sharpness, dangerousness). If they hear, “I got the knife, but the blade was dull,” they should understand that the blade is part of the knife, and the blade is supposed to be sharp for carrying out certain functions (slicing, dicing, etc.). Similarly, if they hear, “I got the knife, but the handle was broken,” they should understand that the knife will be more difficult to use but is still potentially dangerous. Finally, if they hear “I didn’t have a knife, so I grabbed my keys to open the packing tape,” they understand that the concept of a knife is characterized, in part, by a functional role that other objects can satisfy in some circumstances. A model must represent the parts and properties of knives in a coherent fashion in order to understand events described in text. Not everything one knows about knives must be included in the lexical representation, but enough must be so that basic sentences can be understood and appropriate inferences drawn.

AI researchers are certainly working on various forms of multi-modal learning. A recent flurry of work has focused on integrating vision and language, leading to creative combinations of computer vision and NLP models. Active research areas include image caption generation (X. Chen et al., 2015; Vinyals et al., 2014; Xu et al., 2015), visual question answering (Johnson, Hariharan, van der Maaten, et al., 2017; Agrawal et al., 2017; Das et al., 2018), visual question asking (Mostafazadeh et al., 2016; Rothe et al., 2017; Z. Wang & Lake, 2019), zero-shot visual category learning (Lazaridou et al., 2015; Xian et al., 2017), and instruction following (Hill et al., 2020; Ruis et al., 2020). The multi-modal nature of these tasks grounds the word representations acquired by these models, as we discuss below.

Neural architectures for these tasks typically follow one of two templates. The first template is appropriate for tasks that take visual input and produce language output, such as caption generation or question asking. As shown in Figure 2, the basic architecture involves two neural networks working together: a visual encoder and a language decoder. These models often start with a pre-trained encoder and, less frequently, a pre-trained decoder. The encoder is a convolutional neural network (ConvNet) pre-trained on an object recognition task, such as ImageNet. (ImageNet and all datasets in this paper were used only for non-commercial research purposes, and not for training networks deployed in production or for other commercial uses.) The encoder produces a “visual embedding” (in contrast to a “word embedding” as discussed earlier), which is passed as a message to the decoder. The decoder is a language model that generates text, following the RNN language models discussed previously (Figure 1B). The language decoder can be trained from scratch, or it can start with pre-trained word embeddings (e.g., CBOW) or a fully pre-trained language model (e.g., GPT-2). After initialization, the encoder and decoder are trained jointly (end-to-end) on the downstream task of interest, such as image captioning, allowing the visual embeddings to link up with the word embeddings in service of solving the task. The encoder can communicate with the decoder by passing a single visual embedding that summarizes the image content (Vinyals et al., 2014). More powerful models pass a set of visual embeddings from the encoder to the decoder, using different embeddings for different spatial locations in the image. Using these localized embeddings, the decoder learns to attend to different parts of the image as it produces words (Xu et al., 2015). Impressively, these models can show emergent visual-language alignment; the decoder often attends to the umbrella part of the image when producing the word “umbrella” in the caption (Figure 2), perhaps analogous to human attention when describing scenes (Griffin & Bock, 2000).
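The attention step at the heart of this template can be sketched in a few lines of code. The sketch below is illustrative only (the function names, toy dimensions, and random values are ours); Xu et al. (2015) compute the attention scores with a small learned network rather than the raw dot product used here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(visual_embeddings, decoder_state):
    # Score each spatial location against the decoder state, then mix
    # the visual embeddings by those weights into one context vector.
    scores = visual_embeddings @ decoder_state   # one score per location
    weights = softmax(scores)                    # attention distribution
    context = weights @ visual_embeddings        # weighted visual summary
    return context, weights

# Toy setup: 4 spatial locations from the ConvNet encoder, 3-dim embeddings.
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))   # localized visual embeddings
h = rng.normal(size=3)        # current decoder hidden state
context, weights = attend(V, h)
```

At each decoding step the context vector is fed to the decoder alongside the previous word, which is how the model can come to attend to the umbrella region of the image while emitting the word “umbrella.”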

Figure 3: A neural architecture for visual question answering. An input image is processed with a ConvNet encoder (red), while a question input is processed with an RNN encoder (blue). Information from both encoders is combined in another network (gray) to produce the answer to the question.

The second template for neural architectures is common in grounded language understanding tasks such as question answering and instruction following. As shown in Figure 3, the architecture has two encoders: a visual encoder processes the image (as in Figure 2) and a language encoder processes the question or instruction. The encoders feed into another neural network that produces an output (e.g., the answer to the question, or the actions to perform a command). As before, the challenge is aligning the visual and language embeddings to successfully perform the task of interest.
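A minimal sketch of this two-encoder template, with random matrices standing in for the trained networks (all names, dimensions, and candidate answers below are ours, purely for illustration):

```python
import numpy as np

def answer_question(visual_embedding, question_embedding, answer_net, answers):
    # Fuse the two encoder outputs and score each candidate answer.
    # Real systems use deeper fusion networks than a single linear map.
    fused = np.concatenate([visual_embedding, question_embedding])
    logits = answer_net @ fused          # one score per candidate answer
    return answers[int(np.argmax(logits))]

rng = np.random.default_rng(1)
v = rng.normal(size=5)        # output of the ConvNet image encoder
q = rng.normal(size=4)        # output of the RNN question encoder
W = rng.normal(size=(3, 9))   # stand-in for the trained answer network
prediction = answer_question(v, q, W, ["yes", "no", "umbrella"])
```

The same skeleton covers instruction following, with the output layer producing actions rather than answers.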

These multi-modal models can learn useful visual-language alignments, as measured through zero-shot classification performance (Xian et al., 2017) and correlational analyses of multi-modal embeddings (Roads & Love, 2020). Through pre-training and fine-tuning, these multi-modal neural networks can absorb truly immense quantities of visual and language experience. In a typical case, the models learn from a million or more labeled images during visual encoder pre-training, billions of words through word embedding pre-training, and hundreds of thousands of image-caption pairs for task fine-tuning over the entire model. Taken together, could multi-modal models trained on these massive datasets provide the necessary connection between words and the world?

Figure 4: Captions generated by a neural network for scenes depicting umbrellas.

Multi-modal models help to provide substance to semantic representations, and this promising research direction will undoubtedly progress further. Nevertheless, as they stand these models are not satisfactory accounts of psychological semantics; learning to associate visual patterns with words is not sufficient to provide semantic knowledge. Returning to the knife, one must ultimately know things like the relation between the handle and the blade, what their names are, what each is used for, where and when a knife typically occurs, the function of a knife, how that function would change with a dull blade or a broken handle, and so on. This information must be abstract enough to generalize across modalities and be integrated with broader knowledge (Murphy & Medin, 1985; Rumelhart, 1978).

Of course, a proponent of NLP systems might propose that such models do in fact include detailed information about the structure of the world, since that information is embodied in the huge text corpus that the model is trained on. There may well be sentences in a corpus about a knife being broken, the handle coming off, a dull knife being dangerous, the blade in particular being dull, etc., which together might embody exactly the information needed to recognize and think about knives. If these models are paired with computer vision modules, they might well be able to create componential representations of the properties and parts of knives that are necessary for word use.

Figure 5: Captions generated by a neural network for scenes depicting natural umbrellas.

We use the concept umbrella as an informal test of this idea. We could have chosen any number of concepts to test instead (chair, scissors, etc.), but we felt that umbrella was the fairest case, since it is one of the few categories that is well-represented in both the popular ImageNet dataset for object recognition (Deng et al., 2009) and the popular MSCOCO dataset for caption generation (Lin et al., 2014). A canonical umbrella—a piece of circular canvas stretched over ribs with a handle—is highly recognizable from its visual features alone, due to the high consistency from one umbrella to another and the distinctive visual patterns that separate umbrellas from other classes. We tested both a strong object recognition model, ResNet50 (He et al., 2016), and a strong captioning system based on Xu et al. (2015) that combines visual input with language processing in the way we have described (Figure 2). (We used the torchvision ResNet50 recognizer and an updated version of Xu et al. (2015) for caption generation.) The umbrellas shown in Figure 4 are correctly classified by the object recognition system as umbrella (these are ImageNet images tagged as umbrellas, which may include both training and test images), and CBOW embeddings provide reasonable semantic neighbors based on the label alone, including umbrellas (0.7), parasol (0.59), parapluie (0.57), brolly (0.57), and raincoat (0.54)—the five nearest neighbors that aren’t misspellings or alternative capitalizations. As shown in Figure 4, the image captioning system usually succeeds in identifying salient umbrellas in ImageNet images and generating reasonable captions that mention the umbrella. Given these apparent successes, do these systems actually understand the meaning of umbrella?
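The neighbor lists above are produced by ranking vocabulary words by cosine similarity in embedding space. A minimal sketch, with small hand-made vectors standing in for real CBOW embeddings:

```python
import numpy as np

def nearest_neighbors(word, embeddings, k=3):
    # Rank all other vocabulary words by cosine similarity to the query.
    q = embeddings[word]
    def cosine(v):
        return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
    scores = {w: cosine(v) for w, v in embeddings.items() if w != word}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Tiny hand-made vectors for illustration (not real CBOW embeddings).
toy = {
    "umbrella": np.array([0.9, 0.1, 0.0]),
    "parasol":  np.array([0.8, 0.2, 0.1]),
    "raincoat": np.array([0.6, 0.4, 0.2]),
    "stove":    np.array([0.0, 0.1, 0.9]),
}
neighbors = nearest_neighbors("umbrella", toy)  # → ["parasol", "raincoat", "stove"]
```

With real embeddings the same query returns the graded neighbor scores reported in the text, e.g., parasol at 0.59.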

These vision-meets-language models can identify umbrellas of the sort in their training set, but such tests do not require the detailed knowledge of objects that we discussed above, i.e., the function of umbrellas and why they have the structure they do. We ran an “ad hoc umbrella” test to illustrate one such difference. In nature, some animals such as frogs and monkeys occasionally hold leaves, flowers, etc. as natural umbrellas when it is raining. Figure 5 shows a range of natural umbrellas. These photos are striking—even iconic in some cases, such as the orangutan captured by National Geographic—because they are unexpected yet easily recognizable portrayals of an umbrella, or at least the function of an umbrella. In contrast, the state of the art in machine understanding lacks these abstractions. The same ResNet50 recognizer that perfectly identified the umbrellas in Figure 4 does not successfully classify any of the ad hoc umbrellas; in fact, the model ranks the label “umbrella” as quite unlikely compared to other labels. (The mean rank for “umbrella” is 181 of 1000; the highest rank is 42.) For instance, when ResNet50 makes predictions for the bottom row of images in Figure 5, compared to “umbrella” the left image is more likely to be a “candle,” the middle image is more likely to be a “fountain,” and the right image is more likely to be a “stove.”

The image captioning system does not fare much better. Despite generating plausible captions related to umbrellas in Figure 4, the caption system produces largely nonsensical descriptions of these scenes, including “a bird that is sitting on a car” and “a close up of a vase with a flower in it” (Figure 5). The novel combination of a natural object used as an umbrella for an unexpected animal seems to have disrupted the system’s identification of the animals. It does use the word “umbrella” once, but notably the object recognizer did not consider “umbrella” a plausible label for this same photo, highlighting the issues with abstraction and robustness. A person may not spontaneously describe each image in Figure 5 as an animal using an umbrella, but surely they would recognize the category if asked directly.

Human word understanding is no doubt related to pattern recognition, but it is also conceptual and model-based, reflecting our understanding of the world around us (Lake et al., 2017). A computer vision or NLP algorithm learns patterns that distinguish umbrellas from other entities, while a person also learns a simple model of an umbrella that covers its key parts, relations, uses, and ideals (Rumelhart, 1978). For umbrellas and common concepts more generally, these models go on to support a variety of tasks: classifying and generating new examples, understanding parts and relations, inferring hidden properties, forming explanations, creating new yet related concepts, etc. (Lake et al., 2015; Murphy, 2002; Solomon et al., 1999), as well as understanding sentences describing all those things. We do not walk through our world labeling the objects we see and hear; instead, we tend to point out objects that are not there but should be, errors, unusual situations, goal situations that we want to take place in the future, and facts that we want our interlocutor to know. In order to generate the words and sentences that communicate these things, our vocabulary must be responsive to our goals and knowledge, and not just the perceptual and motor properties of the objects we label. All this is to say that object recognition is one part of the input to our semantic system, but there must also be connections to broader knowledge of the type discussed by research on human concepts (e.g., Gelman, 2003; Keil, 1989; Murphy, 1988).

In sum, recent combinations of computer vision and NLP models have taken important steps towards grounding text-based representations, an essential quality of any model of psychological semantics. These multi-modal models can accurately describe some perceptual scenarios and understand enough to answer certain questions. But they are still very far from understanding words as people do. Further progress may come from improvements in the training data; it’s possible that through training on large-scale video/audio or richer 3D environmental simulations, models will come to develop more complete and useful semantic representations for words. We are doubtful this alone will suffice, given the vast gaps between human and machine conceptual understanding and the issues of abstraction outlined above.

A noteworthy alternative is “neuro-symbolic” modeling, which builds hybrid models out of both neural network and symbolic components. Most multi-modal models, including the ones we described above, learn direct neural mappings between vision and text. Instead, neuro-symbolic models use intermediate structured representations for bridging between modalities (Johnson, Hariharan, van der Maaten, et al., 2017; Mao et al., 2019; Z. Wang & Lake, 2019), building on formalisms used in semantic parsing (Eisenstein, 2019) and Language of Thought models of human concept learning (Goodman et al., 2008, 2015). These intermediate representations offer stronger forms of abstraction, compositional generalization, and links to structured models of concept learning. This is a promising direction for building richer models of psychological semantics, although current approaches still have very simple semantics and aren’t as developed as their direct-mapping counterparts. It will be some time before any algorithm can observe the world and form a knowledge structure anything like that of humans, and it is not known now how this might be accomplished.

7.2 Choose words on the basis of internal desires, goals, or plans

In some ways, this issue is the opposite of the one just discussed. People need to be able to produce words based on external inputs (describing a scene or labeling an object) but also based on internal ideas. Standard models of language production (Levelt, 1993) propose that a sentence begins as a thought, a proposition of some kind that the speaker wishes to communicate. Words are matched to components of the thought, and a syntactic frame is selected that will match the structure of the proposition. Then further processes spell out the details of the sentence, the phonetic structure of the words, and the motor sequences involved in producing them. A theory of word meaning must deal with the initial step in this process, the translation of idea to words.

The text-based systems we discuss do not have representations of ideas, per se. That is, the nonlinguistic thoughts that then generate linguistic representations are not present in them; all their representations are in terms of how words relate to other words. Since thought is not (or not solely) word-based (Fodor, 1975; Murphy, 2002), words must connect to concepts more generally. So far, this aspect is missing from the systems we have described.

Of course, there are dialogue systems that can have conversations with users, and these systems vary in the structure and richness of their internal states. Current systems tend to fall into two types (H. Chen et al., 2017): 1) goal-directed systems with richer internal states but more limited language skills, or 2) text-driven systems with more sophisticated language skills but more limited internal states. Goal-directed dialog systems often assist a user with specific tasks like making travel plans or restaurant reservations. Such systems do often have knowledge of their limited domains (e.g., flight schedules and costs, travel rules, typical preferences, etc.), and they choose their words on the basis of achieving goals and satisfying the user. In a theoretical sense, these systems may be considered more successful as semantic systems, in that their words connect to actual entities, actions, and events in the world. If a customer says, “Let’s book a morning flight from New York to Miami,” the dialogue system may first translate the natural language utterance into a formal description specifying entities and relations through the process of semantic parsing (Eisenstein, 2019). Ultimately, a ticket may be sold involving a flight that actually will go from New York to Miami.

Goal-directed systems are a step forward in imbuing word representations with meaningful semantics; unfortunately, in other ways, these systems do not have sophisticated word representations. For language production, these systems typically serve up canned chunks of text (even more recent neural-network-based dialogue systems do so, Bordes et al., 2017), prioritizing the usefulness of the interaction rather than understanding the words they are saying. They lack the graded word representations that capture useful aspects of word similarity, as discussed earlier. For language understanding, many of these systems rely on hand-crafted features that map specific words (flight, connection, layover, price, tax, round-trip, etc.) onto internal entities (Young et al., 2013), although other systems are beginning to integrate neural language understanding (Bordes et al., 2017). By strictly limiting the world in which such a model operates, it is able to manipulate real events and objects within that world. However, such models lack the linguistic flexibility and sophistication that allows people to name novel objects, make and understand novel combinations, and produce novel thoughts. An airline reservation system can learn about your specific travel plans, but it could not learn a new fact about air travel through verbal communication.

If the world could be hand-coded into a representation like that of an airline reservation system, that could serve as the semantic basis for a communication system. The problem is that hand-coding the world is enormously difficult, so the technique of most NLP programs is to attempt to develop a system that will learn on its own from existing data. So far, those attempts have not resulted in structured knowledge of the sort one can create in a hand-designed system.

The second type of dialog system is more akin to large-scale language models; these systems are typically broader in scope and trained on very large corpora of text-based dialog (Serban et al., 2016; Sordoni et al., 2015). Although their language skills are much improved, these so-called “chit-chat” systems are characterized by their undirected and reactive nature. They are trained to react to user utterances in statistically appropriate ways, rather than formulating words based on internal desires, goals, and plans. If these systems have analogs of these internal states, it is only in the most implicit sense. Standard models do not have inductive biases that encourage the formation and use of goals and desires, although one could develop architectures that encourage these types of representations. We see this as an important research direction for improving the psychological plausibility and behavioral capabilities of computational models.

There are ongoing efforts focused on addressing some of these shortcomings. Models can be conditioned on “persona” embeddings to encourage consistency in personality and goals, although again in a highly implicit sense (Li et al., 2016; Zhang et al., 2018). Additionally, these systems are rarely grounded in the ways discussed under the first desideratum, but new efforts have focused on more grounded forms of dialog through curriculum-driven language learning (Mikolov et al., 2016), text-based adventure games (Urbanek et al., 2019), and discussions of natural images (Shuster et al., 2020). These directions may well expand machine capabilities, but as it stands, humans are unique in choosing words on the basis of genuine ideas, desires, and plans, thanks to the deep links between their internal states and the meanings of words.

7.3 Produce and understand novel conceptual combinations

Language has been characterized as making infinite use of finite means, allowing familiar elements to combine productively to make new meanings. Our paper focuses on word meaning rather than language more generally, so we cannot discuss all the ways in which language is compositional (see Gershman & Tenenbaum, 2015, for failures of NLP models on simple relational phrases). Nevertheless, the compositional nature of language constrains models of word representation; any model of word meaning that does not permit compositionality would be a non-starter from the perspective of psychological semantics. To focus on semantic rather than syntactic composition, this section considers conceptual combinations: two-word noun phrases that include adjective-noun (e.g., “dirty bowl”) and noun-noun phrases (e.g., “apartment dog”). Speakers produce novel compositions during conversation, and listeners can understand them (Wisniewski, 1997). A model of psychological semantics must do the same.

Do modern NLP systems provide a basis for understanding combinations? Within the distributional semantics tradition, there have been attempts to create vectors for phrases like “apartment dog” out of the vectors of the two words. These attempts have had some success in accounting for various human judgments (Günther & Marelli, 2016; Marelli et al., 2017; Vecchi et al., 2016), but they have the shortcoming that the resulting representations, like the underlying lexical representations, are inscrutable. That is, there is no easy way to find out how the model is interpreting “apartment dog.” Therefore, we look to sentence-based models that can be queried as to what they believe about the combinations.

As we suggested above, it is possible that through their processing of billions of sentences, models have in some sense learned the knowledge necessary to process combinations (at least in terms of verbal relations). That is, perhaps the model has learned what goes in bowls, when they are washed, what makes things sticky, etc., through thousands of sentences that refer to such things. Thus, perhaps they can understand conceptual combinations like “dirty bowls are sticky” (see below) fairly well. To address this question, we first review some of the work on conceptual combination in cognitive science and NLP. Second, we compare people and modern Transformers on a set of complex concepts that have been tested on humans in past psychological studies, finding that current large-scale predictive modeling does not reach human levels of competence.

Any theory of concepts must ultimately account for conceptual combination, and a variety of ideas have been put forward in the cognitive science literature about how to do that (Murphy, 2002, Chapter 12; Wisniewski, 1997). One can construct a fairly simple theory of adjective-noun composition that focuses on feature weighting and adjustment (Smith et al., 1988). For the combination purple jacket, the head jacket is a prototype in feature space, and the adjective purple changes a particular feature of that prototype, using a self-contained process that depends only on this pair of word representations. However, this account cannot explain noun-noun compounds, which are extremely common in English, and many modifiers don’t seem to adjust the same features when combined with different heads, e.g., ocean view, ocean book, ocean wave, etc. (Murphy, 1988). Furthermore, although a purple jacket is purple, an ocean book is not “ocean.” These complexities suggest instead that conceptual combination requires active interpretation in ways that rely on background knowledge, using knowledge to decide which features to modify and how they should be adjusted. For instance, empty stores typically lose money, but neither stores nor empty things typically do so. We infer this through knowledge that stores make money by selling products, customers purchase products, empty stores have no customers, etc., again revealing the close connection between conceptual knowledge and language understanding. Although knowledge-based accounts are difficult to formalize, they are necessary to account for the sophistication of human conceptual representations and the mental chemistry by which they combine.

Research in NLP has also examined how word embeddings might be combined to form complex concepts. When introducing word representations above, we discussed averaging word embeddings (or summing) as a means of composition in NLP, e.g., “Vietnam” + “capital” results in a vector similar to that of “Hanoi” (Mikolov et al., 2013). We quickly dismissed averaging as a means of constructing sentence representations, due to the loss of syntactic and relational structure, but these issues are less damaging for two-word compounds. Nevertheless, many of the cases discussed above pose challenges to additive models, namely adjectives that affect different head nouns in different ways (see Murphy & Andrew, 1993). For example, an ocean view is a view where one can see the ocean, whereas an ocean wave is a wave of the ocean. It’s difficult to see how adding the same exact ocean vector to each head noun could produce such a wide range of semantic transformations.

More capable models use matrix-vector multiplication, with adjectives as matrices and nouns as vectors (Baroni & Zamparelli, 2010). Linear transformations can account for more subtle effects, including adjectives that emphasize one property for some types of nouns and another property for other types of nouns (Baroni, Bernardi, & Zamparelli, 2014). Still, linear transformations are quite limited; properties that emerge through active interpretation and background knowledge remain a clear challenge (e.g., “empty stores lose money”). Instead, large-scale Transformers allow for essentially arbitrary transformations, at least in principle, and thus we compare this model class with human judgments concerning conceptual combination.
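The contrast between the additive and matrix-vector (lexical function) models can be made concrete with stand-in vectors and matrices (all values below are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4
view = rng.normal(size=dim)              # head noun "view"
wave = rng.normal(size=dim)              # head noun "wave"
ocean_vec = rng.normal(size=dim)         # "ocean" as a vector (additive model)
ocean_mat = rng.normal(size=(dim, dim))  # "ocean" as a matrix (lexical function model)

def additive(modifier_vec, head_vec):
    # Additive composition: the identical shift for every head noun.
    return modifier_vec + head_vec

def lexical_function(modifier_mat, head_vec):
    # Matrix-vector composition: a head-dependent linear transformation.
    return modifier_mat @ head_vec

# The additive model changes every head by the same offset...
shift_view = additive(ocean_vec, view) - view
shift_wave = additive(ocean_vec, wave) - wave
# ...while the matrix model transforms different heads differently.
change_view = lexical_function(ocean_mat, view) - view
change_wave = lexical_function(ocean_mat, wave) - wave
```

Even the matrix model is still linear in the head noun, which is why emergent, knowledge-dependent properties such as “empty stores lose money” remain out of reach for both.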

In a series of human behavioral studies, Murphy (1988) studied the role of background knowledge in conceptual combination by collecting ratings of whether certain features are more typical of compounds or of their constituent parts. He reasoned that if conceptual combination is a closed operation—and the features of compounds can be computed locally from the features of their constituents—compounds should not have properties that are not found in their constituents (compounds should not have emergent features). He constructed 18 adjective-noun compounds paired with properties that he hypothesized would violate this constraint, in that forming the conceptual combination requires background knowledge. Human participants were asked to rate how typical a feature (e.g., “loses money”) is of a category; some participants made these ratings with respect to the compounds (e.g., empty stores), others with respect to the adjective (empty), and others with respect to the noun (stores). Murphy found that for 15 of the 18 items, the candidate property was rated as more typical of the adjective-noun compound than of either the adjective or the bare noun. We show these 15 items in Table 4. For instance, sliced apples are typically “cooked in a pie,” but neither apples nor sliced things are as typically associated with that property. Dirty bowls are typically “sticky,” but neither bowls nor dirty things are typically sticky.

NLP systems cannot be directly queried about how typical a property is of an object, yet as probabilistic language models, they implicitly associate objects and properties. These associations can be probed in various ways; we chose the conditional probability—as estimated by auto-regressive models such as GPT-2—as a straightforward means of extracting these associations. (BERT cannot be easily evaluated since it predicts only the probability of individual missing tokens.) To measure the association between a noun phrase X and “lose money,” GPT-2 was queried using an association score, log P(“lose money.” | X), such that X can be filled by “Empty stores,” “Crowded stores,” “Stores,” “Regular stores,” or “Empty things.” (Periods were always included at the end of the sentences in the evaluations.) Although the raw probabilities are uninterpretable, their relative rankings are informative: how much more likely is the probe “Empty stores lose money” compared to “Regular stores lose money”? The scores for different noun phrases can be meaningfully compared because this method controls for differences among the noun phrases due to a variety of factors: frequencies, lengths, and prior probabilities. This control is possible because the left-hand side of the conditional is the same in each case, and the different noun phrases appear only on the right-hand side.
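The scoring procedure can be sketched as follows. The probabilities here are hand-set stand-ins purely for illustration; with GPT-2, each term would come from the model’s softmax over tokens, conditioned on all preceding tokens:

```python
import math

def association_score(lm, prop, phrase):
    # log P(property | phrase): sum the conditional log-probability of
    # each property token given the noun phrase. A real implementation
    # would also condition on the property tokens generated so far.
    return sum(math.log(lm[(phrase, w)]) for w in prop.split())

# Hand-set toy probabilities, not real GPT-2 outputs.
toy_lm = {
    ("Empty stores", "lose"): 0.020, ("Empty stores", "money."): 0.050,
    ("Regular stores", "lose"): 0.010, ("Regular stores", "money."): 0.040,
}
scores = {np_: association_score(toy_lm, "lose money.", np_)
          for np_ in ["Empty stores", "Regular stores"]}
best = max(scores, key=scores.get)
```

Because the property side of the conditional is identical across probes, differences in the scores reflect the noun phrases alone, which is the control described above.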

Using this methodology, we evaluate Murphy’s 15 items using the GPT-2 language model. Recall that in the behavioral study, people were evaluated on the association between the property and the complex concept “Empty stores,” compared to the property and the bare noun (“Stores”) and the bare adjective (operationalized as “Empty things”). GPT-2 was evaluated on these three associations too. For additional rigor, we also evaluated GPT-2 on an alternative phrasing of the bare noun (“Regular stores”) and an inconsistent adjective-noun pair (“Crowded stores”). A prediction was considered successful (and in alignment with the human judgments) if the property was most typical of the consistent adjective-noun pair, e.g., if “Empty stores lose money” had the highest score.

As summarized in Table 4, GPT-2 makes successful predictions for only 7 of 15 items. Notable errors include judging “are rusty” as more typical of “new saws” than “ancient saws” and judging “are sticky” as more typical of “clean bowls” than “dirty bowls.” GPT-2 also judges “losing money” as more typical of “regular stores” than “empty stores.” Despite these obvious errors, it did get some cases right that seemingly require background knowledge, such as knowing that “are cooked in a pie” is more typical of “sliced apples” than “whole apples.” However, it may have exploited word repetition to get two other cases right: “Russian novels are written in Russian” and “Green bicycles are painted green.”

To see if adding context influences the results, we re-ran each case with an additional sentence before the main sentence of interest. (The sentence did not directly refer to the tested feature.) For instance, GPT-2 scored the following multi-sentence utterance, “Stores are still where most products are purchased. Empty stores lose money.” As before, the only phrase to appear on the left-hand-side of the conditional is “lose money” while everything else appears on the right-hand-side. The additional context did not seem to help, as the model largely got the same set of cases right (7 of 15; Table 4).

Large-scale predictive modeling of text does not explain human judgments of complex concepts, at least through the methods that we used. Moreover, the items from Murphy (1988) should not be taken as the final word on understanding complex concepts. Due to the massive scale of training, predictive text models like GPT-2 will already be familiar with many of these complex concepts (empty stores, sliced apples, etc.), even if they do not understand them fully. In contrast, people can generate and understand genuinely new compositions, say a cactus pig or snow soda (Wisniewski, 1997). We hope that the NLP community will take up the challenge, ideally informed by human behavioral data and in partnership with cognitive scientists. We see the challenge of understanding complex concepts, and the role of background knowledge in interpreting these compositions, as key to understanding words as people do.

                                         | --------------- Most typical part --------------- |
Combination and property                 | Human judgments    | GPT-2 (no context) | GPT-2 (context)
Sliced apples are cooked in a pie.       | Sliced apples      | Sliced apples      | Sliced apples
Casual shirts are pulled over your head. | Casual shirts      | Formal shirts      | Casual things
Small couches seat only 2 people.        | Small couches      | Small couches      | Small couches
Uncaged canaries live in South America.  | Uncaged canaries   | Canaries           | Canaries
Round tables are used at conferences.    | Round tables       | Round tables       | Round tables
Standing ostriches are calm.             | Standing ostriches | Standing things    | Ostriches
Unshelled peas are long.                 | Unshelled peas     | Unshelled things   | Regular peas
Yellow jackets are worn by fishermen.    | Yellow jackets     | Regular jackets    | Yellow jackets
Green bicycles are painted green.        | Green bicycles     | Green bicycles     | Green bicycles
Overturned chairs are on a table.        | Overturned chairs  | Overturned chairs  | Overturned chairs
Short pants expose knees.                | Short pants        | Short pants        | Short pants
Ancient saws are rusty.                  | Ancient saws       | New saws           | Saws
Russian novels are written in Russian.   | Russian novels     | Russian novels     | Russian things
Empty stores lose money.                 | Empty stores       | Regular stores     | Regular stores
Dirty bowls are sticky.                  | Dirty bowls        | Dirty things       | Dirty things

Note: A model matches the human judgments (indicated in bold in the original table) when its column agrees with the Human judgments column.

Table 4: Comparing humans and GPT-2 on properties of complex concepts. Most typical part shows the noun phrase (e.g., Empty stores, Crowded stores, Stores, Regular stores, or Empty things) of which participants (Murphy, 1988) or models judged the indicated property to be most typical.

7.4 Change One’s Beliefs About the World

A theme so far has been how background knowledge influences one’s understanding of words. In this section, we consider how influence flows the other way: how words change one’s beliefs about the world. People extract propositions from language that can change their beliefs: “Sharks are fish but dolphins are mammals”; “Knives should be sharpened every few weeks for best performance”; or “Most umbrellas cannot withstand winds of 20 miles per hour.” Learning through words, as opposed to direct experience, is a distinctive feature of human intelligence that facilitates cultural learning and the accumulation of knowledge (Tomasello et al., 1993).

Machines can also learn through text that is disconnected from experience. As discussed in Section 7.1, models with access only to text must learn everything in this way. Predictive models such as BERT, GPT-2, CBOW, etc. are not incentivized to learn about the world per se, only to learn about how other words in the context predict a target word. However, as we discussed above, one could conceive of this information as approximating knowledge of the world (though it is only knowledge of words) in the sense that complex relations between various elements are abstracted as statistical patterns, which correspond to the patterns of entities in the world. If we think of a model’s representations as being its beliefs about the world, we can ask whether such beliefs are reasonable and also whether they can be readily added to and changed.

Changes in a model’s lexical representations should be reflected, in some way, through changes in its knowledge about the world. For instance, a language model is tasked with predicting a missing word in a novel sentence, “Sharks are fish but dolphins are       ”. A model that knows little about dolphins might predict fish with high probability, and then receive feedback that the correct answer was actually mammals. Through backpropagation, the model makes small adjustments to increase the probability of outputting mammals when this sentence is encountered again. These adjustments will change the input layer’s word embeddings (for fish, sharks, dolphins, etc.), the output layer’s word representations (mammals, fish, etc.), and many other internal parameters. After sufficient presentations of the sentence, the network will learn to produce the right answer, hopefully in a way that generalizes to related sentences. But is such learning sufficient to develop correct beliefs about dolphins, mammals, and their properties?
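The learning step described above can be sketched with a tiny softmax output layer trained by gradient descent; the context vector, vocabulary, and learning rate are all invented stand-ins for a real language model:

```python
import math
import random

# A toy sketch of the update described above: cross-entropy gradient steps
# raise the probability of the correct word ("mammals") for one fixed
# context. All sizes and values are hypothetical stand-ins.
random.seed(0)
vocab = ["fish", "mammals", "birds"]
context = [random.gauss(0, 1) for _ in range(8)]               # sentence encoding
W = [[random.gauss(0, 0.1) for _ in range(8)] for _ in vocab]  # output weights

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict():
    logits = [sum(w * x for w, x in zip(row, context)) for row in W]
    return softmax(logits)

target = vocab.index("mammals")
lr = 0.05
p_before = predict()[target]

for _ in range(25):  # a few small gradient steps
    p = predict()
    for i, row in enumerate(W):
        err = p[i] - (1.0 if i == target else 0.0)  # d(cross-entropy)/d(logit_i)
        for j, x in enumerate(context):
            row[j] -= lr * err * x

p_after = predict()[target]  # higher than p_before for this exact input
```

The probability of mammals rises for this exact sentence, but as the main text argues, nothing in this procedure guarantees coherent, belief-like knowledge that generalizes to rephrasings.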

It’s not entirely clear how to evaluate the beliefs of a language model. Although other methods are possible, we focused our analysis on prompts with missing words, testing models as closely as possible to the way they were trained. (As in previous tests, periods were used at the end of the sentences, and BERT’s special tokens were added at the beginning and end of the sentences.) Some predictions will certainly be correct; when probed with “Sharks are fish but dolphins are       ”, a trained BERT predicts that mammals (probability 0.027) is more likely than birds (0.013) and fish (0.010). However, BERT’s predictions are sensitive to small changes in wording—a property of current NLP systems that has been observed elsewhere (Jia & Liang, 2017; Marcus & Davis, 2019). When probed with “Sharks are fish and dolphins are       ” (swapping but with and), BERT now predicts that birds (0.14) is more likely than mammals (0.031) or fish (0.025). Similarly, when probed with “Sharks are fish while dolphins are       ”, BERT predicts birds again. If probed more directly as “Dolphins are       ”, BERT properly predicts mammals (0.0024) over fish (0.0014) and birds (0.00031). But when asked the same question about the singular noun, “A dolphin is a       ,” it predicts fish. Examining the word embeddings doesn’t clear up the story. For BERT, dolphin is more similar to mammal (the cosine is 0.40) than it is to fish (0.32) and bird (0.33). However, dolphin is more similar to fishes (0.45) than it is to either mammal or fish (or mammals or birds). It seems that BERT has some ideas about what dolphins are, but it is too tied to specific wording to be credited with general knowledge.

In a more systematic comparison, we evaluated knowledge of 31 animals in BERT and GPT-2 using the same framing as before, “A squirrel is a       ” or “Squirrels are       ”. (The list of animals in Kemp & Tenenbaum (2008) was used after removing the two “reptiles”, since this word isn’t a single token in BERT. We didn’t alternate “a” vs. “an” in the questions, which could help reveal the answer.) We considered four possible superordinate categories as answers (bird/birds, insect/insects, mammal/mammals, and fish/fish; using the singular or plural form of the category depending on the question). BERT predicts the right answer for only 54.8% of the questions in singular form, and for 77.4% of the questions in plural form. Strangely, the two forms of the same question yielded inconsistent answers more often (51.6%) than consistent answers. Notably, it predicts that squirrels and horses are mammals, but that a squirrel or a horse is a bird. Whales are mammals, but a whale is a fish. Butterflies are birds, and so is a butterfly. GPT-2 fared better, with 83.9% accuracy on singular forms and 77.4% on plural forms. Still, there was a striking inconsistency between the two ways of asking the same question, with 35.5% mismatches in the answers.
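The accuracy and consistency measures in this comparison can be computed in a few lines; the predictions below are hypothetical toy values, not the actual model outputs:

```python
# A sketch of the scoring used above: accuracy of superordinate-category
# predictions for each phrasing, plus consistency between the singular
# and plural phrasings of the same question. All predictions are invented.
def score(truth, singular_preds, plural_preds):
    n = len(truth)
    acc_singular = sum(p == t for p, t in zip(singular_preds, truth)) / n
    acc_plural = sum(p == t for p, t in zip(plural_preds, truth)) / n
    consistency = sum(s == p for s, p in zip(singular_preds, plural_preds)) / n
    return acc_singular, acc_plural, consistency

truth          = ["mammal", "mammal", "mammal", "insect"]  # squirrel, horse, whale, butterfly
singular_preds = ["bird",   "bird",   "fish",   "bird"]    # "A squirrel is a ___"
plural_preds   = ["mammal", "mammal", "mammal", "bird"]    # "Squirrels are ___"

acc_s, acc_p, cons = score(truth, singular_preds, plural_preds)
```

Accuracy treats each phrasing separately, while consistency ignores the ground truth and asks only whether the two phrasings agree, which mirrors how the mismatch rates above appear to be tallied.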

Taken together, it’s unclear if current language models hold any genuine and consistent beliefs about taxonomic relations. This uncertainty persists despite training on billions of words that include, presumably, the entire Wikipedia entries for dolphins, mammals, fish, etc. This muddiness is the hallmark of a primarily pattern-recognition driven learning process. During training, BERT hones its abilities at predicting missing words and the order of sentences, acquiring some inkling of how the word dolphins is related to mammals and fish, but nothing seemingly explicit or belief-like. As a result, it is ineffective at changing its beliefs or building a coherent world model based on systems of beliefs (Lake et al., 2017).

The challenges of representing and changing beliefs extend far beyond just taxonomic categories, and text generation provides another window into what, if anything, language models believe. Auto-regressive models such as GPT-2 can generate impressive passages of text, although these models frequently contradict themselves. In a highlighted demonstration of GPT-2’s text generation capabilities (Radford et al., 2019), the model was tasked with reading a fanciful prompt concerning talking unicorns and producing a reasonable continuation:

Prompt: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
GPT-2’s continuation: The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved…

GPT-2 impressively generates several more paragraphs of seemingly natural text (Radford et al., 2019, pg. 13). The model understood enough about this bizarre scenario—unlikely to be in its training corpus—to write fluently about it. Nevertheless, a closer look reveals the structure of GPT-2’s beliefs, in this particular case, as related to unicorns. The model contradicts itself almost immediately: in two adjacent sentences, it suggests that unicorns have one horn (“their distinctive horn”) while simultaneously having multiple horns (“four-horned, silver-white unicorns”). The model, evidently, doesn’t understand that unicorns must have exactly one horn. Even if you could help correct GPT-2, as you would a person, by specifying that “unicorns must have just one horn,” it’s doubtful that GPT-2 would get the message. Indeed, it seems hard to believe that similar sentences were not already in its learning corpus, given that it knows something about unicorns. It’s also not clear how one would open up the model and add or correct this fact, given that GPT-2 uses 1.5 billion learnable parameters to make predictions.

Language models can answer some difficult factoid-style questions, although they are hardly a reliable source. Using a corpus called Natural Questions (Kwiatkowski et al., 2019), GPT-2 was evaluated on factoids such as “Where is the bowling hall of fame located?” or “How many episodes in season 2 breaking bad?” GPT-2 answers 4.1% of these somewhat obscure questions correctly (Radford et al., 2019); performance is higher in the larger-scale GPT-3 (15-30%; Brown et al., 2020) and can also be substantially improved when combined with techniques from information retrieval (Alberti et al., 2019), e.g., providing BERT with all of Wikipedia and allowing it to answer by highlighting passages. Other studies (Petroni et al., 2019) have found that BERT makes reasonable predictions for more commonplace questions from the ConceptNet knowledge base, including “Ravens can       ” (prediction: fly) and “You are likely to find a overflow in a       ” (prediction: drain), although these predictions are brittle in all the ways outlined above. This suggests that current methods of extracting information from text corpora have not yet formed knowledge bases sufficient for conceptual combination and language understanding more generally, although techniques continue to improve.

Computational theories of semantic memory, from the very beginning, have recognized the need to specify the relations between words (or concepts) in order to provide coherent, accurate representations. That is, models must know that a dolphin ISA mammal and CAN swim, or else they will not be able to draw correct inferences. Unlabeled links from dolphin to mammal and swims are not sufficient (e.g., Brachman, 1979; Collins & Quillian, 1968). Word representations need to represent which words are properties, parts, synonyms, objects acted on, or things that simply tend to co-occur with the word’s referent. Earlier neural network models attempted to capture this type of semantic knowledge and, in particular, the developmental process of acquiring semantic knowledge (Rogers & McClelland, 2004). The limitation of the approach was not theoretical, but practical: The modelers hand-coded the features, category names, and their relations. Thus, the model was told that a canary is colored yellow, not just that canary and yellow go together. The argument for this is that human children get these relations through both perception (yellow is a color, which the canary-learner already knows) and language (parental input). Thus, specifying these relations is not a deus ex machina merely designed to make the model work. The latest NLP models we discuss do not attempt to learn explicit relations; they rely purely on text predictability to provide all the information.
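The difference between labeled and unlabeled links can be illustrated with a toy semantic network in the spirit of Collins and Quillian (1968); the facts and relation names below are invented for illustration:

```python
# A toy semantic network with labeled relations. Unlike an unlabeled
# association between "dolphin" and "swims", ISA and CAN links support
# inference by inheritance up the taxonomy. The facts are illustrative,
# not a real knowledge base.
FACTS = {
    "dolphin": {"ISA": ["mammal"], "CAN": ["swim"]},
    "mammal":  {"ISA": ["animal"], "CAN": ["breathe"]},
    "animal":  {"CAN": ["move"]},
}

def can(concept, ability):
    """True if `concept` has `ability` directly or via ISA inheritance."""
    node = FACTS.get(concept, {})
    if ability in node.get("CAN", []):
        return True
    return any(can(parent, ability) for parent in node.get("ISA", []))
```

With labeled links, the network concludes that a dolphin can breathe and move by climbing the ISA chain, an inference an unlabeled co-occurrence graph cannot license.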

Two things need to be done in order to scale up the Rogers & McClelland (2004) approach. First, perception must ground relations between properties and objects that are only implicit in language, using techniques in computer vision (Desideratum 1). Second, to avoid hand-coding, a neural network should be developed that can extract relations between words (ISA, CAN, etc.) from text even if it does not explicitly include labeled relations. As discussed in Section 4, structured distributional models aim to do exactly this by looking for specific word patterns (Baroni et al., 2010; Baroni, Bernardi, & Zamparelli, 2014). The system could also use a knowledge base to encourage explicitness and consistency in its belief system, relating to current efforts that combine neural networks with knowledge bases (Bosselut et al., 2020). To be fully successful, however, such a hybrid model would need to be able to use words to change its beliefs (such as “unicorns have only one horn”), as opposed to merely accessing fixed beliefs, further highlighting the complexity of psychological semantics and the abilities it supports.

7.5 Respond to Instructions Appropriately

The last desideratum concerns turning words into actions. None of the models discussed so far can respond to instructions, beyond simply generating more text. We do not fault them for this; our discussions have focused on models with sophisticated word representations rather than models for control problems. Nevertheless, just as understanding words as people do requires more than hooking up a language model to a camera or a computer vision system, simply attaching a robotic (or virtual) body to an NLP system and teaching it some instructions will not suffice. Suppose we take such a language-action hybrid system and ask it to follow a new instruction, “Pick up the knife.” The word embedding for knife (or the sentence representation as a whole) would need to convey a lot of information about the kind of object a knife is, in order for motor commands to properly operate on it. Although the embedding’s nearest neighbors indicate related objects, parts, and functions (dagger, blade, sword, and slicing according to BERT), the representation for knife must contain enough structured information to reliably pick up the knife rather than the sword, or grab the handle rather than the blade. It must also know how to combine familiar words in new ways, understanding the difference between novel utterances such as “Pick up the knife carefully” versus “Pick up the knife quickly.” In fact, after telling GPT-2 to “Pick up the knife”, it offered us this questionable continuation of the text: “If the blade is still on, place it in your pocket.”
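Nearest-neighbor lists like the one above come from cosine similarity in embedding space; here is a minimal sketch, with invented low-dimensional vectors standing in for real embeddings:

```python
import math

# A sketch of the nearest-neighbor lists mentioned above: words ranked by
# cosine similarity to the embedding of "knife". The 3-d vectors are
# invented toys; real embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "knife":  [0.9, 0.8, 0.1],
    "dagger": [0.8, 0.9, 0.1],
    "sword":  [0.7, 0.9, 0.2],
    "spoon":  [0.9, 0.1, 0.1],
    "sky":    [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighbors(word, k=3):
    """The k words most similar to `word`, best first."""
    others = [w for w in EMBEDDINGS if w != word]
    return sorted(others, key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]),
                  reverse=True)[:k]
```

Such a ranking captures relatedness (dagger ahead of spoon and sky), but it encodes none of the structured knowledge, such as where the handle is, that acting on a knife would require.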

Obviously, one should not expect text-based models to understand and perform actions. Text-based models need to be combined with other classes of models, as in recent work on multi-modal models for instruction following (Hill et al., 2017, 2020; Ruis et al., 2020). These systems typically follow the architectural blueprint discussed earlier (Figure 2), using a visual encoder to process visual input (a 3D room or 2D grid with objects in it) and a language encoder to process instructions (“Walk to the red circle” or “Put the picture frame on the bed”). As output, the network produces actions aimed at successfully completing the target instruction. After extensive training—often millions or billions of steps—these models typically understand enough about their subset of language, action space, and environment to successfully complete basic instructions. But do these models understand the words that they act upon? Are their representations of the words flexible enough that they can follow instructions with novel combinations that did not exist in the training set?

In many cases, instruction-following models can make meaningful generalizations. After learning to find various types of objects, models can often generalize to novel object and color pairs, successfully “finding the fork” when the fork is presented in a new color (Hill et al., 2017; but see also Ruis et al., 2020). After learning how to “find” or “lift” across many scenarios consisting of 30 different objects, models can generalize successfully to “lifting” objects that they only had experience with “finding” (or vice versa) (Hill et al., 2020). After learning that a certain class of heavy objects requires more actions to successfully “push” them, current models generalize that these same objects require more actions when “pulling,” given experience pulling other classes of heavy objects (Ruis et al., 2020).

The same classes of models, however, struggle with many other aspects of word meaning. For instance, models struggle to learn abstract, composable meanings for the word not, failing to “Find not the ball” even after learning about not in many training episodes with other types of objects. After learning not with respect to dozens of object types, generalization to new objects is below 50% correct in a 2D grid world (20 training object types), and somewhere between 60% and 80% in a richer 3D environment (Hill et al., 2020). The model’s semantic representation of not is insufficiently abstract and too grounded in the particulars of its training experience.

Figure 6: Instruction following in the gSCAN benchmark. (A) Generalizing from calling an object “big” to calling it “small.” (B) Generalizing to walking to targets in the south west, after learning to find targets in all other directions. Modified with permission from Ruis et al. (2020).

Standard models often fail to acquire abstract meanings of other types of words, such as actions. The learned concept for move is too tied to the training experience, e.g., “Move to the red square” as implemented in a simple grid world (Figure 6B). In the recent Grounded SCAN benchmark, Ruis et al. (2020) showed that you can train an agent to expertly move to targets positioned due south, due west, or anywhere except to the southwest of the agent’s current position (which is held out for testing). At test, the agent fails catastrophically when moving to a target to its southwest (0% correct). The agent often moves the appropriate number of steps west, or the appropriate number of steps south, but cannot seem to do both together to actually reach the target. Based on its attention maps, it seems to know where to go, just not how to get there.

The same model fails to learn abstract meanings for relational words, including small and large, that depend on the environmental context, e.g., “Move to the large red circle.” Suppose that the agent is familiar with referring to a mid-sized circle as either “the large circle” or “the large red circle,” etc. During training, mid-sized circles are only presented in conjunction with smaller circles, so the modifier large is always appropriate (Figure 6A). At test, this particular circle is paired with larger circles for the first time, and the agent is asked to “move to the small circle” in reference to the medium-sized circle. An agent with a genuine understanding of small should have no trouble, but instead it breaks down and performs no better than picking between circles at random (Ruis et al., 2020). Similarly, agents struggle to learn abstract meanings for adverbs such as spinning or zigzagging, when tasked with instructions such as “Move to the red square while spinning.” Although agents are great at spinning and zigzagging for learned instructions, they are abysmal when required to do these things in novel scenarios.

In sum, instruction-following models have a long way to go before understanding words as a person does. Current neural network models rely too much on pattern recognition, learning to identify high-value states or mapping chunks of instruction to chunks of action, without sufficiently grappling with more abstract forms of meaning (Lake et al., 2017). Further, the challenge of responding to instructions is compounded by many of the other challenges discussed already, including connecting words to the physical world (Desideratum 1), to goals and desires (Desideratum 2), and to other words (Desideratum 3).

7.6 Summary of Desiderata

Our critique is not that NLP researchers have failed to provide us with robots that move about and converse about the world around them. From the perspective of psychological semantics, our critique is that current word representations are too strongly linked to complex text-based patterns, and too weakly linked to world knowledge. Multi-modal models enrich these word representations by grounding them in vision and action, yet these word representations are too grounded in the particulars of their previous experiences. More abstract semantic representations that connect language to knowledge of the world are needed to understand and operate on the world as people do.

Successes in NLP

This raises the question of why deep learning for NLP works so well on many important problems. A full accounting of the remarkable successes of deep learning is beyond the scope of this paper; they have been discussed and analyzed at length in many places (LeCun et al., 2015; Schmidhuber, 2015). The reemergence of neural networks in the last decade was catalyzed by successes on quintessential pattern recognition problems, particularly object recognition (Krizhevsky et al., 2012) and speech recognition (Graves et al., 2013; Hannun et al., 2014), by learning features from raw data that were previously hand-designed. It’s natural to think that this approach would make advances in NLP as well, especially when combined with innovations in architecture (Hochreiter & Schmidhuber, 1997; Vaswani et al., 2017) and large datasets and computing resources. These successes are amplified by taking pre-trained word embeddings, or whole language models, and fine-tuning them to perform particular tasks, as is the typical current approach for tackling NLP benchmarks (A. Wang et al., 2019; Devlin et al., 2019). Modern NLP systems, following this approach, know a surprising amount about syntax and semantics, and even enough to answer some basic questions about the world. As discussed in Section 6, word embeddings know enough about the relationships between words to predict human similarity judgments and related tasks. As a result, the models can do well when they are given words as inputs and words as outputs, learning the right sorts of associations and patterns through massive amounts of training. The surprising bits of knowledge that models learn presumably come from this process.

For example, consider the world capitals that Mikolov et al. (2013) tested models on. The models don’t have a list of countries and their capitals; they don’t know what a country or a capital is, which would be required to use the word correctly in conversation or writing. However, it seems likely that the words Paris, capital, and France co-occur fairly often, and more often than Lyon, capital, and France or Paris, capital, and Argentina do. Verbs and their past tenses or inflections also co-occur, as the same action is talked about in infinitives and in different temporal contexts. For example, it would not be surprising if a text about surfing contained sentences like “Stephanie wanted to surf…,” “She had surfed before…,” and “Surfing is dangerous when…” The things that one says about surfing one might say in past, present, and future contexts, so the representations of different forms of the verb could become similar. The same is true for nouns and their plurals. These models don’t know what a plural is, but they can learn that knife and knives occur in the same kinds of passages, and so they are assigned similar representations. To a limited extent, language models can even track long-distance syntactic dependencies, knowing it’s proper to say “the knives in the drawer cut” rather than “the knives in the drawer cuts” (Linzen et al., 2016). However, based on text distributions alone, the models don’t necessarily learn that knife refers to one thing and knives to multiple things. The models don’t know that there are things. And without a representation of what a knife actually is, they cannot form semantic representations of the sort that people have.
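The mechanism described here, in which shared contexts pull knife and knives together, can be illustrated with simple co-occurrence counts over a made-up corpus:

```python
import math
from collections import Counter

# A toy illustration of the point above: words appearing in the same kinds
# of passages (here knife/knives) acquire similar co-occurrence vectors,
# with no representation of what a knife is. The corpus is invented.
corpus = [
    "the sharp knife cut the bread",
    "the sharp knives cut the bread",
    "she washed the knife in the sink",
    "she washed the knives in the sink",
    "the happy dog chased the ball",
]

def context_vector(word, window=2):
    """Count the words appearing within `window` tokens of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(w for w in tokens[lo:hi] if w != word)
    return counts

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in set(u) | set(v))
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

sim_knife_knives = cosine(context_vector("knife"), context_vector("knives"))
sim_knife_dog = cosine(context_vector("knife"), context_vector("dog"))
```

The vectors for knife and knives end up nearly identical because their contexts match, even though nothing in the procedure represents that a knife is a single object.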

There is a lot of text in the world, more than some of us realized before we began to read about corpora of 630 billion words. Finding relations among textual entities can therefore be extremely useful. Furthermore, when people read the outputs of such models, they can fill in the semantic gaps themselves to understand what the model has found. We argued at the beginning that words ultimately gain their meaningfulness by connecting to the world. Humans can provide that connection, when the model produces textual output and the human connects the text to the actual world. If a particular word has the following LSA neighbors, sculptured, sculptor, sculpture, Acropolis, colonnade, Athena, Parthenon, and gymnasiums, readers can readily figure out that this word must have something to do with artwork common in ancient Greece, found in temples, etc. But this is the human interpretation of the word, not something the model has told us. When the human and the model work together, they may be able to interpret unknown words in a way that goes beyond the model’s own performance. We suspect that this is part of the reason why researchers have taken these models (especially the early count-based ones) seriously as theories of word meaning. When the researchers (or anyone) read the list of nearest neighbors, they are identifying the links to the target word by using their own knowledge to fill in gaps and infer the underlying meaning. Models, lacking that knowledge and inference ability, cannot do so. (The word with the above neighbors was statue.)

We now see the research frontier shifting from problems of pattern recognition to problems in reasoning, compositionality, concept learning, multi-modal learning, etc. This is cause for optimism. Further progress will come from improvements in the training data, but this is unlikely to be enough. A new generation of NLP systems, developed with the five desiderata in mind, would look quite different from today’s systems, while also building on their successes. Innovations will be needed to achieve more realistic representations of meaning; we pointed to advances in neuro-symbolic modeling and grounded language learning as important developing areas. Additional attention will be needed on incorporating background knowledge and encouraging abstraction, so that the representations can be accessed by goals and beliefs. We hope that our five desiderata help to pose new challenges and stimulate new research on more psychologically realistic models of semantics.

GPT-3 and scaling up

We analyzed a number of NLP systems throughout this article (LSA, CBOW, BERT, GPT-2, a caption generation system, etc.), and GPT-2 was the largest model that we considered. Recently, its successor, GPT-3, was published by the same group at OpenAI (Brown et al., 2020). In terms of architecture and training procedure, little has changed; GPT-3 is a large-scale autoregressive Transformer with the same architecture as GPT-2. In terms of scale, GPT-3 is a marvel of engineering that is strikingly larger than GPT-2. GPT-3 has 175 billion parameters (compared to GPT-2’s 1.5 billion) and was trained on large swaths of the internet for a total of about 500 billion tokens (25 times more data than GPT-2). We argued that training with more data would not itself lead to a model of psychological semantics, and thus GPT-3 conveniently offers a case study in scaling up. The model is new and much about it is still unknown; we did not analyze GPT-3 directly in our own tests as it was unavailable at the time of writing. Nevertheless, we offer some observations of what GPT-3 accomplishes and what it doesn’t.

GPT-3 is a strong few-shot learner. As with GPT-2, the authors do not fine-tune the model for specific tasks; it is trained solely on predicting the next word in a sequence. GPT-3 can perform many different tasks, however, through different text-based prompts that preface the relevant query. If provided with a few examples of question answering, grammar correction, or numerical addition, it often continues the task in response to new queries. In some cases, it can handle novel tasks that are unlikely to exist in the training corpus. Its flexibility to reuse its representations for new tasks, without having to re-train (Lake et al., 2017), resembles the flexibility of human semantic representations.

In other ways, GPT-3 is no closer than GPT-2 to meeting the five desiderata for a model of psychological semantics. (Admirably, the GPT-3 paper has a thorough Limitations section that we draw from here.) First, GPT-3 is trained from text alone; thus, it is limited in all the ways that all ungrounded representations of words are (Desiderata 1 and 5). Second, GPT-3 aims to predict the next word in a sequence, no matter the task or the context; instead, humans produce words to express internal states such as goals, desires, etc. (Desideratum 2). The GPT-3 authors mention this limitation: “useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making prediction” (Brown et al., 2020, pg. 33). Third, we don’t know yet how GPT-3 performs on tests of complex concepts (Desideratum 3), although we wouldn’t be surprised to see GPT-3 outperform GPT-2. GPT-3 has far more training data, and all of the complex concepts in Murphy (1988) are likely covered during training. We note that harder tests with genuinely novel compositions would pose greater challenges.

Fourth, GPT-3 has no new mechanisms for connecting word representations to beliefs, or for changing its beliefs based on linguistic input (Desideratum 4). In fact, the larger-scale corpus, combined with weaker curation and filtering compared to GPT-2, could weaken the firmness of any proto-beliefs the model does have, as the training data now likely contain more contradictory text for a given fact. The authors report that GPT-3 frequently contradicts itself during text generation, a problem also present in GPT-2 as we discussed in Section 7. This is evident in the two generated news articles provided in the GPT-3 paper (p. 27). In one article, it's hard to understand the core proposition the article is supposed to convey (Did Joaquin Phoenix pledge to wear a tux to the Oscars, or not?). In the other, which human judges thought was the most human-like sample, GPT-3 describes a split in the United Methodist Church. GPT-3 writes that “the new split will be the second in the church's history. The first occurred in 1968…” But three sentences later, it goes on to describe a third split in the church, “In 2016, the denomination was split over ordination of transgender clergy…”, although perhaps it means an “intellectual” rather than “physical” split (if it knows the difference). Last, there is no evidence yet that GPT-3 has learned semantic representations that better capture abstract meaning, in the way needed for human-level instruction following (Desideratum 5).

Relatedly, although GPT-3 exhibits strong performance on a wide range of tasks, it performs poorly on several adversarially generated benchmarks that pit surface-level patterns against the underlying semantic content (Sakaguchi et al., 2019; Nie et al., 2020). These types of tasks pose particular challenges to pattern recognition systems, however powerful and impressive those systems are. The GPT-3 authors themselves note that there are limits to scaling up, writing that a fundamental limit to “scaling up any [language] model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pretraining objective” (p. 33). We hope the five desiderata provide additional guidance on how to venture beyond pattern recognition and secure a more conceptual foundation for word meaning, leading to more powerful and more psychologically plausible models.

Past Critiques Within Psychology

The shortcomings of purely text-based models have not gone unnoticed in the psychological literature. In particular, the need for words to make connections with the world has been pointed out and debated. However, those discussions have tended to take a different focus from ours. First, the text-based approaches (primarily HAL and LSA) have often been contrasted with embodiment theories, which rely on perceptual symbol systems accounts of representation and meaning (e.g., Andrews et al., 2014; Louwerse, 2007). In this theory, symbols are not only linked to perception but in fact are perceptual or motor simulations, which capture the essence of a concept (see Barsalou, 1999, for a review). In addition to perceptual symbols, emotion is also often mentioned as a potential feature of semantic representations (De Deyne et al., 2018) that might not be captured by distributional models.

A second difference from our approach is that these discussions often take a purely empirical tack. They use a model to attempt to explain some human data and then ask whether the fit could be improved by the addition of perceptual information (etc.), or the reverse, whether adding model-derived information improves the fit to human data. For example, De Deyne et al. (2018) investigated whether affective and featural information improved the modeling of human similarity judgments beyond text-based word embeddings. Louwerse (2007) developed a new measure of sentence coherence to see whether LSA could explain apparent embodiment effects found by Glenberg and Robertson (2000). Mandera et al. (2017, p. 66) found that adding text-based model data improved the prediction of priming results over human-generated feature lists.

These studies can provide insight into what information is contained in different models and measures, as well as which information is most relevant to a given task. However, they do not address the basic problem faced by pattern-based NLP models: how to accomplish the main goals of a theory of semantics, summarized in Table 1. Their evaluation data are generally relatedness measures of word pairs (or pairs of sentences); that is, they remain primarily within the realm of words. As we reviewed above, describing a situation or scene does not take words as input, so there must be a way to generate a sentence from perception and knowledge. And carrying out an instruction or changing one's knowledge about the world cannot be done by activating words whose meanings are only other related words. By focusing solely on text-based tasks, empirical tests cannot discover the main shortcoming of text-based approaches as theories of meaning: they do not explain how language is used to describe and ultimately change the world.

That problem applies even beyond the issue of embodiment. Embodiment makes specific claims about the nature of symbols, but those claims are not necessary to describe the problem with text-based theories of meaning. The primary issue is that words need to be connected to our concepts and then to the outside world in some way (Harnad, 1990). That is the critical link that must be included somewhere in every psychological theory of meaning. Note that we are not arguing that authors have been wrong in what they say about embodiment, or about whether there is a conflict between embodiment and distributional semantics. We are merely pointing out that these issues apply more broadly, whether or not one thinks of concepts as embodied.

8 Conclusion

There is a long tradition in cognitive science of theorists claiming that such-and-such computational paradigm cannot do such-and-such a task or reach a particular cognitive achievement. The record of such predictions is spotty. Often, the particular model criticized was replaced by future versions that were much better. By using a novel architecture, changing the learning algorithm, providing massive amounts of data, and so on, the putative impossible task turned out to be possible—at least to a reasonable degree of accuracy. We do not seek to join these ranks. Our point is not that text-based NLP models can’t achieve interesting and important things; they surely have already, as NLP systems are becoming increasingly prominent in our daily lives (intelligent assistants, dialogue systems, machine translation, etc.). They will continue to advance and accomplish more important things. But they alone will not form the basis of a psychological theory of word meaning.

This may not concern researchers and practitioners seeking to optimize performance on particular tasks. We are not suggesting that NLP should switch its focus to building models of psychological semantics, at least not in every case. If one has large quantities of training data, it may be a very good idea to develop a task-specific model using standard approaches, or fine-tune a language model on that specific task. For example, if the goal is to develop a question answering model for a specific domain, and one has millions of question-answer pairs for training, large-scale pattern recognition may well be sufficient. Our arguments in this paper will have little relevance in such cases.

In other cases, a model of psychological semantics is a higher bar worth reaching for, with real payoff in terms of performance. We will not rehash the limitations of text-based NLP systems as psychological models. However, it is worth considering whether embracing a more psychologically motivated semantics would improve performance in future language applications. To understand language productively and flexibly, to produce reasonable responses to novel input, and to hold actual conversations will likely require something closer to a conceptually based compositional semantics of the sort that people have (Marcus & Davis, 2019). We make the following suggestions.

First, semantic representations need to be based on content, information that makes contact with the world, and not just words connected to words. No matter how sophisticated the statistics or measure that links one word to others, word relations do not provide the basis for being able to talk about actual things and get information from communication. NLP models will need to move beyond pattern recognition and more firmly root themselves in concepts.

Second, word meanings have internal structure. You do not know what a dog is merely by knowing that it is connected to leashes, cats, mammal, leg, fur, toy, barking, etc. Your knowledge must be structured so that you know toys are things that dogs play with, fur is their body covering, mammal is a category they fall into, and so on. In identifying dogs, it is helpful to know that one of their parts is four legs. However, one must also understand in more detail what a dog’s leg is and what it means to be a part. A dog’s head next to four table legs does not add up to a dog, nor are those legs part of the dog. The relations between concepts and their constituents must be somehow encoded in order for the representation to work (Brachman, 1979).
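
The contrast between a flat association list and structured knowledge can be made concrete with a toy example, in the spirit of structured semantic networks (Brachman, 1979); the relation labels are our own illustration:

```python
# Flat associations: we know WHICH words go with "dog", but not HOW.
flat_dog = {"leash", "cat", "mammal", "leg", "fur", "toy", "barking"}

# Structured knowledge: each association carries a labeled relation.
structured_dog = {
    ("dog", "is-a", "mammal"),
    ("dog", "has-part", "leg"),
    ("dog", "covered-by", "fur"),
    ("dog", "plays-with", "toy"),
    ("dog", "produces", "barking"),
}

# Only the structured form supports queries about the relation itself,
# e.g. which associates are actual parts of a dog:
parts = {obj for (subj, rel, obj) in structured_dog
         if subj == "dog" and rel == "has-part"}
print(parts)  # {'leg'}
```

The flat set cannot distinguish a dog's leg from a table leg standing nearby; the labeled relations are what encode that distinction.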

This is accomplished by humans in part through a huge “front end” to their language learning, namely the perceptual-motor apparatus and the knowledge of the world it provides. When a child first learns the word lion, it is almost certainly while viewing an actual lion or a representation of one. The child can perceive the parts, overall shape, color, sound, and possibly behaviors of the lion, without linguistic input. Indeed, most studies of child word learning use the method of ostension to teach words: pointing at an object and labeling it. Children's sophistication in interpreting such experiences is impressive (e.g., Markman, 1989). The result is that they do not need verbal information in order to learn a great deal about what a lion is and how to identify one in the future. No one needs to describe the lion's face or say that the face is part of the lion, because that is learned directly via perception. Indeed, it is doubtful that any verbal description could accurately communicate what we know about lions' faces. Achieving such inferences with a hybrid visual-language model is an exciting possibility, albeit a difficult one to achieve.

When parents do provide verbal instruction, it is often specifically labeled. For example, when teaching a word for a general concept, parents will often mention more specific examples and the set-superset relation that connects them, as in “Chairs, tables, and sofas are all kinds of furniture” (Callanan, 1990). Such a statement indicates hierarchical relations between categories and suggests that chairs, tables, and sofas are at the same level (co-hyponyms) under the umbrella category of furniture. Parents may also provide information such as “Kitties say ‘meow’,” which likewise conveys the relation between kitten and meow, a relation that is different from the one between kitten and animal, say. Knowing the specific relations between objects and properties (taught via sentences that include the words for those objects and properties) gives much more information about the world than simply knowing words' textual relations.
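
A minimal sketch of the two relations just mentioned, set-superset membership and a labeled property, and the kind of inference each supports (the dictionaries and relation names are our own illustration):

```python
# Superset relation conveyed by "Chairs, tables, and sofas are all
# kinds of furniture" (Callanan, 1990), plus a property relation
# conveyed by "Kitties say 'meow'".
is_a = {"chair": "furniture", "table": "furniture", "sofa": "furniture",
        "kitten": "animal"}
says = {"kitten": "meow"}

def co_hyponyms(word):
    """Words sharing the same immediate superordinate category."""
    parent = is_a.get(word)
    return {w for w, p in is_a.items() if p == parent and w != word}

print(co_hyponyms("chair"))  # {'table', 'sofa'}
```

Because the two relations are kept distinct, the system can infer that a sofa is furniture but does not say “meow”; an unlabeled co-occurrence list would conflate the two.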

Contemporary multi-modal models, such as ones that learn to recognize objects and carry out instructions (see Figure 6), are taking real steps toward some of these goals. One might well say that these models have a semantics. But as potential theories of psychological semantics, their linguistic abilities are usually too limited and too tied to specific patterns in their training experience. Looking toward the future, a robot that learns language while interacting with objects in the world (as well as receiving textual input) might well develop a semantics in which words successfully relate to things in the world, and the robot can describe the world with its language. Saying whether the robot's representations are functionally the same as human speakers' representations would require a detailed comparison of its abilities and of the representations' internal structure. At present, these visual world models have simple linguistic representations that don't seem adequate as descriptions of human meanings, but it remains to be seen how these models develop.

Final conclusion

Building a complete model of human word meaning requires, to some degree, building a conceptual structure of the world that people live in. Parts of that conceptual structure are linked to words, such that words pick out categories, properties, or relations. It doesn't seem likely that one can build such a structure out of text statistics or predictions, although sentences could certainly be one input to a learning system trying to build a coherent structure of the world. Such a system would have to try to interpret the sentences, identifying relations, categories, and properties and then hypothesizing their connections. Purely text-based systems do not do that, or even try to do that. It is worth exploring how much text-based systems can do, because sampling even internet-scale data is easier than building the artificial intelligence needed to learn the detailed information in word meanings. Humans, with the advantages of perception, action, and reasoning, are able to build complex knowledge structures, and these are part of the basis of word meanings. Computational models of meaning will also have to form such structures if they are to be adequate psychological theories of meaning and, we propose, if they are to become sophisticated tools that can produce and understand language more broadly.


We thank Marco Baroni, Gemma Boleda, Tammy Kwan, Maxwell Nye, Josh Tenenbaum, and Tomer Ullman for helpful comments on an early draft. Through B. Lake’s position at NYU, this research was partially funded by NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science, and DARPA Award A000011479; PO: P000085618 for the Machine Common Sense program.


  • Agrawal . (2017) Agrawal2017Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, CL., Parikh, D.  Batra, D.  2017. VQA: Visual question answering VQA: Visual question answering. Proceedings of the International Conference on Computer Vision (ICCV) Proceedings of the International Conference on Computer Vision (ICCV) ( 2425–2433). Venice, ItalySpringer US.
  • Alberti . (2019) Alberti2019Alberti, C., Lee, K.  Collins, M.  2019. A BERT baseline for the natural questions A BERT baseline for the natural questions. arXiv preprint.
  • Andrews . (2014) Andrews2014Andrews, M., Frank, S.  Vigliocco, G.  2014. Reconciling embodied and distributional accounts of meaning in language Reconciling embodied and distributional accounts of meaning in language. Topics in Cognitive Science6359–370.
  • Bahdanau . (2015) Bahdanau2015Bahdanau, D., Cho, K.  Bengio, Y.  2015. Neural machine translation by jointly learning to align and translate Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR). International Conference on Learning Representations (ICLR).
  • Baroni (2016) Baroni2016Baroni, M.  2016. Grounding distributional semantics in the visual world Grounding distributional semantics in the visual world. Language and Linguistics Compass103–13.
  • Baroni, Bernardi  Zamparelli (2014) Baroni2014aBaroni, M., Bernardi, R.  Zamparelli, R.  2014. Frege in space: A program of compositional distributional semantics Frege in space: A program of compositional distributional semantics. Linguistic Issues in Language Technology95–110.
  • Baroni, Dinu  Kruszewski (2014) Baroni2014Baroni, M., Dinu, G.  Kruszewski, G.  2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. 52nd Annual Meeting of the Association for Computational Linguistics (ACL)1238–247.
  • Baroni . (2010) Baroni2010bBaroni, M., Murphy, B., Barbu, E.  Poesio, M.  2010. Strudel: A corpus-based semantic model based on properties and types Strudel: A corpus-based semantic model based on properties and types. Cognitive Science34222–254.
  • Baroni  Zamparelli (2010) Baroni2010Baroni, M.  Zamparelli, R.  2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. Conference on Empirical Methods in Natural Language Processing (EMNLP) Conference on Empirical Methods in Natural Language Processing (EMNLP) ( 1183–1193).
  • Barsalou (1999) Barsalou1999Barsalou, LW.  1999. Perceptual symbol systems Perceptual symbol systems. Behavioral and Brain Sciences22577–609.
  • Bisk . (2020) Bisk2020Bisk, Y., Holtzman, A., Thomason, J., Andreas, J., Bengio, Y., Chai, J.Turian, J.  2020. Experience grounds language Experience grounds language. arXiv preprint.
  • Blei . (2003) Blei2003Blei, DM., Ng, AY.  Jordan, MI.  2003. Latent dirichlet allocation Latent dirichlet allocation. Journal of Machine Learning Research3993–1022.
  • Bordes . (2017) Bordes2016Bordes, A., Boureau, YL.  Weston, J.  2017. Learning end-to-end goal-oriented dialog Learning end-to-end goal-oriented dialog. International Conference on Learning Representations (ICLR). International Conference on Learning Representations (ICLR).
  • Bosselut . (2020) Bosselut2020Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A.  Choi, Y.  2020.

    CoMET: Commonsense transformers for automatic knowledge graph construction CoMET: Commonsense transformers for automatic knowledge graph construction.

    Annual Meeting of the Association for Computational Linguistics (ACL)4762–4779.
  • Brachman (1979) Brachman1979Brachman, RJ.  1979. On the epistemological status of semantic networks On the epistemological status of semantic networks. NV. Findler (), Associative networks: Representation and use of knowledge by computers Associative networks: Representation and use of knowledge by computers ( 3-50). New YorkAcademic Press.
  • Brown . (2020) Brown2020Brown, TB., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P.Amodei, D.  2020. Language models are few-shot learners Language models are few-shot learners. arXiv preprint.
  • Callanan (1990) Callanan1990Callanan, MA.  1990. Parents’ descriptions of objects: Potential data for children’s inferences about category principles Parents’ descriptions of objects: Potential data for children’s inferences about category principles. Cognitive Development5101-122.
  • Carey (1978) Carey1978Carey, S.  1978. The child as word learner The child as word learner. J. Bresnan, G. Miller  M. Halle (), Linguistic Theory and Psychological Reality Linguistic theory and psychological reality ( 264–293).
  • H. Chen . (2017) Chen2017Chen, H., Liu, X., Yin, D.  Tang, J.  2017. A survey on dialogue systems: Recent advances and new frontiers A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter.
  • X. Chen . (2015) Chen2015Chen, X., Fang, H., Lin, TY., Vedantam, R., Gupta, S., Dollar, P.  Zitnick, CL.  2015. Microsoft COCO captions: Data collection and evaluation server Microsoft COCO captions: Data collection and evaluation server. arXiv preprint.
  • Chierchia  McConnell-Ginet (1990) ChierchiaGennaroandMcConnell-Ginet1990Chierchia, G.  McConnell-Ginet, S.  1990. Meaning and grammar: An introduction to semantics Meaning and grammar: An introduction to semantics. Cambridge, MAMIT Press.
  • Cho . (2014) Cho2014Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.  Bengio, Y.  2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation Learning phrase representations using RNN encoder-decoder for statistical machine translation. Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Clark (1983) Clark1983Clark, EV.  1983. Meaning and concepts Meaning and concepts. JH. Flavell  EM. Markman (), Manual of child psychology: Cognitive development (Vol 3) Manual of child psychology: Cognitive development (vol 3) ( 787-840). New YorkWiley.
  • Collins  Quillian (1968) Collins1968Collins, AM.  Quillian, MR.  1968. Retrieval time from semantic memory Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior8240–247.
  • Das . (2018) Das2018Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D.  Batra, D.  2018. Embodied question answering Embodied question answering. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).
  • De Deyne . (2018) Deyne2018De Deyne, S., Navarro, DJ.  Collell, G.  2018. Visual and affective grounding in language and mind Visual and affective grounding in language and mind. PsyArXiv preprint.
  • Deng . (2009) Deng2009Deng, J., Dong, W., Socher, R., Li, LJ., Li, K.  Fei-Fei, L.  2009. ImageNet: A large-scale hierarchical image database ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Dennis (2007) Dennis2007Dennis, S.  2007. How to use the LSA website How to use the LSA website. TK. Landauer, DS. McNamara, S. Dennis  W. Kintsch (), Handbook of Latent Semantic Analysis Handbook of latent semantic analysis ( 57–70). Lawrence Erlbaum Associates.
  • Devlin . (2019) Devlin2019Devlin, J., Chang, MW., Lee, K.  Toutanova, K.  2019. BERT: Pre-training of deep bidirectional transformers for language understanding BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Dowty . (1981) Dowty1981Dowty, DR., Wall, R.  Peters, S.  1981. Introduction to montague semantics Introduction to montague semantics. DordrechtKluwer Academic Publishers.
  • Eisenstein (2019) Eisenstein2019Eisenstein, J.  2019. Introduction to natural language processing Introduction to natural language processing. Cambridge, MAMIT Press.
  • Elman (1990) Elman1990Elman, JL.  1990. Finding structure in time Finding structure in time. Cognitive Science14179–211.
  • Erk (2012) Erk2012Erk, K.  2012. Vector space models of word meaning and phrase meaning: A survey Vector space models of word meaning and phrase meaning: A survey. Linguistics and Language Compass6635–653.
  • Firth (1957) Firth1957Firth, JR.  1957. A synopsis of linguistic theory 1930-1955 A synopsis of linguistic theory 1930-1955. Studies in Linguistic Analysis Studies in linguistic analysis ( 1–32).
  • Fodor (1975) Fodor1975Fodor, J.  1975. The language of thought The language of thought. Cambrida, MAHarvard University Press.
  • Frege (1892) Frege1892Frege, G.  1892. On sense and reference On sense and reference. Zeitschrift fur Philosophie und Philosophische Kritik10025–50, [Reprinted in P. T. Geach, M. Black (1960). Translations from the philosophical writings of Gottlob Frege. Oxford: Blackwell].
  • Gelman (2003) Gelman2003Gelman, SA.  2003. The essential child: Origins of essentialism in everyday thought The essential child: Origins of essentialism in everyday thought. New York, NYOxford University Press.
  • Gershman  Tenenbaum (2015) Gershman2010Gershman, SJ.  Tenenbaum, JB.  2015. Phrase similarity in humans and machines Phrase similarity in humans and machines. Proceedings of the 37th Annual Conference of the Cognitive Science Society. Proceedings of the 37th Annual Conference of the Cognitive Science Society.
  • Glenberg  Robertson (2000) Glenberg2000Glenberg, AM.  Robertson, DA.  2000. Symbol grounding and meaning: A comparison of high-dimensional and embodied theories of meaning Symbol grounding and meaning: A comparison of high-dimensional and embodied theories of meaning. Journal of Memory and Language43379–401.
  • Goodman . (2008) Goodman2008aGoodman, ND., Tenenbaum, JB., Feldman, J.  Griffiths, TL.  2008. A rational analysis of rule-based concept learning A rational analysis of rule-based concept learning. Cognitive Science32108–54.
  • Goodman . (2015) Goodman2014Goodman, ND., Tenenbaum, JB.  Gerstenberg, T.  2015. Concepts in a probabilistic language of thought Concepts in a probabilistic language of thought. E. Margolis  S. Laurence (), Concepts: New Directions. Concepts: New directions. Cambridge, MAMIT Press.
  • Graves . (2013) Graves2013aGraves, A., Mohamed, A.  Hinton, G.  2013. Speech recognition with deep recurrent neural networks Speech recognition with deep recurrent neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ( 6645–6649).
  • Griffin  Bock (2000) Griffin2000Griffin, ZM.  Bock, K.  2000. What the eyes say about speaking What the eyes say about speaking. Psychological Science11274-279.
  • Griffiths . (2007) Griffiths2007aGriffiths, TL., Steyvers, M.  Tenenbaum, JB.  2007. Topics in semantic representation. Topics in semantic representation. Psychological review114211–44.
  • Günther  Marelli (2016) Gunther2016Günther, F.  Marelli, M.  2016. Understanding karma police: The perceived plausibility of noun compounds as predicted by distributional models of semantic representation Understanding karma police: The perceived plausibility of noun compounds as predicted by distributional models of semantic representation. PLoS ONE11e0163200.
  • Hannun . (2014) Hannun2014Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E.Ng, AY.  2014. Deep speech: Scaling up end-to-end speech recognition Deep speech: Scaling up end-to-end speech recognition. arXiv preprint.
  • Harnad (1990) Harnad1990Harnad, S.  1990. The symbol grounding problem The symbol grounding problem. Physica D: Nonlinear Phenomena42335–346.
  • Harris (1954) Harris1954Harris, ZS.  1954. Distributional structure Distributional structure. Word10146–162.
  • He . (2016) He2016He, K., Zhang, X., Ren, S.  Sun, J.  2016. Deep residual learning for image recognition Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Hill . (2017) Hill2017Hill, F., Clark, S., Hermann, KM.  Blunsom, P.  2017. Understanding early word learning in situated artificial agents Understanding early word learning in situated artificial agents. arXiv preprint.
  • Hill . (2020) Hill2020Hill, F., Lampinen, A., Schneider, R., Clark, S., Botvinick, M., McClelland, JL.  Santoro, A.  2020. Environmental drivers of systematicity and generalisation in a stiuated agent Environmental drivers of systematicity and generalisation in a stiuated agent. International Conference on Learning Representations (ICLR). International Conference on Learning Representations (ICLR).
  • Hill . (2015) Hill2015Hill, F., Reichart, R.  Korhonen, A.  2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
  • Hochreiter  Schmidhuber (1997) Hochreiter1997Hochreiter, S.  Schmidhuber, J.  1997. Long short-term memory Long short-term memory. Neural computation91735–1780.
  • Jia  Liang (2017) Jia2017Jia, R.  Liang, P.  2017. Adversarial examples for evaluating reading comprehension systems Adversarial examples for evaluating reading comprehension systems. Conference on Empirical Methods in Natural Language Processing, Proceedings (EMNLP) Conference on Empirical Methods in Natural Language Processing, Proceedings (EMNLP) ( 2021–2031).
  • Johnson, Hariharan, van Der Maaten . (2017) Johnson2017aJohnson, J., Hariharan, B., van Der Maaten, L., Fei-Fei, L., Zitnick, CL.  Girshick, R.  2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. Computer Vision and Pattern Recognition (CVPR). Computer Vision and Pattern Recognition (CVPR).
  • Johnson, Hariharan, van der Maaten . (2017) Johnson2017Johnson, J., Hariharan, B., van der Maaten, L., Hoffman, J., Fei-fei, L., Zitnick, CL.  Girshick, R.  2017. Inferring and executing programs for visual reasoning Inferring and executing programs for visual reasoning. International Conference on Computer Vision. International Conference on Computer Vision.
  • Keil (1989) Keil1989Keil, FC.  1989. Concepts, kinds, and cognitive development. Concepts, kinds, and cognitive development. Cambrida, MAMIT Press.
  • Kemp  Tenenbaum (2008) Kemp2008Kemp, C.  Tenenbaum, JB.  2008. The discovery of structural form. The discovery of structural form. Proceedings of the National Academy of Sciences10510687–92.
  • Kiela . (2016) Kiela2016Kiela, D., Bulat, L., Vero, AL.  Clark, S.  2016. Virtual embodiment: A scalable long-term strategy for artificial intelligence research Virtual embodiment: A scalable long-term strategy for artificial intelligence research. arXiv preprint.
  • Kim . (2019) Kim2019cKim, JS., Elli, GV.  Bedny, M.  2019. Reply to lewis et al.: Inference is key to learning appearance from language, for humans and distributional semantic models alike Reply to lewis et al.: Inference is key to learning appearance from language, for humans and distributional semantic models alike. Proceedings of the National Academy of Sciences (PNAS)11619239–19240.
  • Kintsch (2001) Kintsch2001Kintsch, W.  2001. Predication Predication. Cognitive Science25173-202.
  • Kintsch, W. (2007). Meaning in context. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 89–105). Lawrence Erlbaum Associates.
  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (pp. 1097–1105).
  • Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., … Petrov, S. (2019). Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453–466.
  • Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350, 1332–1338.
  • Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40, E253.
  • Landauer, T. K. (2007). LSA as a theory of meaning. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 3–34). Lawrence Erlbaum Associates.
  • Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
  • Lazaridou, A., Pham, N. T., & Baroni, M. (2015). Combining language and vision with a multimodal skip-gram model. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (pp. 153–163).
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
  • Levelt, W. J. (1993). Speaking: From intention to articulation. Cambridge, MA: MIT Press.
  • Lewis, M., Zettersten, M., & Lupyan, G. (2019). Distributional semantics as a source of visual knowledge. Proceedings of the National Academy of Sciences (PNAS), 116, 19237–19238.
  • Li, J., Galley, M., Brockett, C., Spithourakis, G. P., Gao, J., & Dolan, B. (2016). A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., … Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. arXiv preprint.
  • Linzen, T., Dupoux, E., & Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4, 521–535.
  • Louwerse, M. M. (2007). Symbolic or embodied representations: A case for symbol interdependency. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 107–120). Lawrence Erlbaum Associates.
  • Lucas, M. (2000). Semantic priming without association: A meta-analytic review. Psychonomic Bulletin & Review, 7, 618–630.
  • Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203–208.
  • Mandera, P., Keuleers, E., & Brysbaert, M. (2017). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92, 57–78.
  • Mao, J., Gan, C., Kohli, P., Tenenbaum, J. B., & Wu, J. (2019). The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations (ICLR).
  • Marcus, G., & Davis, E. (2019). Rebooting AI: Building artificial intelligence we can trust. New York, NY: Pantheon.
  • Marelli, M., Gagné, C. L., & Spalding, T. L. (2017). Compounding as abstract operation in semantic space: Investigating relational effects through a large-scale, data-driven computational model. Cognition, 166, 207–224.
  • Markman, E. M. (1989). Categorization and naming in children. Cambridge, MA: MIT Press.
  • Martin, D. I., & Berry, M. W. (2007). Mathematical foundations behind latent semantic analysis. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 35–56). Lawrence Erlbaum Associates.
  • McClelland, J. L., Hill, F., Rudolph, M., Baldridge, J., & Schütze, H. (2019). Extending machine language models toward human-level language understanding. arXiv preprint.
  • Mervis, C. B. (1987). Child-basic object categories and early lexical development. In U. Neisser (Ed.), Concepts and conceptual development: Ecological and intellectual factors in categorization (pp. 201–233). Cambridge University Press.
  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In International Conference on Learning Representations (ICLR).
  • Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2018). Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC).
  • Mikolov, T., Joulin, A., & Baroni, M. (2016). A roadmap towards machine intelligence. arXiv preprint.
  • Mostafazadeh, N., Misra, I., Devlin, J., Mitchell, M., He, X., & Vanderwende, L. (2016). Generating natural questions about an image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (pp. 1802–1813).
  • Murphy, G. L. (1988). Comprehending complex concepts. Cognitive Science, 12, 529–562.
  • Murphy, G. L. (2002). The big book of concepts. Cambridge, MA: MIT Press.
  • Murphy, G. L., & Andrew, J. M. (1993). The conceptual basis of antonymy and synonymy in adjectives. Journal of Memory and Language, 32, 301–319.
  • Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289–316.
  • Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., & Kiela, D. (2020). Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL).
  • Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning. Urbana, IL: University of Illinois Press.
  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP).
  • Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., & Miller, A. (2019). Language models as knowledge bases? In Empirical Methods in Natural Language Processing (EMNLP).
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
  • Rips, L. J., Shoben, E. J., & Smith, E. E. (1973). Semantic distance and the verification of semantic relations. Journal of Verbal Learning and Verbal Behavior, 12, 1–20.
  • Roads, B. D., & Love, B. C. (2020). Learning as the unsupervised alignment of conceptual systems. Nature Machine Intelligence, 2, 76–82.
  • Rogers, T. T., & McClelland, J. L. (2004). Semantic cognition. Cambridge, MA: MIT Press.
  • Roller, S., Erk, K., & Boleda, G. (2014). Inclusive yet selective: Supervised distributional hypernymy detection. In 25th International Conference on Computational Linguistics (COLING) (pp. 1025–1036).
  • Rothe, A., Lake, B. M., & Gureckis, T. M. (2017). Question asking as program generation. In Advances in Neural Information Processing Systems (NIPS).
  • Ruis, L., Andreas, J., Baroni, M., Bouchacourt, D., & Lake, B. M. (2020). A benchmark for systematic generalization in grounded language understanding. arXiv preprint.
  • Rumelhart, D. E. (1978). Schemata: The building blocks of cognition. In R. J. Spiro, B. C. Bruce, & W. F. Brewer (Eds.), Theoretical issues in reading comprehension (pp. 33–58). Hillsdale, NJ: Lawrence Erlbaum Associates.
  • Sakaguchi, K., Bras, R. L., Bhagavatula, C., & Choi, Y. (2019). WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint.
  • Schank, R. C., & Abelson, R. P. (1977). Scripts, plans, goals and understanding: An inquiry into human knowledge structures. Hillsdale, NJ: Lawrence Erlbaum.
  • Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.
  • Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Long Papers (pp. 1715–1725).
  • Serban, I. V., Sordoni, A., Bengio, Y., Courville, A., & Pineau, J. (2016). Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.
  • Shepard, R. N. (1974). Representation of structure in similarity data: Problems and prospects. Psychometrika, 39, 373–421.
  • Shuster, K., Humeau, S., Bordes, A., & Weston, J. (2020). Image-Chat: Engaging grounded conversations. In Association for Computational Linguistics (ACL).
  • Smith, E. E., Osherson, D. N., Rips, L. J., & Keane, M. (1988). Combining prototypes: A selective modification model. Cognitive Science, 12, 485–527.
  • Solomon, K., Medin, D., & Lynch, E. (1999). Concepts do more than categorize. Trends in Cognitive Sciences, 3, 99–105.
  • Sordoni, A., Galley, M., Auli, M., Brockett, C., Ji, Y., Mitchell, M., … Dolan, B. (2015). A neural network approach to context-sensitive generation of conversational responses. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (pp. 196–205).
  • Tomasello, M., Kruger, A. C., & Ratner, H. H. (1993). Cultural learning. Behavioral and Brain Sciences, 16, 495–552.
  • Urbanek, J., Fan, A., Karamcheti, S., Jain, S., Humeau, S., Dinan, E., … Weston, J. (2019). Learning to speak and act in a fantasy text adventure game. arXiv preprint.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems.
  • Vecchi, E. M., Marelli, M., & Zamparelli, R. (2016). Spicy adjectives and nominal donkeys: Capturing semantic deviance using compositionality in distributional spaces. Cognitive Science.
  • Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2014). Show and tell: A neural image caption generator. In International Conference on Machine Learning (ICML).
  • Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).
  • Wang, Z., & Lake, B. M. (2019). Modeling question asking using neural program generation. arXiv preprint, 1–14.
  • Wisniewski, E. J. (1997). When concepts combine. Psychonomic Bulletin & Review, 4, 167–183.
  • Xian, Y., Schiele, B., & Akata, Z. (2017). Zero-shot learning: The good, the bad and the ugly. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., … Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML).
  • Young, S., Gašić, M., Thomson, B., & Williams, J. D. (2013). POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101, 1160–1179.
  • Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., & Weston, J. (2018). Personalizing dialogue agents: I have a dog, do you have pets too? In 56th Annual Meeting of the Association for Computational Linguistics (pp. 2204–2213).