Extending Machine Language Models toward Human-Level Language Understanding

12/12/2019 ∙ by James L. McClelland, et al.

Language is central to human intelligence. We review recent breakthroughs in machine language processing and consider what remains to be achieved. Recent approaches rely on domain general principles of learning and representation captured in artificial neural networks. Most current models, however, focus too closely on language itself. In humans, language is part of a larger system for acquiring, representing, and communicating about objects and situations in the physical and social world, and future machine language models should emulate such a system. We describe existing machine models linking language to concrete situations, and point toward extensions to address more abstract cases. Human language processing exploits complementary learning systems, including a deep neural network-like learning system that learns gradually as machine systems do, as well as a fast-learning system that supports learning new information quickly. Adding such a system to machine language models will be an important further step toward truly human-like language understanding.

Principles of Neural Computation

The principles of neural computation are domain-general principles inspired by the human brain. They were first articulated in the 1950s (rosenblatt1961principles) and further developed in the 1980s in the Parallel Distributed Processing (PDP) framework for modeling cognition (rm86). A central idea of this approach is that structure in language and other cognitive domains is an emergent phenomenon, captured in learned connection weights and resulting in context-sensitive, distributed representations whose characteristics reflect a gradual, input-statistics-dependent learning process (rm86past). The models treat the symbols and rules of classical linguistic theory as consequences of processing and learning, not entities whose structure must be built in. Instead of discrete symbols for linguistic units, these models rely on patterns of activity, often called embeddings, over arrays of neuron-like processing units. Instead of explicit systems of rules, they rely on learned matrices of connection weights to map patterns on one set of units into patterns on others.

Another key principle is mutual constraint satisfaction (rumelhart77interactive). For example, the meaning of a sentence depends on its structure (its organization into constituent phrases); but so too can the structure depend on the meaning. Consider the sentence A boy hit a man with a __. If the missing word is bat, with a bat is read as part of a verb phrase headed by hit, and specifies the instrument used to carry out the action. But if beard fills the blank, with a beard is part of a noun phrase describing who was hit. Even the segmentation of spoken or written language into elementary segments (e.g., letters) depends in part on meaning and context, as illustrated in Fig. 1. Rumelhart (rumelhart77interactive) sketched an interactive model of language understanding in which estimates of probabilities about all aspects of an input constrain estimates of the probability of every other aspect. The idea was captured in a model of context effects in perception (mcclelland1981interactive). Later work (hopfield1982neural, ackley1985learning) linked these ideas to energy minimization in statistical physics. Neural language modeling research, which we now describe, incorporates these principles.
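
To make the energy-minimization view of mutual constraint satisfaction concrete, here is a toy Hopfield-style settling process for the bat/beard example. The three units, their weights, and the bias are invented for illustration only; they are not taken from the cited models.

```python
# Toy illustration of mutual constraint satisfaction as energy minimization in a
# Hopfield-style network: units representing hypotheses excite consistent
# hypotheses and inhibit inconsistent ones, and repeated updates settle into a
# low-energy, mutually consistent interpretation.
import numpy as np

# Units: 0 = "filler word is 'bat'",
#        1 = "with-phrase modifies the verb (instrument reading)",
#        2 = "with-phrase modifies the noun (description of the man)".
W = np.array([[ 0.0,  1.0, -1.0],   # 'bat' supports the instrument reading
              [ 1.0,  0.0, -1.5],   # the two readings inhibit each other
              [-1.0, -1.5,  0.0]])
bias = np.array([0.5, 0.0, 0.0])     # weak bottom-up evidence for 'bat'

state = np.zeros(3)
for _ in range(20):                  # asynchronous settling
    for i in range(3):
        net_input = W[i] @ state + bias[i]
        state[i] = 1.0 if net_input > 0 else 0.0

energy = -0.5 * state @ W @ state - bias @ state
print(state, energy)                 # settles on 'bat' plus the instrument reading
```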

Neural Language Modeling

An Early Neural Language Model

Elman (elm90) built on the principles above to demonstrate how neural models can capture key characteristics of language structure through learning, a feat once considered impossible (gold1967language). The model provides a starting point for understanding recent developments. Elman used the recurrent neural network shown in Fig. 2a. The network was trained to predict the next word in a sequence based on the current word and its own hidden (that is, learned internal) representation from the previous time step. These two inputs are each multiplied by a matrix of connection weights (represented by the labeled arrows in the figure) to produce vectors that are summed to form the input to the hidden layer of units. The elements of this vector undergo a transformation limiting the range of activation values, resulting in the hidden layer representation. This in turn is multiplied by the matrix of weights from the hidden layer to the output layer to generate a vector used to predict the probability of each of the possible successor words. Learning in this and other neural models is based on the discrepancy between the network’s output and the actual next word; the values of the connection weights are adjusted by a small amount to reduce the discrepancy. The network is called recurrent because the same connection weights (denoted by arrows in the figure) are used to process each successive word; the hidden representation becomes the context representation for the next time step.
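
The following minimal sketch shows the ingredients just described in code, assuming a PyTorch environment. The embedding layer, vocabulary size, hidden size, and single-step weight updates are illustrative simplifications, not Elman's original configuration.

```python
# Minimal sketch of an Elman-style simple recurrent network (SRN) trained by
# next-word prediction.
import torch
import torch.nn as nn

class SimpleRecurrentNetwork(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)    # current-word input
        self.w_in = nn.Linear(hidden_size, hidden_size)       # input -> hidden weights
        self.w_context = nn.Linear(hidden_size, hidden_size)  # previous hidden -> hidden weights
        self.w_out = nn.Linear(hidden_size, vocab_size)       # hidden -> output weights

    def forward(self, word_ids, hidden):
        # Combine the current word with the copied-back context representation,
        # squash with a bounded nonlinearity, then score possible successor words.
        x = self.embed(word_ids)
        hidden = torch.tanh(self.w_in(x) + self.w_context(hidden))
        logits = self.w_out(hidden)
        return logits, hidden

# Training step: the discrepancy between the predicted and actual next word drives
# small adjustments to all connection weights (truncated to one step for brevity).
vocab_size, hidden_size = 29, 150
model = SimpleRecurrentNetwork(vocab_size, hidden_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

hidden = torch.zeros(1, hidden_size)
sentence = torch.tensor([3, 11, 7])           # e.g. "man eats bread" as word ids
for t in range(len(sentence) - 1):
    logits, hidden = model(sentence[t:t+1], hidden.detach())
    loss = loss_fn(logits, sentence[t+1:t+2])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```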

Elman demonstrated two crucial findings. First, after training his network to predict the next word in simple sentences like man eats bread, dog chases cat, and girl sleeps, the network’s representations captured the syntactic distinction between nouns and verbs (elm90). It also captured interpretable subcategories, as shown by a hierarchical clustering of the hidden-layer pattern for each word in the training materials (Fig. 2b). This illustrates a key feature of learned representations in neural models: they capture specific as well as general or abstract information. By using a different learned representation for each word, the specific predictive consequences of that word can be exploited. Because representations for words that make similar predictions are similar, and because neural networks exploit similarity, the network can share knowledge about predictions among related words. Second, Elman (elm91) used both simple sentences like boy chases dogs and more complex ones like boy who sees girls chases dogs. In the more complex case, the verb chases must agree with the first noun (boy), not the closest noun (girls), since the sentence contains a main clause (boy chases dogs) interrupted by an embedded clause (boy [who] sees girls). The model learned to predict the verb form correctly despite the intervening clause, showing that its learned connections and distributed representations had acquired sensitivity to the syntactic structure of language, not just to local co-occurrence statistics.

Figure 1: Two handwritten sentences illustrating how context influences the identification of letters in written text. The visual input we read as went in the first sentence and event in the second is the same bit of Rumelhart’s handwriting, cut and pasted into each of the two contexts. Reprinted from (rumelhart77interactive).
Figure 2: (a) Elman’s (1990) simple recurrent network and (b) his hierarchical clustering of the representations it learned, reprinted from (elm90).

Scaling Up to Process Natural Text

Elman’s task—predicting the next word in a sequence—has been central to neural language modeling. However, Elman trained his networks on only tiny fragments of what were effectively toy languages, and for many years it seemed such models would not scale up to natural text. Beginning about 10 years ago, advances in machine language processing began to overcome this limitation for neural models. We describe two crucial developments next.

Long-distance dependencies and pretrained word embeddings

A challenge for language prediction is the indefinite length of the context that might be relevant. Consider this passage:

John put some beer in a cooler and went out with his friends to play volleyball. Soon after he left, someone took the beer out of the cooler. John and his friends were thirsty after the game, and went back to his place for some beers. When John opened the cooler, he discovered that the beer was ___.

Here a reader expects the missing word to be gone. Yet if we replaced someone took the beer with someone took the ice, the expected word would be warm instead. Furthermore, any amount of additional text between beer and gone would not change the predictive relationship. Elman’s network could take only a few words of context into account, reflecting a larger challenge known as the vanishing gradient problem (bengiocitation). In essence, the magnitude of the learning signal that determines the adjustments to connection weights tends to decrease exponentially as the number of layers of weights between an input and an output increases. The development of neural network modules called Long Short-Term Memory (LSTM) modules (hochreiter1997long) that partially overcame this limitation was therefore crucial, greatly increasing the contextual range of neural models. In another crucial development, researchers began to use pre-trained word embeddings derived from learning predictive relationships among words (collobert2011natural, mikolov2013distributed). When training a neural model for a specific task, such embeddings could then be used to represent the training words directly in the model’s input. The embeddings were based on the aggregate statistics of large text corpora and captured both general and specific predictive relationships, supporting generalization at both levels. Using these embeddings, task-focused models trained with relatively small data sets could generalize what they learned from training on frequent words (such as sofa) to infrequent words with similar predictive relationships (such as settee).
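
As a quick illustration of how pre-trained embeddings carry this similarity structure, the sketch below probes publicly available GloVe vectors with the sofa/settee example. It assumes the gensim library and its downloadable vector sets; the corpus name is an assumption about what is available, not part of the cited work.

```python
# Sketch: nearby embeddings reflect similar predictive relationships, supporting
# generalization from frequent words (sofa) to infrequent ones (settee).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")    # pre-trained word embeddings

print(vectors.similarity("sofa", "settee"))      # high: similar usage contexts
print(vectors.similarity("sofa", "volleyball"))  # much lower
print(vectors.most_similar("settee", topn=3))    # neighbors share usage patterns
```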

A limitation of the above approach is that the same representation of a word is used every time it occurs, regardless of context. However, in line with the principle of mutual constraint satisfaction, humans interpret words, including ambiguous words like bank, in accordance with the context (simpson1981meaning), rapidly assigning a contextually appropriate meaning to each word based on all other words. A fixed embedding also limits predicting other words; the predictive implication of the word bank depends on which kind of bank is involved. Initial steps toward context sensitivity (rudolph2017structured) recognize this limitation. We consider a fuller solution in the next section.

Figure 3: High-level depiction of one stage of the bidirectional attention architecture, shown constructing a contextually appropriate representation of bank based on other words in the same sentence. Diagonal lines show how inputs from other positions reach bank’s position; the same occurs at all word positions. See text for details.

Attention and fully contextualized embeddings

Breakthroughs in neural language modeling have come from recent models that construct fully contextualized word representations (peters18elmo, Vaswani2017, devlin:etal:2018, xlnet, universaltransformer). The models represent words via a mutual constraint satisfaction process in which each word in a text span influences the representation of every other word. BERT (devlin:etal:2018) is a key model in this class, illustrated in Fig. 3 as it encodes the end of the sentence John reached the bank of the river (example from (uszkoreit17transformer)). An initial context-independent representation of each word is first combined with a positional representation (bottom row of boxes in the figure). Then, a separate copy of the same neural network module updates the representation in each position with input from all other positions. This process uses queries (red) at each position that are compared to keys (yellow) at all positions to form weightings (mauve boxes) that determine how strongly the values (blue) from each position contribute to the combined attention vectors (grey boxes) that provide context sensitivity. The computation iterates over many layers, allowing words whose representations have been influenced by their context to influence the representations of other words. The process allows selection among alternative distinct meanings of a word like bank as well as graded shading of word meaning by context, for example assigning different emotional valence to the dogs in the dog wagged its tail and the dog snarled.
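
The query/key/value step just described can be sketched compactly in code. The following is a single head of one attention layer, assuming PyTorch; it is an illustration of the mechanism, not a full BERT implementation.

```python
# Minimal single-head self-attention: each position forms a query, compares it to
# the keys at all positions, and takes a weighted blend of the values, so the
# representation of "bank" can be influenced by "river".
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (sequence_length, dim) word representations (with positions added)
        q, k, v = self.q(x), self.k(x), self.v(x)
        weights = F.softmax(q @ k.T / math.sqrt(x.size(-1)), dim=-1)  # the weightings
        return weights @ v                                            # attention vectors

dim = 32
words = torch.randn(7, dim)     # "John reached the bank of the river"
layer = SelfAttention(dim)
contextualized = layer(words)   # every position now reflects every other position
print(contextualized.shape)     # torch.Size([7, 32])
```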

Key ingredients of the contemporary models are (i) bidirectionality of information flow during processing and (ii) an attention mechanism, which replaces the LSTM mechanism to enhance the exploitation of context. Bidirectionality matters because the meaning of a word in context depends on what comes after it as well as what comes before; in the example sentence, the last word, river, determines the meaning of bank. While it is remarkable how much can be done with strictly left-to-right constraint propagation (radford:etal:2018), bidirectionality allows neural models to implement a mutual constraint satisfaction process in which the representation of each word depends on all other words. The attention mechanism distinguishes these models from earlier LSTM-based neural language models, and is used in both bidirectional and left-to-right models. Attention has proven to be even more effective than the LSTM mechanism in allowing networks to capture long-distance dependencies. Rather than requiring information about a context word to reach a target word through an iteration of the LSTM for every intervening word, the context reaches the target directly. Likewise, gradient learning signals skip over the intervening words, avoiding the dissipation of learning signal that would otherwise occur.

BERT-based models have produced remarkable improvements on a wide range of language tasks (devlin2018bert, wang2018glue). The models can be pre-trained on massive text corpora, providing useful representations for subsequent tasks for which little specific training data exists (wang2019superglue). These models seem to capture syntactic, semantic, and world knowledge, and they are beginning to address tasks once thought beyond their reach. For example, the Winograd Schema challenge (levesque2012winograd) requires determining the referent of a pronoun (here it) in a sentence such as The trophy did not fit in the suitcase because it was too ___. For a person, world knowledge tells us that if the missing word is big the referent must be the trophy, but if it is small the referent must be the suitcase. The latest models achieve ever-higher scores on benchmarks including variants of the Winograd challenge (raffel2019exploring). However, variants of test materials that do not fool humans continue to stymie even the best models (Jia:Liang:2017), and further refinements in the models and their assessment will be required before it is clear what such models can achieve.
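
A pre-trained masked language model can be probed on a cloze version of this example in a few lines, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available. This illustrates the pre-train-then-probe usage pattern; it is not the benchmark evaluation protocol used in the cited work.

```python
# Probe a pretrained BERT-style model on a Winograd-flavored cloze.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask(
    "The trophy did not fit in the suitcase because it was too [MASK]."
)
for p in predictions[:5]:
    print(f"{p['token_str']:>10}  {p['score']:.3f}")   # e.g. big, small, large, ...
```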

The Human Integrated Understanding System (IUS)

Situations and objects

Despite the successes of neural language modeling, an important limitation is that these models are purely language based. We need models in which language is part of an integrated understanding system (IUS) for understanding and communicating about the situations we encounter and the objects that participate in them. Representations of situations constitute our models of our world and guide our behavior and our interpretation of language. Indeed, resolving the referent of the pronoun in a Winograd sentence would follow from building a representation of the situation the sentence describes. In the situation in which a trophy does not fit in a suitcase, the natural reason would be that the trophy is too big or the suitcase too small; the identity of the referent of the pronoun follows from realizing this. Thus, solving the Winograd Schema challenge is a natural byproduct of the human language understanding process. In short, we argue that language evolved for communication about situations, and our systems should address this goal.

Situations can be concrete and static, such as one where a cat is on a mat, or they may be events, such as one where a boy hits a ball. They can be conceptual, social, or legal, such as one where a court invalidates a law. They may even be imaginary. The objects may be real or fictitious physical objects or locations; animals, persons, groups, or organizations; beliefs or other states of mind; or entities such as theories, laws, or constitutions. Here we focus on concrete situations, considering other cases below. Our proposal builds on classic work in linguistics (lak87, langacker1987foundations), human cognition (bransford1972contextual), artificial intelligence (schank1983dynamic), and an early PDP model (STJOHN1990217), and dovetails with an emerging perspective in cognitive neuroscience (hasson2018grounding).

As humans process language, we construct a representation of the situation the language describes from the stream of words and other available information. Words and their sequencing serve as clues to meaning (rumelhart1979problems) that jointly constrain the understanding of the situation and each object participating in it (STJOHN1990217). Consider this passage:

John spread jam on a slice of bread. The knife had been dipped in poison. John ate the bread and soon began to feel sick.

We can make many inferences here: that the jam was spread with the poisoned knife, that some of the poison was transferred to the bread, and that this may have led to John’s sickness. Note that the entities here are objects, not words, and the situation could instead be conveyed by a silent movie.

Evidence that humans construct situation representations comes from classic work by Bransford and colleagues (bransford1972contextual, BARCLAY1974471). This work demonstrates that (1) we understand and remember texts better when we can relate the statements in the text to a familiar situation; (2) information that conveys aspects of the situation can be provided by a picture accompanying the text; (3) the characteristics of the objects we remember depend on the situations in which they occurred in a text; (4) we represent in memory objects not explicitly mentioned in texts; and (5) after hearing a sentence describing spatial or conceptual relationships among objects, we retain memory for these relationships rather than the linguistic input. Further, evidence from eye movements shows that people use linguistic and non-linguistic input jointly and immediately as they process language in context (tanenhaus1995integration). For example, just after hearing The man will drink … participants look at a full wine glass rather than an empty beer glass (altmann2007real). After hearing The man drank, they look at the empty beer glass. Thus, language understanding involves constructing–in real time–a representation of the situation being described by language input, including the objects involved and their spatial relationships with each other, using visual and linguistic inputs.

Figure 4: Proposed integrated understanding system (IUS). The blue box contains the neocortical system, with each oval forming an embedding (representation) of a specific kind of information. Blue arrows represent learned connections that allow the embeddings to constrain each other. The red box contains the medial temporal lobe system, thought to provide a network that stores an integrated embedding of the neocortical system state. The red arrow represents fast-learning connections that bind the elements of this embedding together for later reactivation and use. Green arrows connecting the red and blue ovals support bidirectional influences between the two systems. (A) and (B) are two example inputs discussed in the main text.

The understanding system in the brain

Fig. 4 presents a depiction of our proposed integrated understanding system. Our proposal is both a theory of the brain basis of understanding and a proposed architecture for future language understanding research. It is largely consistent with proposals in (hasson2018grounding). First, we focus on a part of the system, called the neocortical system, that is sufficient to combine linguistic and non-linguistic input to understand the object and situation referred to upon hearing a sentence containing the word bat while observing a corresponding situation in the world. This system consists of the blue ovals (corresponding to pools of neurons in the brain) and blue arrows (connections between these pools) in the figure. One population subserves a visual representation/embedding of the given situation, and another subserves a non-semantic linguistic representation capturing the sound structure (phonology) of co-occurring spoken language. The third represents objects participating in the situation, and the fourth represents the overall situation itself. Within each pool, and between each connected pair of pools, the neurons are reciprocally interconnected via learning-dependent pathways allowing mutual constraint satisfaction among all of the elements of each of the embedding types. Brain regions for representing visual and linguistic inputs are well-established, and the evidence for their involvement in a mutual constraint satisfaction process is substantial (mcclelland2014interactive). Here we focus on the evidence for object and situation representations in the brain.
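
To make the mutual-constraint-satisfaction idea among these pools more concrete, here is a highly simplified sketch, assuming PyTorch. The pool names, dimensionality, settling procedure, and simple linear pathways are all assumptions for illustration; this is not the authors' implementation of the proposed system.

```python
# Sketch: modality-specific embedding pools (vision, phonology, object, situation)
# repeatedly update their states from all of the other pools through learned,
# bidirectional pathways, settling toward a mutually consistent interpretation.
import torch
import torch.nn as nn

class ConstraintSatisfactionPools(nn.Module):
    POOLS = ["vision", "phonology", "object", "situation"]

    def __init__(self, dim: int = 64):
        super().__init__()
        # One learned pathway for each ordered pair of pools (bidirectional overall).
        self.pathways = nn.ModuleDict({
            f"{src}_to_{dst}": nn.Linear(dim, dim)
            for src in self.POOLS for dst in self.POOLS if src != dst
        })

    def settle(self, states: dict, steps: int = 5) -> dict:
        # states: pool name -> tensor of shape (batch, dim); absent inputs start at zero.
        for _ in range(steps):
            new_states = {}
            for dst in self.POOLS:
                total = sum(self.pathways[f"{src}_to_{dst}"](states[src])
                            for src in self.POOLS if src != dst)
                new_states[dst] = torch.tanh(total)   # bounded activations
            states = new_states
        return states

pools = ConstraintSatisfactionPools()
init = {name: torch.zeros(1, 64) for name in ConstraintSatisfactionPools.POOLS}
init["vision"] = torch.randn(1, 64)      # e.g. seeing a bat
init["phonology"] = torch.randn(1, 64)   # e.g. hearing the word "bat"
settled = pools.settle(init)
print(settled["situation"].shape)        # torch.Size([1, 64])
```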

Object representations

A brain area near the front of the temporal lobe houses neurons whose activity provides an embedding capturing the properties of an object someone is considering (patterson2007you). Damage to this area impairs the ability to name objects, to grasp objects correctly in service of their intended use, to match objects with their names or the sounds that they make, and to pair objects that go together with each other, either from their names or from pictures. Models that capture these findings (rogers2004structure) treat this area as the hidden layer of an interactive, recurrent network with bidirectional connections to other layers corresponding to brain areas that represent different types of object properties including the object’s name. In these models, an input to any of these other layers activates the corresponding pattern in the hidden layer, which in turn activates the corresponding patterns in the other layers, supporting, for example, the ability to produce the name of an object from visual input. Damage (simulated by removal of neurons in the hidden layer) degrades the model’s representations, capturing the patterns of errors made by patients with the condition.

Situation representations

The situation representation specifies the event or situation conveyed by visual and/or language input. Evidence from behavioral studies indicates that construction of a situation representation can occur with or without language input (zwaan1998situation) or through the convergent influence of both sources of information (altmann2007real). Cognitive neuroscience research supports the idea that the situation representation arises in a set of interconnected brain areas primarily located in the frontal and parietal lobes (Ranganath2012, hasson2018grounding). In recent work, brain imaging data is used to analyze the time-varying patterns of neural activity that arise during the processing of a temporally extended event sequence. The activity patterns that represent corresponding events in a sequence are largely the same, whether the information about the sequence comes from watching a movie, hearing or reading a narrative description, or recalling the movie after having seen it (doi:10.1093/cercor/bhx202, Baldassano2017).

Situation-specific constraints

An important feature of our brain-inspired proposal is the use of distinct situation and object representations, and the idea that the constraints on the participating objects are mediated by the situation representation. The advantage of this is that it allows these constraints to be situation-specific. For example, the dogs in the events described by the sentences the boy ran to the dog and the boy ran from the dog are likely to be different, and a comprehender will represent them differently (BARCLAY1974471). In general, context-sensitivity is best captured by a mediating representation rather than direct associations among constituents (hinton81semanticnets). While BERT-like models might partially capture such constraints implicitly, an integrated situation representation may be more effective.

In summary, the brain contains distinct areas that represent each input modality and the objects and situations conveyed through them, computing these representations through a mutual constraint satisfaction process combining language and other inputs. Emulating this architecture in machines could contribute to achieving human-level language understanding. What would a computational instantiation of our system look like? It is likely that a biologically realistic version would differ in some ways from the most effective machine version. Contemporary attention-based language models can deploy attention over tens to thousands of words kept in their current system state, but evidence from brain imaging data collected during movie comprehension suggests that activation states in visual, speech, and object areas change rapidly as events unfold, while activity in brain areas associated with situation representations tends to be more constant, changing only at event boundaries (Baldassano2017). In humans, spanning longer temporal windows, including multi-event narratives, appears to require the complementary learning system we consider next.

Complementary Learning Systems

Learning plays a crucial role in understanding. The knowledge in the connection weights in the neural networks we have described is acquired through the accumulation of very small adjustments based on each experience. The connection weights gradually become sensitive to subtle higher-order statistical relationships, taking more and more context into account as learning continues (cleeremans1991learning), and exhibiting sensitivity both to general and recurring specific information (e.g., names of close friends and famous people). In our proposed architecture, this gradual process occurs in all the pathways represented by the blue arrows in Fig. 4, just as it does in the artificial neural language models considered above. However, this learning mechanism is not well suited to acquiring new information rapidly, and attempting to learn specific new information quickly by focused repetition leads to catastrophic interference with what is already known (mccloskey1989catastrophic).

Yet, humans can often rely on information presented just once at an arbitrary time in the past to inform our current understanding. Returning to the beer John left in the cooler, to anticipate that John will not find the beer when he opens the cooler again, we must rely on information acquired when we first heard about the beer being stolen. Such situations are ubiquitous, and a learning system must be able to exploit such information, but BERT and the other models described previously are limited in this way. Though some models hold long word sequences in an active state, when one text is replaced with another, only the small connection adjustments described above remain, leaving these systems without access to the specifics of the prior information.

The human brain contains a system that addresses this limitation. Consider a situation in which someone sees a previously unfamiliar object and hears a spoken statement about it, as illustrated in Fig. 4B. The visual input provides one source of information about the object (a previously unfamiliar animal), while the linguistic input provides its name. Humans show robust learning after just two brief exposures to such pairings (warren2019fast). This form of learning depends on the hippocampus and adjacent areas in the medial temporal lobes (MTL) of the brain (warren2019fast). While details of the role of the MTL in learning and memory continue to be debated (squire92, yonelinas2019), there is consensus that the MTL is crucial for the initial formation of new memories, including memories for specific events and their constituent objects and situations, while general knowledge, the ability to understand language, and previously acquired skills are unaffected by MTL damage.

The evidence from MTL damage suggests there is a fast learning system in the MTL. According to complementary learning systems theory (CLST) (Marr71archicortex, mcclelland95cls, kumaran16cls) this system (shown in red in Fig. 4) provides an integrated representation of the understanding system state, and employs modifiable connections within the MTL (red arrow) that can change rapidly to support new learning based on a single experience. The green arrows represent connections that carry information between the neocortical (blue) and MTL (red) systems so the systems can influence each other.

Let us consider how, according to CLST, a human can learn about the numbat (see Fig. 4B) from an experience seeing it and hearing a sentence about it (kumaran16cls). The input to the MTL is thought to be an embedding that captures (i.e., can be used to reconstruct) the patterns in the neocortical areas that arise from the experience. Networks within the MTL (not shown) map the MTL input representation to a sparser one deep inside the MTL, maximizing distinctness and minimizing interference among experiences (Marr71archicortex). Large connection weight changes within the MTL associate the elements of the sparse representation with each other and with the MTL input representation. When the person hears the word numbat in a later situation, connections to the MTL from the neocortex activate neurons in the MTL input representation. The weight changes that occurred on prior exposure support the reconstruction of the complete MTL representation, and return connections to the neocortex then support approximate reconstruction of the visual, speech, object and situation representations formed during the initial exposure to the numbat. These representations are the explicit memory for the prior experience, allowing the cortical network to use what it learned from one prior exposure to contribute to understanding the new situation.
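
The fast-learning mechanism attributed to the MTL can be caricatured in a few lines of code. The sketch below is only an illustration of the ideas of sparse coding, one-shot association, and pattern completion; the sizes, the random projection, and the top-k sparsification are assumptions, not the theory's specific machinery.

```python
# Sketch: a single experience is stored by (i) mapping the neocortical input
# embedding to a sparse code and (ii) making large, one-shot Hebbian weight
# changes so the full embedding can later be reconstructed from a partial cue.
import numpy as np

rng = np.random.default_rng(0)
D, S, K = 128, 2048, 40            # input dim, sparse-layer size, active units

proj = rng.standard_normal((S, D)) # fixed mapping: MTL input -> sparse layer
assoc = np.zeros((D, S))           # fast, modifiable return connections

def sparse_code(x):
    h = proj @ x
    code = np.zeros(S)
    code[np.argsort(h)[-K:]] = 1.0  # keep only the K most active units
    return code

def store(x):
    # One-shot Hebbian update: associate the sparse code with the full embedding.
    global assoc
    assoc += np.outer(x, sparse_code(x))

def recall(cue):
    # Pattern completion: a partial/noisy cue reactivates the sparse code, whose
    # learned return connections approximately reconstruct the stored embedding.
    return assoc @ sparse_code(cue) / K

numbat_embedding = rng.standard_normal(D)      # state evoked by seeing the numbat
store(numbat_embedding)

partial_cue = numbat_embedding + 0.5 * rng.standard_normal(D)  # e.g. hearing "numbat" later
reconstruction = recall(partial_cue)
print(np.corrcoef(reconstruction, numbat_embedding)[0, 1])     # high correlation
```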

Integrating information into the neocortex

How can knowledge initially dependent on the MTL be integrated into the neocortex? According to CLST (mcclelland95cls), the neocortex learns gradually through interleaved presentations of new and familiar items; this process avoids interference of new items with what is already known. Interleaved learning can occur through ongoing experience, as would happen if, for example, we acquire a pet numbat that we then see every day, while continuing to have other experiences. Interleaving may also occur during rest or sleep through reactivation and replay of patterns stored in the MTL: Indeed, spontaneous replay of short snippets of previously experienced episodes occurs within the MTL during sleep and between behavioral episodes (see (kumaran16cls) for review).
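
In machine learning terms, interleaved learning corresponds to mixing replayed familiar examples with new ones during slow training rather than training on the new material in a focused block. The sketch below illustrates the idea with placeholder data and a placeholder network, assuming PyTorch; the replay proportion is an arbitrary choice.

```python
# Sketch of interleaved learning: a new item (e.g. facts about the numbat) is
# interleaved with replayed familiar items during gradual "neocortical" training.
import random
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

familiar = [(torch.randn(32), torch.randint(0, 10, (1,)).item()) for _ in range(500)]
new_items = [(torch.randn(32), 3)]          # a single new experience (held in the MTL)

for step in range(1000):
    # Mostly replay of familiar items, with the new item interleaved occasionally.
    x, y = random.choice(new_items) if random.random() < 0.05 else random.choice(familiar)
    loss = loss_fn(model(x.unsqueeze(0)), torch.tensor([y]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```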

In summary, the human brain contains complementary learning systems that support the simultaneous use of many sources of information as we seek to understand an experienced situation. One of these systems acquires an integrated system of knowledge gradually through interleaved learning, including our knowledge of the meanings of words, the properties of frequently-encountered objects, and the characteristics of familiar situations. The other complements this system to allow information from specific experiences to be brought to bear on the interpretation of a current situation.

Toward an Artificial Integrated Understanding System

Here we consider current deep learning research that is taking steps consistent with our IUS proposal, and point toward future directions that will be needed to achieve a truly integrated and fully functional understanding system. We begin within the context of language grounded within concrete visual and physical situations, then consider the role of memory, and finally turn to the extension of the approach to address understanding of more abstract objects, situations, and relations.

Mapping vision and language to representations of objects

How might a model learn about situations that can occur in the world? The need for an artificial system of language understanding to be grounded in the external world has long been discussed. An early example is Winograd’s SHRDLU system (winograd1972understanding), which produced and responded to language about a simulated physical world. Deep learning has enabled joint, end-to-end training of perceptual input and language (i.e., in a single synchronous optimization process). Recent advances with such models have greatly improved performance, resulting in applications transforming user experiences. When presented with a photograph, networks can now answer questions such as what is the man holding? or what color is the woman’s shirt? (macleod2017understanding), demonstrating an ability to combine information from vision and language to understand a class of situations.

A very recent model (hudson2019learning) explicitly represents a scene as a designed scene graph, with slots for the objects, their properties, and their relations to other objects. It encodes a question as a series of instructions for searching the graph to find a target object or relation. For example, the question what is the object beside the yellow bowl? can be answered by finding the yellow bowl, finding an object linked to it with the ‘beside’ relation, and then reading out this object’s identity. The approach advances the state of the art, though a large gap relative to humans remains. The model shares important properties with our proposal in that it explicitly treats language input as querying the model’s representation of the objects in the scene and their conceptual properties. A natural extension consistent with our IUS proposal would be to build up scene representations using a combination of visual input and language, allowing text to enrich the representations of objects and relations.
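
The scene-graph idea can be illustrated with a toy example. The graph format and the hard-coded question "program" below are simplified assumptions for exposition, not the cited model's learned representation or instruction decoder.

```python
# Toy scene-graph question answering: represent the scene as an explicit graph of
# objects, properties, and relations, and execute a question as graph operations.
scene_graph = {
    "bowl_1":  {"properties": {"color": "yellow"}, "relations": {"beside": "mug_1"}},
    "mug_1":   {"properties": {"color": "blue"},   "relations": {"beside": "bowl_1"}},
    "plate_1": {"properties": {"color": "white"},  "relations": {}},
}

def answer_beside_question(graph: dict, color: str, category: str) -> str:
    # "What is the object beside the yellow bowl?"
    # Step 1: find the node matching the description; Step 2: follow 'beside';
    # Step 3: read out the target object's identity.
    for name, node in graph.items():
        if name.startswith(category) and node["properties"].get("color") == color:
            target = node["relations"].get("beside")
            return target if target else "nothing"
    return "no such object"

print(answer_beside_question(scene_graph, "yellow", "bowl"))   # mug_1
```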

A question this work raises is whether to build structured representations into one’s model. This is advocated in (hudson2019learning), but natural structure exhibits flexible embedding relationships and is often only approximately characterized by explicit taxonomies, motivating use of emergent connection-based rather than hard-coded representational structures (rumelhart1986schemata). A challenge, then, is to achieve comparable performance with models in which these concept-based object and relational representations emerge through learning.

Embodied models for language understanding

Beyond the integration of vision and language, as illustrated in Fig. 4, we see progress coming from an even fuller integration of many additional information sources. Every source provides a basis for distinct learning objectives and enables information that is salient in one source to bootstrap learning and inference in the other. Important additional sources of information include non-language sound, touch and force-sensing, and information about one’s own actions.

Incorporating additional information sources will allow an IUS to go beyond answering questions about static images. Since image data has no temporal aspect, such models lack experience of events or processes. While models that jointly process video and language (yu2016video) may acquire some sensitivity to event structure and commonplace causal relationships, these systems do not make choices affecting the world they observe. Ultimately, an ability to link one’s actions to their consequences as one intervenes in the observed flow of events and interacts with other agents should provide the strongest basis for acquiring notions of cause and effect, of agency, and of self and others.

These considerations motivate recent work on agent-based language learning in simulated interactive 3D environments (hermann2017grounded, das2017question, chaplot2017gated, oh2017zero). In (hill2019emergent), an agent was trained to identify, lift, carry and place objects relative to other objects in a virtual room, as specified by simplified language instructions. At each time step, the agent received a first-person visual observation (a pixel-based image) that it processed to produce a representation of the scene. This was concatenated to the final state of an LSTM that processed the instruction, then passed to an integrative LSTM whose output was used to select a motor action. The agent gradually learned to follow instructions of the form find a pencil, lift up a basketball, and put the teddy bear on the bed, encompassing 50 objects and requiring extended sequences of actions to complete. Such instructions require the construction of representations based on language stimuli that enable the identification of objects and relations across space and time, and the integration of this information to inform motor behaviors. Importantly, without building in explicit object representations, the system supported the interpretation of novel instructions. For instance, an agent trained to lift a set of 20 objects in the environment, but trained to put only 10 of those in a specific location, could place the remaining objects in the same location on command with over 90% accuracy.
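
The architecture just described can be sketched as follows, assuming PyTorch. The convolutional encoder, layer sizes, vocabulary, and action set are placeholders chosen for illustration; they are not the configuration used in the cited work.

```python
# Sketch of an instruction-following agent: a visual encoder processes the
# first-person observation, an LSTM encodes the instruction, their outputs are
# concatenated and fed to an integrative LSTM, and a linear head scores actions.
import torch
import torch.nn as nn

class InstructionFollowingAgent(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=128, n_actions=13):
        super().__init__()
        self.vision = nn.Sequential(                       # pixel observation -> vector
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden_dim), nn.ReLU())
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.language = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.core = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.policy = nn.Linear(hidden_dim, n_actions)

    def forward(self, frame, instruction, core_state=None):
        visual = self.vision(frame)                        # (batch, hidden_dim)
        _, (lang_final, _) = self.language(self.embed(instruction))
        fused = torch.cat([visual, lang_final[-1]], dim=-1).unsqueeze(1)
        out, core_state = self.core(fused, core_state)     # integrative LSTM step
        return self.policy(out[:, -1]), core_state

agent = InstructionFollowingAgent()
frame = torch.randn(1, 3, 84, 84)                          # first-person observation
instruction = torch.randint(0, 1000, (1, 6))               # e.g. "put the teddy bear on the bed"
action_logits, state = agent(frame, instruction)
print(action_logits.shape)                                 # torch.Size([1, 13])
```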

Neural models often fail to exhibit systematic generalization, leading some to propose that more structure should be built in (lake2015human). While the agent’s level of systematicity does not reach human levels, this work suggests that grounding language learning can help support systematicity without building it in. Critically, the agent’s systematicity was contingent on its ego-centric, multimodal and temporally-extended experience. On the same set of generalization tests, both an alternative agent with a fixed perspective on a 2D grid world and a static neural network classifier that received only individual still-image stimuli exhibited significantly worse generalization than the fully situated agent (Fig. 5). This underlines how affording neural networks access to rich, multi-modal interactive environments can stimulate the development of capacities that are essential for language learning.

Despite these promising signs, achieving fully human levels of generalization remains an important challenge. We propose that incorporating an MTL-like fast learning system will help address this challenge by allowing a new word to be linked to the corresponding object after just a single episode, supporting later use of the word to refer to that object in other situations.

Figure 5: Left: the (allocentric) agent perspective in a 2D grid-world. The text indicates the language instruction, requiring the agent (white striped cell) to visit one of the red figures and move it to the white square. Right: the first-person perspective of the situated agent, addressing an equivalent task to the one posed in the grid world.

An artificial fast learning system

What might a fast learning system look like in an implementation of an integrated understanding system? The memory system in the differentiable neural computer (DNC) (graves2016hybrid) is one possibility. These systems store embeddings derived from past episodes in memory slots; the slots could hold integrated system-state representations like those we attribute to the human MTL. Alternatively, they could store the entire ensemble of states across the visual, speech, object, and situation representations. Though we do not believe the brain has a separate slot for each memory, it can be useful to model it as though it does (kumaran16cls), and artificial systems with indefinite capacity could exceed human abilities in this regard. How might the retrieval of relevant information work in such a system? The DNC employs a querying system similar to the one in BERT and to proposals in the human memory literature, whereby the representation retrieved from the MTL is weighted by the degree of match between a query (which we would treat as coming from the neocortex) and each vector stored in memory. Close matches are favored in this computation (hintzman1984minerva), so that when there is a unique match (such as a single memory containing a once-seen word like numbat), the corresponding object and situation representation can be retrieved. Retrieval could be based on a combination of context and item information, similar to human memory (polyn2009context). Working out the details of such a system presents an exciting research direction for the future.
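
The match-weighted retrieval idea can be sketched in a few lines. The sketch below is in the spirit of a DNC-style read operation as described above, assuming PyTorch; the dimensions, cosine matching, and temperature are illustrative assumptions rather than the DNC's full addressing scheme.

```python
# Sketch of similarity-based retrieval from a slot memory: a query is compared to
# stored keys, and the returned memory is a softmax-weighted blend favoring close
# matches, so a near-unique match (e.g. the sole "numbat" episode) dominates.
import torch
import torch.nn.functional as F

dim, n_slots = 64, 100
keys = torch.randn(n_slots, dim)      # one key per stored episode (e.g. its context)
values = torch.randn(n_slots, dim)    # the stored integrated-state embedding

def read(query: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    scores = F.cosine_similarity(keys, query.unsqueeze(0), dim=-1)
    weights = F.softmax(scores / temperature, dim=0)   # sharp: close matches favored
    return weights @ values                            # weighted blend of stored states

# A query resembling the stored episode in slot 17 retrieves roughly that episode.
query = keys[17] + 0.1 * torch.randn(dim)
retrieved = read(query)
print(F.cosine_similarity(retrieved, values[17], dim=0))   # close to 1
```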

Beyond concrete situations

Our discussion has focused primarily on concrete situations. However, language allows us to discuss abstract ideas and situations, where grounding in the physical world can be very indirect and our learning about it comes primarily from language-based materials. Consider, for example, an understanding of Brexit. Concrete events involving actual people have occurred, but the issues and questions under consideration can only be communicated through language. What is the way forward toward developing models that can understand such a complex situation?

Language may have evolved in part to support transmission of complex, hierarchical knowledge about tools (stout:chaminade:2012). However, utterances also had to support abstraction and complex dependencies between concrete objects, as well as social relationships between speakers. Words themselves provided a new abstract substrate for characterizing other words (bryson:2008). Word embeddings are one implementation of this substrate: they can characterize abstract words like justice and represents without directly grounding them. In humans, encyclopedic knowledge grounded in such representations can be acquired in an MTL-dependent way from reading an encyclopedia article just once. For machines, forming integrated system state representations capturing the content and using a DNC-like system for their storage and retrieval might provide a starting place for enabling such knowledge to be acquired and used effectively.

That said, words are uttered in real world contexts and there is a continuum between grounding and language-based linking for different words and different uses of words. For example, career is not only linked to other abstract words like work and specialization but also to more grounded concepts such as path and its extended metaphorical use for discussing the means to achieve goals (bryson:2008). Embodied, simulation-based approaches to meaning (Lakoff/Johnson:1980, feldman:narayanan:2004) build on this observation to bridge from concrete to abstract situations via metaphor. They posit that understanding words like grasp is directly linked to neural representations of the action of grabbing and that this circuitry is recruited for understanding the word in contexts such as grasping an idea. We consider situated agents as a critical catalyst for learning about how to represent and compose concepts pertaining to spatial, physical and other perceptually immediate phenomena—thereby providing a grounded edifice that can connect to both the low level brain circuitry for motor action and to representations derived primarily from language.

Conclusion

Language does not stand alone. The integrated understanding system in the brain connects language to representations of objects and situations and enhances language understanding by exploiting the full range of our multi-sensory experience of the world, our representations of our motor actions, and our memory of previous situations. We have argued that the next generation of language understanding systems should emulate this system in the brain, and we have sketched some aspects of the form such a system might take. While we have emphasized understanding of concrete situations, we have argued that understanding more abstract language builds upon this concrete foundation, pointing toward the possibility that it may someday be possible to build artificial systems that understand abstract situations far beyond the concrete and the here-and-now. In sum, we have proposed that modeling the integrated understanding system in the brain will take us closer to capturing human-level language understanding and intelligence.

This article grew out of a workshop organized by HS at Meaning in Context 3, Stanford University, September 2017. We thank Chris Potts for discussion. HS was supported by ERC Advanced Grant #740516.

References