Visually Grounded, Situated Learning in Neural Models

05/29/2018 ∙ by Alexander G. Ororbia et al. ∙ Rochester Institute of Technology, Penn State University, Google

The theory of situated cognition postulates that language is inseparable from its physical context--words, phrases, and sentences must be learned in the context of the objects or concepts to which they refer. Yet, statistical language models are trained on words alone. This makes it impossible for language models to connect to the real world--the world described in the sentences presented to the model. In this paper, we examine the generalization ability of neural language models trained with a visual context. A multimodal connectionist language architecture based on the Differential State Framework is proposed, which outperforms its equivalent trained on language alone, even when no visual context is available at test time. Superior performance for language models trained with a visual context is robust across different languages and models.




1 Introduction

The theory of situated cognition postulates that a person’s knowledge is inseparable from the physical or social context in which it is learned and used (Greeno and Moore, 1993). Similarly, Perceptual Symbol Systems theory holds that all of cognition (thought, language, reasoning, and memory) is grounded in perceptual features (Barsalou, 1999). Knowledge of language cannot be separated from its physical context, which allows words and sentences to be learned by grounding them in reference to objects or natural concepts on hand (see Roy and Reiter, 2005, for a review). Nor can knowledge of language be separated from its social context, where language is learned interactively through communicating with others to facilitate problem-solving. Simply put, language does not occur in a vacuum.

Yet, statistical language models, typically connectionist systems, are often trained in such a vacuum. Sequences of symbols (sentences or phrases composed of words in a language such as English or German) are often fed into the model independently of any real-world context they might describe. In the classical language modeling framework, a model learns to predict a word based on a history of words it has seen so far. While these models learn a great deal of linguistic structure from these symbol sequences alone, acquiring the essence of basic syntax, it is highly unlikely that this approach can create models that acquire much in terms of semantics or pragmatics, which are integral to the human experience of language. How might one build neural language models that “understand” the semantic content held within the symbol sequences, of any language, presented to them?

In this paper, we take a small step towards a model that understands language as a human does by training a neural model jointly on corresponding linguistic and visual data. From an image-captioning dataset, we create a multi-lingual corpus where sentences are mapped to the real-world images they describe. We ask how adding such real-world context at training can improve language model performance. We create a unified multi-modal connectionist architecture that incorporates visual context and uses either Δ-RNN (Ororbia II et al., 2017), Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber, 1997), or Gated Recurrent Unit (GRU; Cho et al., 2014) units. We find that the models acquire more knowledge of language than if they were trained without corresponding, real-world visual context.

2 Related Work

Both behavioral and neuroimaging studies have found considerable evidence for the contribution of perceptual information to linguistic tasks (Barsalou, 2008). It has long been held that language is acquired jointly with perception through interaction with the environment (e.g., Frank et al., 2008). Eye-tracking studies show that visual context influences word recognition and syntactic parsing from even the earliest moments of comprehension (Tanenhaus et al., 1995).

Computational cognitive models can account for bootstrapped learning of word meaning and syntax when language is paired with perceptual experience (Abend et al., 2017) and for the ability of children to rapidly acquire new words by inferring the referent from their physical environment (Alishahi et al., 2008). Some distributional semantics models integrate word co-occurrence data with perceptual data, either to achieve a better model of language as it exists in the minds of humans (Baroni, 2016; Johns and Jones, 2012; Kievit-Kylar and Jones, 2011; Lazaridou et al., 2014) or to improve performance on machine learning tasks such as object recognition (Frome et al., 2013; Lazaridou et al., 2015a), image captioning (Kiros et al., 2014; Lazaridou et al., 2015b), or image search (Socher et al., 2014).

Integrating language and perception can facilitate language acquisition by allowing models to infer how a new word is used from the perceptual features of its referent (Johns and Jones, 2012) or to allow for fast mapping between a new word and a new object in the environment (Lazaridou et al., 2014). Likewise, this integration allows models to infer the perceptual features of an unobserved referent from how a word is used in language (Johns and Jones, 2012; Lazaridou et al., 2015b). As a result, language data can be used to improve object recognition by providing information about unobserved or infrequently observed objects (Frome et al., 2013) or for differentiating objects that often co-occur in photos (e.g., cats and sofas; Lazaridou et al., 2015a).

By representing the referents of concrete nouns as arrangements of elementary visual features (Biederman, 1987), Kievit-Kylar and Jones (2011) found that the visual features of nouns capture semantic typicality effects, and that a combined representation, consisting of both visual features and word co-occurrence data, correlates more strongly with human judgments of semantic similarity than representations extracted from a corpus alone. While modeling similarity judgments is distinct from the problem of predictive language modeling, we take this finding as evidence that visual perception informs semantics, which suggests there are gains to be had by integrating perception with predictive language models.

In contrast to prior work in machine learning, where mappings between vision and language have been examined (Kiros et al., 2014; Vinyals et al., 2015; Xu et al., 2015), our goal in integrating visual and linguistic data is not to accomplish a task such as image search or captioning that inherently requires a mapping between these modalities. Rather, our goal is to show that, since perceptual information is intrinsic to how humans process language, a language model trained on both visual and linguistic data will be a better model, consistently across languages, than one trained on linguistic data alone.

Due to the ability of language models to constrain predictions on the basis of preceding context, language models play a central role in natural-language and speech processing applications. However, the psycholinguistic questions surrounding how people acquire and use linguistic knowledge are fundamentally different from the aims of machine learning. Using NLP-style language models to address psycholinguistic questions is a new approach that integrates well with the theory of predictive coding in cognitive psychology (Clark, 2013; Rao and Ballard, 1999). For language processing this means that when reading text or comprehending speech, humans constantly anticipate what will be said next. Predictive coding in humans is a fast, implicit cognitive process similar to the kind of sequence learning that recurrent neural models excel at. We do not propose recurrent neural models as direct accounts of human language processing. Instead, our intent is to use a general purpose machine learning algorithm as a tool to investigate the informational characteristics of the language learning task. More specifically, we use machine learning to explore the question as to whether natural languages are most easily learned when situated in an environmental context and grounded in perception.

3 The Multi-modal Neural Architecture

We will evaluate the multi-modal training approach on several well-known complex architectures, including the LSTM, and further examine the effect of using pre-trained BERT embeddings. However, to describe the neural model simply, we start from the Differential State Framework (DSF; Ororbia II et al., 2017), which unifies gated recurrent architectures under the general view that state memory is a simple parametrized mixture of “fast” and “slow” states. Our aim is to model sequences of symbols, such as the words that compose sentences, where at each time step we process w_t, the one-hot encoding of a token.

One-hot encoding represents tokens as binary-valued vectors with one dimension for each type of token. Only one dimension has a non-zero value, indicating the presence of a token of that type.
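For concreteness, a minimal sketch of one-hot encoding; the toy vocabulary and helper function are illustrative, not part of the paper's pipeline:

```python
def one_hot(token, vocab):
    """Encode a token as a binary vector with one dimension per token type.

    Exactly one dimension is non-zero, marking the token's type."""
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec

vocab = ["the", "dog", "sleeps"]
print(one_hot("dog", vocab))  # [0, 1, 0]
```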

Figure 1: Integration of visual information in an unrolled network (here, the MM-Δ-RNN). Grey-dashed: identity connections; black-dash-dotted: next-step predictions; solid-black lines: weight matrices.

One of the simplest models that can be derived from the DSF is the Δ-RNN (Ororbia II et al., 2017). A Δ-RNN is a simple gated RNN that captures longer-term dependencies in sequences through the use of a parametrized, flexible state “mixing” function. The model computes a new state at a given time step by comparing a fast state (which is proposed after accounting for the current token) and a slow state (a form of long-term memory). The model is defined by parameters Θ = {W, V, α, β₁, β₂, b_r} (input-to-hidden weights W, recurrent weights V, gating-control coefficients α, β₁, β₂, and the rate-gate bias b_r). Inference is defined as:

d_t^rec = V z_{t−1},   d_t^dat = W w_t   (1, 2)
d_t^1 = α ⊗ d_t^rec ⊗ d_t^dat   (3)
d_t^2 = β₁ ⊗ d_t^rec + β₂ ⊗ d_t^dat   (4)
ẑ_t = φ_hid(d_t^1 + d_t^2)   (5)
z_t = Φ((1 − r) ⊗ ẑ_t + r ⊗ z_{t−1}),   r = 1 / (1 + exp(−[d_t^dat + b_r]))   (6)

where w_t is the 1-of-k encoding of the word at time t and ⊗ denotes the element-wise (Hadamard) product. Note that α, β₁, and β₂ are learnable bias vectors that modulate internal multiplicative interactions. The rate gate r controls how slow and fast-moving memory states are mixed inside the model. In contrast to the model originally trained in Ororbia II et al. (2017), the outer activation Φ is the linear rectifier, Φ(v) = max(0, v), instead of the identity or hyperbolic tangent, because we found that it worked much better. The inner activation function φ_hid is tanh.
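The fast/slow state mixing described above can be sketched in NumPy. The function and variable names, the toy dimensions, and the random initialization are our own assumptions for illustration, not the paper's code:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def delta_rnn_step(x, z_prev, W, V, alpha, beta1, beta2, b_r):
    """One Delta-RNN step: propose a fast state from the current token,
    then mix it with the slow previous state via a learned rate gate."""
    d_dat = W @ x                        # data-dependent pre-activation
    d_rec = V @ z_prev                   # recurrent pre-activation
    d1 = alpha * d_rec * d_dat           # multiplicative interaction
    d2 = beta1 * d_rec + beta2 * d_dat   # additive interaction
    z_hat = np.tanh(d1 + d2)             # inner activation: fast (proposed) state
    r = sigmoid(d_dat + b_r)             # rate gate mixing fast and slow states
    # outer activation: linear rectifier, as described in the text
    return np.maximum(0.0, (1.0 - r) * z_hat + r * z_prev)

rng = np.random.default_rng(0)
hid, vocab = 8, 5                        # toy sizes
W = rng.normal(0.0, 0.1, (hid, vocab))
V = rng.normal(0.0, 0.1, (hid, hid))
alpha, b_r = np.zeros(hid), np.zeros(hid)
beta1, beta2 = np.ones(hid), np.ones(hid)

x = np.eye(vocab)[2]                     # one-hot token
z = delta_rnn_step(x, np.zeros(hid), W, V, alpha, beta1, beta2, b_r)
print(z.shape)  # (8,)
```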

To integrate visual context information into the Δ-RNN, we fuse the model with a neural vision system, motivated by work done in automated image captioning (Xu et al., 2015). We adopt a transfer learning approach and incorporate a state-of-the-art convolutional neural network, namely the Inception-v3 network (Szegedy et al., 2016), into the Δ-RNN model in order to create a multi-modal Δ-RNN (MM-Δ-RNN; see Figure 1). In preliminary experiments, we also examined VGGNet and a few others, but found that Inception worked best when it came to acquiring more general distributed representations of natural images. Since our focus is on language modeling, the parameters of the vision network are fixed.

To obtain a distributed representation of an image from the Inception-v3 network, we extract the vector c produced from the final max-pooling layer after running an image through the model (note that this operation occurs right before the final, fully-connected processing layers, which are usually task-specific parameters, such as in object classification). The Δ-RNN can make use of the information in this visual context vector if we modify its state computation in one of two ways. The first way is to modify the inner state to be a linear combination of the data-dependent pre-activation, the filtration, and a learned linear mapping of c, as follows:

ẑ_t = φ_hid(d_t^1 + d_t^2 + M c)   (7)

where M is a learnable matrix of synaptic connections that links the visual context representation with the inner state. The second way is instead to change the outer mixing function:

z_t = Φ((1 − r) ⊗ ẑ_t + r ⊗ z_{t−1}) ⊗ (M c)   (8)

In Equation 8, the linearly-mapped visual context embedding interacts with the currently computed state through a multiplicative operation, allowing the visual context to persist and act in a longer-term capacity. In either case, using a parameter matrix M frees us from having to set the dimensionality of the hidden state to be the same as that of the context vector produced by the Inception-v3 network.
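The two integration options (adding the mapped context into the inner state versus multiplicatively gating the outer mix) can be sketched as follows; the matrix M, the toy dimensions, and the stand-in context vector are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
hid, ctx = 8, 2048                     # hidden size need not match the context size
M = rng.normal(0.0, 0.01, (hid, ctx))  # learned map: visual vector -> state space
c = rng.random(ctx)                    # stand-in for an Inception-v3 pooling vector

pre = rng.normal(size=hid)             # d1 + d2 from the language pathway
z_prev = np.zeros(hid)                 # previous (slow) state
r = np.full(hid, 0.5)                  # rate gate, fixed here for illustration

# Way 1: fold the mapped context additively into the inner (fast) state.
z_inner = np.tanh(pre + M @ c)

# Way 2: let the mapped context multiplicatively gate the outer mixing,
# so the visual signal persists across time steps.
z_hat = np.tanh(pre)
z_outer = np.maximum(0.0, (1.0 - r) * z_hat + r * z_prev) * (M @ c)

print(z_inner.shape, z_outer.shape)  # (8,) (8,)
```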

We do not use regularization techniques with this model. The application of regularization techniques is, in principle, possible (and typically improves performance of the Δ-RNN), but it is damaging to performance in this particular case, where an already compressed and regularized representation of the images from Inception-v3 serves as input to the multi-modal language modeling network.

Let (w_1, …, w_N) be a variable-length sequence of N words corresponding to an image I. In general, the distribution over the variables follows the graphical model:

P(w_1, …, w_N, I) = P(I) ∏_{t=1}^{N} P(w_t | w_1, …, w_{t−1}, I)

For all model variants, the state z_t calculated at any time step is fed into a maximum-entropy classifier (bias term omitted for clarity) defined as:

P(w_t = k | z_t) = exp(U_k · z_t) / Σ_j exp(U_j · z_t)

where U_k is the row of the decoder matrix U corresponding to word k. The model parameters Θ are optimized with respect to the sequence negative log likelihood:

L(Θ) = − Σ_{t=1}^{N} log P(w_t | z_t)

We differentiate with respect to this cost function to calculate gradients.
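A minimal sketch of a softmax ("maximum-entropy") decoder and sequence negative log likelihood; the shapes and random states are hypothetical:

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def sequence_nll(states, targets, U):
    """Negative log likelihood of a token sequence under a softmax decoder
    applied to each recurrent state (bias term omitted, as in the text)."""
    nll = 0.0
    for z, w in zip(states, targets):
        p = softmax(U @ z)   # distribution over the vocabulary
        nll -= np.log(p[w])  # accumulate -log P(target word)
    return nll

rng = np.random.default_rng(2)
U = rng.normal(0.0, 0.1, (5, 8))           # (vocab x hidden) decoder matrix
states = [rng.normal(size=8) for _ in range(3)]
nll = sequence_nll(states, [0, 3, 1], U)
print(nll > 0)  # True: each term -log p is positive for p < 1
```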

3.1 GRU, LSTM and BERT variants

Does visually situated language learning benefit from the specific architecture of the Δ-RNN, or does the proposal work with state-of-the-art language models? We applied the same architecture to Gated Recurrent Units (GRU; Cho et al., 2014), Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber, 1997), and BERT (Devlin et al., 2018). We train these models on text alone and compare to the two variations of the multi-modal Δ-RNN, as described in the previous section. The multi-modal GRU, with context information directly integrated, is defined as follows:

r_t = σ(W_r w_t + V_r h_{t−1})
u_t = σ(W_u w_t + V_u h_{t−1})
ĥ_t = φ_hid(W_h w_t + V_h (r_t ⊗ h_{t−1}))
h_t = (u_t ⊗ h_{t−1} + (1 − u_t) ⊗ ĥ_t) ⊗ (M c)

where σ is the logistic sigmoid, and we note that the parameter matrix M, which maps the visual context into the GRU state, effectively gates the outer function. (In preliminary experiments, we tried both methods of integration, Equations 7 and 8. We ultimately found the second formulation to give better performance.) The multi-modal variant of the LSTM (with peephole connections) is defined as follows:

i_t = σ(W_i w_t + V_i h_{t−1} + p_i ⊗ s_{t−1})
f_t = σ(W_f w_t + V_f h_{t−1} + p_f ⊗ s_{t−1})
s_t = f_t ⊗ s_{t−1} + i_t ⊗ φ_hid(W_s w_t + V_s h_{t−1})
o_t = σ(W_o w_t + V_o h_{t−1} + p_o ⊗ s_t)
h_t = (o_t ⊗ Φ(s_t)) ⊗ (M c)

where s_t is the memory cell state and p_i, p_f, p_o are the peephole weight vectors.
We furthermore created one more variant of each multi-modal RNN by initializing a portion of their input-to-hidden weights with embeddings extracted from the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2018). This corresponds to initializing the input-to-hidden weight matrices of the Δ-RNN, the LSTM, and the GRU. Note that in our results, we only report the best-performing model, which turned out to be the LSTM variant. Since the models in this work are at the word level and BERT operates at the subword level, we create initial word embeddings by first decomposing each word into its appropriate subword components, according to the WordPieces model (Wu et al., 2016), and then extracting the relevant BERT representation for each. For each subword token, a representation is created by summing together a specific learned token embedding, a segmentation embedding, and a position embedding. For a target word, we linearly combine the subword input representations and initialize the relevant weight with this final embedding.
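The subword-to-word initialization can be sketched as follows. The subword table is a hypothetical stand-in for extracted BERT vectors, and uniform averaging is one possible choice of linear combination:

```python
import numpy as np

# Hypothetical subword table standing in for extracted BERT representations;
# the real pipeline decomposes words with the WordPieces model and looks up
# each piece's summed (token + segment + position) embedding.
subword_vecs = {"play": np.array([1.0, 0.0]),
                "##ing": np.array([0.0, 1.0])}

def init_word_embedding(pieces, weights=None):
    """Linearly combine subword vectors into one initial word embedding."""
    vecs = [subword_vecs[p] for p in pieces]
    if weights is None:                          # default: uniform average
        weights = [1.0 / len(vecs)] * len(vecs)
    return sum(w * v for w, v in zip(weights, vecs))

print(init_word_embedding(["play", "##ing"]))  # [0.5 0.5]
```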

4 Experiments

(a) English Δ-RNNs.
(b) German Δ-RNNs.
(c) Spanish Δ-RNNs.
Figure 2: Training Δ-RNNs in each language (English, German, Spanish). The baseline model is trained and evaluated on language alone (L-L), the full model uses the multi-modal signal (LV-LV), and the target model is trained on LV but evaluated on L only (LV-L).

The experiments in this paper were conducted using the MS-COCO image-captioning dataset. Each image in the dataset has five captions provided by human annotators. We use the captions to create five different ground-truth splits. We translated each ground-truth split into German and Spanish using the Google Translation API, which was chosen as a state-of-the-art, independently evaluated MT tool that produces, according to our inspection of the results, idiomatic, syntactically and semantically faithful translations. To our knowledge, this represents the first multi-lingual MS-COCO dataset for situated learning. We tokenize the corpus into words and obtain a vocabulary of 16.6K words for English, 33.2K for German, and 18.2K for Spanish.

Table 1: Generalization performance of each model type, measured by negative log likelihood (Test-NLL) and perplexity (Test-PPL) on the English, machine-translated German, and machine-translated Spanish test data. Lower values indicate better performance. The baseline model (L-L) is trained and evaluated on linguistic data only. The full model (LV-LV) is trained and evaluated on both linguistic and visual data. The blind model (LV-L) is trained on both but evaluated on language only. The difference between L-L and LV-L illustrates the performance improvement. German and Spanish data are machine-translated (MT) and provide additional, but correlated, evidence. For comparison, Devlin et al. (2018) report a perplexity value for their (broad) English test data, using the same base model we use here to define input representations.

As our primary concern is the next-step prediction of words/tokens, we use negative log likelihood and perplexity to evaluate the models. This differs from the goals of machine translation or image captioning, which, in most cases, are concerned with ranking possible captions and measuring how similar the model’s generated sequences are to ground-truth target phrases.

Baseline results were obtained with neural language models trained on text alone. For the Δ-RNN, this meant implementing a model using only the text-only equations, without the visual context term. The best results were achieved using the BERT Large model (bidirectional Transformer, 24 layers, 1,024 dimensions, 16 attention heads; Devlin et al., 2018). We used the large pretrained model and then trained with visual context.

All models were trained to minimize the sequence loss of the sentences in the training split. The weight matrices of all models were initialized from a uniform distribution, biases were initialized to zero, and the Δ-RNN-specific gating-control biases were all initialized to one. Parameter updates calculated through back-propagation through time required unrolling the model over 49 steps in time (this length was determined based on validation set likelihood). All symbol sequences were zero-padded and appropriately masked to ensure efficient mini-batching. Gradients were hard-clipped in magnitude. Over mini-batches of 32 samples, model parameters were optimized using simple stochastic gradient descent, with a learning rate that was halved if the perplexity, measured at the end of each epoch, went up three or more times.
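The gradient clipping and learning-rate schedule can be sketched as follows; the clip bound and initial learning rate shown are illustrative placeholders, not the paper's exact values:

```python
def clip(grads, bound):
    """Hard-clip each gradient component to the interval [-bound, bound]."""
    return [max(-bound, min(bound, g)) for g in grads]

def maybe_halve(lr, ppl_history, patience=3):
    """Halve the learning rate once end-of-epoch validation perplexity has
    gone up `patience` or more times (the schedule described in the text)."""
    ups = sum(1 for a, b in zip(ppl_history, ppl_history[1:]) if b > a)
    return lr / 2.0 if ups >= patience else lr

print(clip([5.0, -3.0, 0.2], 1.0))             # [1.0, -1.0, 0.2]
print(maybe_halve(1.0, [100, 101, 102, 103]))  # 0.5
```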

To determine whether our multi-modal language models capture knowledge that is different from that of a text-only language model, we evaluate each model twice. First, we compute the model perplexity on the test set using the sentences’ visual context vectors. Next, we compute model perplexity on test sentences by feeding a null vector into the multi-modal model as the visual context. If the model truly picked up some semantic knowledge that is not exclusively dependent on the context vector, its perplexity in the second setting, while naturally worse than in the first, should still outperform the text-only baselines.

Ocean
  Δ-RNN:     surfing, sandy, filled, beach, market, crowded, topped, plays, cross, snowy
  MM-Δ-RNN:  boats, beach, pier, wetsuit, cloth, surfing, windsurfing, boardwalk, flying, biplane
Kite
  Δ-RNN:     plane, kites, airplane, surfboard, planes, airplanes, boats, jet, aircraft, jets
  MM-Δ-RNN:  kites, airplane, plane, airplanes, planes, airliner, helicopter, jets, biplane, jet
Subway
  Δ-RNN:     train, passenger, railroad, trains, gas, commuter, trolley, locomotive, steam, it's
  MM-Δ-RNN:  railroad, train, locomotive, trains, steam, gas, commuter, passenger, crowded, trolley
Racket
  Δ-RNN:     bat, batter, catcher, skateboard, umpire, soccer, women, pedestrians, players, uniform
  MM-Δ-RNN:  bat, players, batter, swing, catcher, hitter, ball, umpire, tennis, tatoos
Table 2: The ten words most closely related to each query word, rank ordered, for models trained without (Δ-RNN) and with (MM-Δ-RNN) visual input.

In Table 1, we report each model’s negative log likelihood (NLL) and per-word perplexity (PPL). PPL is calculated as:

PPL = exp( −(1/N) Σ_{t=1}^{N} log P(w_t | w_1, …, w_{t−1}) )
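Per-word perplexity is the exponentiated average negative log likelihood; a minimal helper with a worked check:

```python
import math

def perplexity(total_nll, n_tokens):
    """Per-word perplexity: exp of the average negative log likelihood."""
    return math.exp(total_nll / n_tokens)

# Sanity check: a uniform model over a 100-word vocabulary assigns each
# token probability 1/100, so its per-word perplexity is exactly 100.
n = 50
nll = n * math.log(100)
print(perplexity(nll, n))  # 100.0
```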

We observe that in all cases the multi-modal models outperform their respective text-only baselines. More importantly, the multi-modal models, when evaluated without the Inception-v3 representations on holdout samples, still perform better than the text-only baselines. The improvement in language generalization can be attributed to the visual context information provided during training, which enriches the representations of word sequences with knowledge of actual objects and actions.

Figure 2 shows the validation perplexity of the Δ-RNN on each language over the first 15 epochs of learning. We observe that throughout learning, the improvement in generalization afforded by the visual context is persistent. Validation performance was also tracked for the various GRU and LSTM models, where the same trend was observed.

4.1 Model Analysis

We analyze the decoders of the text-only and multi-modal models. We examine the parameter matrix U, which is directly involved in calculating the predictions of the underlying generative model. U can be thought of as “transposed embeddings,” an idea that has also been exploited to introduce further regularization into the neural language model learning process (Press and Wolf, 2016; Inan et al., 2016). If we treat each row of this matrix as the learned embedding for a particular word (we assume column-major orientation in implementation), we can calculate its proximity to other embeddings using cosine similarity.

Table 2 shows the top ten words for several randomly selected query terms using the decoder parameter matrix. By observing the different sets of nearest neighbors produced by the Δ-RNN and the multi-modal Δ-RNN (MM-Δ-RNN), we can see that the MM-Δ-RNN appears to have learned to combine the information from the visual context with the token sequence in its representations. For example, for the query “ocean”, we see that while the Δ-RNN does associate some relevant terms, such as “surfing” and “beach”, it also associates terms with marginal relevance to “ocean”, such as “market” and “plays”. Conversely, nearly all of the terms the MM-Δ-RNN associates with “ocean” are relevant to the query. The same is true for “kite” and “subway”. For “racket”, while the text-only baseline mostly associates the query with sports terms, especially sports equipment like “bat”, the MM-Δ-RNN is able to relate the query to the correct sport, “tennis”.

4.2 Conditional Sampling

To see how visual context influences the language model, we sample from the conditional generative model, using beam search to generate full sentences (Table 3). Words were ranked based on model probabilities.
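A generic beam-search sketch over a hypothetical next-word distribution (not the paper's model); it keeps the highest-probability prefixes at each step:

```python
import math

def beam_search(step_probs, beam_size, length):
    """Keep the `beam_size` highest log-probability prefixes at each step.
    `step_probs(prefix)` returns a {token: probability} dict."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for prefix, lp in beams:
            for tok, p in step_probs(prefix).items():
                candidates.append((prefix + (tok,), lp + math.log(p)))
        beams = sorted(candidates, key=lambda b: -b[1])[:beam_size]
    return [b[0] for b in beams]

# Toy two-step distribution, purely for illustration.
def step_probs(prefix):
    if not prefix:
        return {"a": 0.6, "the": 0.4}
    return {"dog": 0.7, "cat": 0.3}

print(beam_search(step_probs, beam_size=2, length=2))
# [('a', 'dog'), ('the', 'dog')]
```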

a skateboarder and person in front of skyscrapers. a person with skateboarder on air. a person doing a trick with skateboarder. a person with camera with blue background.
a food bowl on the table a bowl full of food on the table a green and red bowl on the table a salad bowl with chicken
a dog on blue bed with blanket. a dog sleeps near wooden table. a dog sleeps on a bed. a dog on some blue blankets.
Table 3: Some captions generated by the multi-modal -RNN in English.

5 Discussion and Conclusions

Perceptual context improves training multi-modal neural models compared to training on language alone. Specifically, we find that augmenting a predictive language model with images that illustrate the sentences being learned enhances its next-word prediction ability. The performance improvement persists even in situations devoid of visual input, when the model is being used as a pure language model.

The near state-of-the-art language model, using BERT, reflects the case of human language acquisition less than do the other models, which were trained “ab initio” in a situated context. BERT is pre-trained on a very large corpus, but it still picked up a performance improvement when fine-tuned on the visual context and language, as compared to the corpus language signal alone. We do not expect this to be a ceiling for visual augmentation: in the world of training LMs, the MS COCO corpus is, of course, a small dataset.

Neural language models, as used here, are contenders as cognitive and psycholinguistic models of the non-symbolic, implicit aspects of language representation. There is a great deal of evidence that something like a predictive language model exists in the human mind. The surprisal of a word or phrase refers to the degree of mismatch between what a human listener expected to be said next and what is actually said, for example, when a garden path sentence forces the listener to abandon a partial, incremental parse (Ferreira and Henderson, 1991; Hale, 2001). In the garden path sentence “The horse raced past the barn fell”, the final word “fell” forces the reader to revise their initial interpretation of “raced” as the active verb Bever (1970). More generally, the idea of predictive coding holds that the mind forms expectations before perception occurs (see Clark, 2013, for a review). How these predictions are formed is unclear. Predictive language models trained with a generic neural architecture, without specific linguistic universals, are a reasonable candidate for a model of predictive coding in language. This does not imply neuropsychological realism of the low-level representations or learning algorithms, and we cannot advocate for a specific neural architecture as being most plausible. However, we can show that an architecture that predicts linguistic input well learns better when its input mimics that of a human language learner.

A cognitive model of language processing might distinguish between symbolic language knowledge and processes that implement compositionality to produce semantics on the one hand, and implicit processes that leverage sequences and associations to produce expectations. With respect to acquiring the latter, implicit and predictive model, we note that children are exposed to a rich sensory environment, one more detailed than the environment provided to our model here. If even static visual input alone improves language acquisition, then what could a sensorily rich environment achieve? When a multi-modal learner is considered, then, perhaps, the language acquisition stimulus that has been famously labeled to be rather poor (Chomsky, 1959; Berwick et al., 2013), is not so poor after all.


Acknowledgments

We would like to thank Tomas Mikolov, Emily Pitler, Saranya Venkatraman, and Zixin Tang for comments. Part of this work was funded by the National Science Foundation (BCS-1734304 to D. Reitter).