The theory of situated cognition postulates that a person’s knowledge is inseparable from the physical or social context in which it is learned and used (Greeno and Moore, 1993). Similarly, Perceptual Symbol Systems theory holds that all of cognition (thought, language, reasoning, and memory) is grounded in perceptual features (Barsalou, 1999). Knowledge of language cannot be separated from its physical context, which allows words and sentences to be learned by grounding them in reference to objects or natural concepts on hand (see Roy and Reiter, 2005, for a review). Nor can knowledge of language be separated from its social context, where language is learned interactively through communicating with others to facilitate problem-solving. Simply put, language does not occur in a vacuum.
Yet, statistical language models, typically connectionist systems, are often trained in just such a vacuum. Sequences of symbols, such as sentences or phrases composed of words in some language (English or German, say), are fed into the model independently of any real-world context they might describe. In the classical language modeling framework, a model learns to predict a word based on a history of words it has seen so far. While these models learn a great deal of linguistic structure from these symbol sequences alone, acquiring the essence of basic syntax, it is highly unlikely that this approach can create models that acquire much in terms of semantics or pragmatics, which are integral to the human experience of language. How might one build neural language models that “understand” the semantic content held within the symbol sequences, of any language, presented to them?
In this paper, we take a small step towards a model that understands language as a human does by training a neural model jointly on corresponding linguistic and visual data. From an image-captioning dataset, we create a multi-lingual corpus in which sentences are mapped to the real-world images they describe. We ask how adding such real-world context during training can improve language model performance. We create a unified multi-modal connectionist architecture that incorporates visual context and uses either δ-RNN (Ororbia II et al., 2017), LSTM (Hochreiter and Schmidhuber, 1997), or GRU (Cho et al., 2014) units. We find that the models acquire more knowledge of language than if they were trained without corresponding, real-world visual context.
2 Related Work
Both behavioral and neuroimaging studies have found considerable evidence for the contribution of perceptual information to linguistic tasks (Barsalou, 2008). It has long been held that language is acquired jointly with perception through interaction with the environment (e.g. Frank et al., 2008). Eye-tracking studies show that visual context influences word recognition and syntactic parsing from even the earliest moments of comprehension (Tanenhaus et al., 1995).
Computational cognitive models can account for bootstrapped learning of word meaning and syntax when language is paired with perceptual experience (Abend et al., 2017) and for the ability of children to rapidly acquire new words by inferring the referent from their physical environment (Alishahi et al., 2008). Some distributional semantics models integrate word co-occurrence data with perceptual data, either to achieve a better model of language as it exists in the minds of humans (Baroni, 2016; Johns and Jones, 2012; Kievit-Kylar and Jones, 2011; Lazaridou et al., 2014) or to improve performance on machine learning tasks such as object recognition (Frome et al., 2013; Lazaridou et al., 2015a), image captioning (Kiros et al., 2014; Lazaridou et al., 2015b), or image search (Socher et al., 2014).
Integrating language and perception can facilitate language acquisition by allowing models to infer how a new word is used from the perceptual features of its referent (Johns and Jones, 2012) or to fast-map between a new word and a new object in the environment (Lazaridou et al., 2014). Likewise, this integration allows models to infer the perceptual features of an unobserved referent from how a word is used in language (Johns and Jones, 2012; Lazaridou et al., 2015b). As a result, language data can be used to improve object recognition by providing information about unobserved or infrequently observed objects (Frome et al., 2013) or by differentiating objects that often co-occur in photos (e.g., cats and sofas; Lazaridou et al., 2015a).
By representing the referents of concrete nouns as arrangements of elementary visual features Biederman (1987), Kievit-Kylar and Jones (2011) found that the visual features of nouns capture semantic typicality effects, and that a combined representation, consisting of both visual features and word co-occurrence data, more strongly correlates with human judgments of semantic similarity than representations extracted from a corpus alone. While modeling similarity judgments is distinct from the problem of predictive language modeling, we take this finding as evidence that visual perception informs semantics, which suggests there are gains to be had integrating perception with predictive language models.
In contrast to prior work in machine learning, where mappings between vision and language have been examined Kiros et al. (2014); Vinyals et al. (2015); Xu et al. (2015), our goal in integrating visual and linguistic data is not to accomplish a task such as image search/captioning that inherently requires a mapping between these modalities. Rather, our goal is to show that, since perceptual information is intrinsic to how humans process language, a language model that is trained on both visual and linguistic data will be a better model, consistently across languages, than one trained on linguistic data alone.
Due to their ability to constrain predictions on the basis of preceding context, language models play a central role in natural-language and speech processing applications. However, the psycholinguistic questions surrounding how people acquire and use linguistic knowledge are fundamentally different from the aims of machine learning. Using NLP-style language models to address psycholinguistic questions is a new approach that integrates well with the theory of predictive coding in cognitive psychology (Clark, 2013; Rao and Ballard, 1999). For language processing, this means that when reading text or comprehending speech, humans constantly anticipate what will be said next. Predictive coding in humans is a fast, implicit cognitive process similar to the kind of sequence learning that recurrent neural models excel at. We do not propose recurrent neural models as direct accounts of human language processing. Instead, our intent is to use a general-purpose machine learning algorithm as a tool to investigate the informational characteristics of the language learning task. More specifically, we use machine learning to explore whether natural languages are most easily learned when situated in an environmental context and grounded in perception.
3 The Multi-modal Neural Architecture
We will evaluate the multi-modal training approach on several well-known complex architectures, including the LSTM, and further examine the effect of using pre-trained BERT embeddings. However, to describe the neural model simply, we start from the Differential State Framework (DSF; Ororbia II et al., 2017), which unifies gated recurrent architectures under the general view that state memory is a simple parametrized mixture of “fast” and “slow” states. Our aim is to model sequences of symbols, such as the words that compose sentences, where at each time step we process $w_t$, the one-hot encoding of a token.[1]

[1] One-hot encoding represents tokens as binary-valued vectors with one dimension for each type of token. Only one dimension has a non-zero value, indicating the presence of a token of that type.
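As a toy illustration of this encoding (the vocabulary and helper function below are ours, for exposition only):

```python
import numpy as np

def one_hot(token, vocab):
    """Return the one-hot (1-of-k) encoding of `token` over `vocab`."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab.index(token)] = 1.0   # exactly one non-zero dimension
    return vec

vocab = ["a", "dog", "sleeps"]      # toy vocabulary
v = one_hot("dog", vocab)
```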
One of the simplest models that can be derived from the DSF is the δ-RNN (Ororbia II et al., 2017). A δ-RNN is a simple gated RNN that captures longer-term dependencies in sequences through the use of a parametrized, flexible state “mixing” function. The model computes a new state $h_t$ at a given time step by comparing a fast state (which is proposed after accounting for the current token) and a slow state (a form of long-term memory). The model is defined by parameters $\Theta = \{W, V, \alpha, \beta_1, \beta_2, b_r\}$ (input-to-hidden weights $W$, recurrent weights $V$, gating-control coefficients $\alpha, \beta_1, \beta_2$, and the rate-gate bias $b_r$). Inference is defined as:

$d^{1}_t = \alpha \odot V h_{t-1} \odot W w_t$  (1)
$d^{2}_t = \beta_1 \odot V h_{t-1} + \beta_2 \odot W w_t$  (2)
$z_t = \phi_{hid}(d^{1}_t + d^{2}_t)$  (3)
$h_t = \Phi((1 - r) \odot z_t + r \odot h_{t-1})$  (4)
$r = 1 / (1 + \exp(-[V h_{t-1} + b_r]))$  (5)

where $w_t$ is the 1-of-k encoding of the word at time $t$ and $\odot$ denotes the element-wise product. Note that $\alpha$, $\beta_1$, and $\beta_2$ are learnable bias vectors that modulate internal multiplicative interactions. The rate gate $r$ controls how slow- and fast-moving memory states are mixed inside the model. In contrast to the model originally trained in Ororbia II et al. (2017), the outer activation is the linear rectifier, $\Phi(v) = \max(0, v)$, instead of the identity or hyperbolic tangent, because we found that it worked much better. The inner activation function $\phi_{hid}$ is $\tanh$.
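A minimal NumPy sketch of one δ-RNN state update, under our reading of the update equations above (toy dimensions and initializations are illustrative; see Ororbia II et al., 2017 for the authoritative formulation):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def delta_rnn_step(x, h_prev, W, V, alpha, beta1, beta2, b_r):
    """One delta-RNN update: mix a fast proposed state with the slow state h_prev."""
    Wx, Vh = W @ x, V @ h_prev
    d1 = alpha * Vh * Wx                      # second-order (multiplicative) term
    d2 = beta1 * Vh + beta2 * Wx              # first-order (additive) term
    z = np.tanh(d1 + d2)                      # inner activation: tanh
    r = sigmoid(Vh + b_r)                     # rate gate
    return np.maximum(0.0, (1 - r) * z + r * h_prev)  # outer activation: ReLU

rng = np.random.default_rng(0)
k, n = 5, 4                                   # toy vocabulary and state sizes
W = rng.normal(size=(n, k)) * 0.1
V = rng.normal(size=(n, n)) * 0.1
alpha, beta1, beta2, b_r = np.ones(n), np.ones(n), np.ones(n), np.ones(n)
x = np.eye(k)[2]                              # one-hot input token
h = delta_rnn_step(x, np.zeros(n), W, V, alpha, beta1, beta2, b_r)
```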
To integrate visual context information into the δ-RNN, we fuse the model with a neural vision system, motivated by work done in automated image captioning (Xu et al., 2015). We incorporate a state-of-the-art convolutional network into the δ-RNN model, namely the Inception-v3 network (Szegedy et al., 2016),[2] in order to create a multi-modal δ-RNN model (MM-δ-RNN; see Figure 1). Since our focus is on language modeling, the parameters of the vision network are fixed.

[2] In preliminary experiments, we also examined VGGNet and a few others, but found that Inception worked best when it came to acquiring more general distributed representations of natural images.
To obtain a distributed representation of an image from the Inception-v3 network, we extract the vector $c$ produced by the final max-pooling layer after running an image through the model (note that this operation occurs right before the final, fully-connected processing layers, which are usually task-specific parameters, as in object classification). The δ-RNN can make use of the information in this visual context vector if we modify its state computation in one of two ways. The first way would be to modify the inner state to be a linear combination of the data-dependent pre-activation, the filtration, and a learned linear mapping of $c$ as follows:

$z_t = \phi_{hid}(d^{1}_t + d^{2}_t + M c)$  (7)

where $M$ is a learnable matrix of synaptic connections that connects the visual context representation with the inner state. The second way to modify the δ-RNN would be to change its outer mixing function instead:

$h_t = \Phi((1 - r) \odot z_t + r \odot h_{t-1}) \odot (M c)$  (8)
Here, in Equation 8, the linearly-mapped visual context embedding interacts with the currently computed state through a multiplicative operation, allowing the visual context to persist and operate in a longer-term capacity. In either case, using a parameter matrix frees us from having to set the dimensionality of the hidden state equal to that of the context vector produced by the Inception-v3 network.
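The second integration strategy, multiplicative gating of the outer state by the mapped context, can be sketched as follows (shapes and the name fuse_outer are illustrative, not taken from our implementation):

```python
import numpy as np

def fuse_outer(h_candidate, c, M):
    """Multiplicatively gate a recurrent state with a mapped visual context.

    M projects the vision network's context vector c into the hidden dimension,
    so the hidden size need not match the vision network's output size.
    """
    return h_candidate * (M @ c)

rng = np.random.default_rng(1)
n, d = 4, 6                      # hidden size vs. context-vector size (toy values)
M = rng.normal(size=(n, d)) * 0.1
c = rng.normal(size=d)           # stand-in for an Inception-v3 pooling vector
h = fuse_outer(np.ones(n), c, M)
```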
We do not use regularization techniques with this model. The application of regularization techniques is, in principle, possible (and typically improves performance of the -RNN), but it is damaging to performance in this particular case, where an already compressed and regularized representation of the images from Inception-v3 serves as input to the multi-modal language modeling network.
Let $w_1, \ldots, w_N$ be a variable-length sequence of $N$ words corresponding to an image $I$. In general, the distribution over the variables follows the graphical model:

$P(w_1, \ldots, w_N \mid I) = \prod_{t=1}^{N} P(w_t \mid w_1, \ldots, w_{t-1}, I)$
For all model variants, the state $h_t$ calculated at any time step is fed into a maximum-entropy classifier[3] defined as:

$P(w_t = i \mid h_t) = \dfrac{\exp(U_i h_t)}{\sum_j \exp(U_j h_t)}$

The model parameters $\Theta$ are optimized with respect to the sequence negative log likelihood:

$\mathcal{L} = -\sum_{t=1}^{N} \log P(w_t \mid h_t)$

[3] Bias term omitted for clarity.
We differentiate with respect to this cost function to calculate gradients.
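The classifier and loss amount to a softmax over decoder weights followed by a summed cross-entropy. A self-contained sketch (toy sizes; $U$ denotes the softmax weights):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())          # shift for numerical stability
    return e / e.sum()

def sequence_nll(states, targets, U):
    """Negative log likelihood of target word indices under softmax(U h_t)."""
    return -sum(np.log(softmax(U @ h)[w]) for h, w in zip(states, targets))

rng = np.random.default_rng(2)
k, n = 5, 4
U = np.zeros((k, n))                 # zero weights -> uniform P = 1/k everywhere
states = [rng.normal(size=n) for _ in range(3)]
loss = sequence_nll(states, [0, 2, 4], U)   # equals 3 * log(5) for zero weights
```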
3.1 GRU, LSTM and BERT variants
Does visually situated language learning benefit from the specific architecture of the δ-RNN, or does the proposal work with state-of-the-art language models? We applied the same architecture to Gated Recurrent Units (GRU; Cho et al., 2014), Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber, 1997), and BERT (Devlin et al., 2018). We train these models on text alone and compare them to the two variations of the multi-modal δ-RNN described in the previous section. The multi-modal GRU, with context information directly integrated, is defined as follows:

$z_t = \sigma(W_z w_t + V_z h_{t-1})$
$r_t = \sigma(W_r w_t + V_r h_{t-1})$
$\hat{h}_t = \phi_{hid}(W_h w_t + V_h (r_t \odot h_{t-1}))$
$h_t = ((1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t) \odot (M c)$
where we note that the parameter matrix $M$, which maps the visual context into the GRU state, effectively gates the outer function.[4] The multi-modal variant of the LSTM (with peephole connections) is defined as follows:

$i_t = \sigma(W_i w_t + V_i h_{t-1} + P_i s_{t-1})$
$f_t = \sigma(W_f w_t + V_f h_{t-1} + P_f s_{t-1})$
$\hat{s}_t = \phi_{hid}(W_s w_t + V_s h_{t-1})$
$s_t = f_t \odot s_{t-1} + i_t \odot \hat{s}_t$
$o_t = \sigma(W_o w_t + V_o h_{t-1} + P_o s_t)$
$h_t = (o_t \odot \phi_{hid}(s_t)) \odot (M c)$

where $s_t$ is the cell state and $P_i, P_f, P_o$ are the peephole connections.

[4] In preliminary experiments, we tried both methods of integration, Equations 7 and 8. We ultimately found the second formulation to give better performance.
We furthermore created one more variant of each multi-modal RNN by initializing a portion of its input-to-hidden weights with embeddings extracted from the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2018). This corresponds to initializing the input-to-hidden matrix $W$ in the δ-RNN and the corresponding input matrices in the LSTM and GRU. Note that in our results, we only report the best-performing model, which turned out to be the LSTM variant. Since the models in this work operate at the word level and BERT operates at the subword level, we create initial word embeddings by first decomposing each word into its subword components, according to the WordPieces model (Wu et al., 2016), and then extracting the relevant BERT representation for each. For each subword token, a representation is created by summing together a specific learned token embedding, a segmentation embedding, and a position embedding. For a target word, we linearly combine the subword input representations and initialize the relevant weight with this final embedding.
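The subword-to-word initialization amounts to a linear combination of subword vectors. A sketch with toy three-dimensional vectors and a hypothetical WordPiece split (the real procedure sums BERT's learned token, segment, and position embeddings; the split and values below are ours):

```python
import numpy as np

def word_init_from_subwords(subword_vecs, weights=None):
    """Linearly combine subword representations into one word-level init vector."""
    subword_vecs = np.asarray(subword_vecs, dtype=np.float64)
    if weights is None:                       # default: a simple average
        weights = np.full(len(subword_vecs), 1.0 / len(subword_vecs))
    return weights @ subword_vecs

# Hypothetical WordPiece split of "skateboarder" with toy basis vectors.
pieces = {"skate": [1.0, 0.0, 0.0], "##board": [0.0, 1.0, 0.0], "##er": [0.0, 0.0, 1.0]}
init = word_init_from_subwords(list(pieces.values()))
```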
4 Experiments

The experiments in this paper were conducted using the MS-COCO image-captioning dataset.[5] Each image in the dataset has five captions provided by human annotators. We use the captions to create five different ground-truth splits. We translated each ground-truth split into German and Spanish using the Google Translation API, chosen as a state-of-the-art, independently evaluated MT tool that, according to our inspection of the results, produces idiomatic, syntactically and semantically faithful translations. To our knowledge, this represents the first multi-lingual MS-COCO dataset for situated learning. We tokenize the corpus into words and obtain vocabularies of 16.6K for English, 33.2K for German, and 18.2K for Spanish.

[5] https://competitions.codalab.org/competitions/3221
[Table 1 columns: English | German MT | Spanish MT]
As our primary concern is the next-step prediction of words/tokens, we use negative log likelihood and perplexity to evaluate the models. This differs from the goals of machine translation or image captioning, which are, in most cases, concerned with ranking possible captions by measuring how similar the model’s generated sequences are to ground-truth target phrases.
Baseline results were obtained with neural language models trained on text alone. For the δ-RNN, this meant implementing a model using only Equations 1–7. The best results were achieved using the BERT Large model (bidirectional Transformer; 24 layers, 1024 dimensions, 16 attention heads; Devlin et al., 2018). We used the large pre-trained model and then trained it with visual context.
All models were trained to minimize the sequence loss of the sentences in the training split. The weight matrices of all models were initialized from a uniform distribution, biases were initialized to zero, and the δ-RNN-specific biases were all initialized to one. Parameter updates calculated through back-propagation through time required unrolling the model over 49 steps in time (this length was determined based on validation-set likelihood). All symbol sequences were zero-padded and appropriately masked to ensure efficient mini-batching. Gradients were hard-clipped at a fixed magnitude bound. Over mini-batches of 32 samples, model parameters were optimized using simple stochastic gradient descent; the learning rate was halved if the perplexity, measured at the end of each epoch, went up three or more times.
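The two non-standard pieces of this optimization loop, hard gradient clipping and perplexity-triggered learning-rate halving, can be sketched as follows (the clipping bound, the patience of three, and the exact rise-counting criterion here are illustrative assumptions, not the precise logic of our training code):

```python
import numpy as np

def clip_gradients(grads, bound):
    """Hard-clip each gradient element to [-bound, bound]."""
    return [np.clip(g, -bound, bound) for g in grads]

class HalvingSchedule:
    """Halve the learning rate once validation perplexity has risen `patience` times."""
    def __init__(self, lr, patience=3):
        self.lr, self.patience = lr, patience
        self.best, self.bad = float("inf"), 0

    def step(self, val_ppl):
        if val_ppl > self.best:          # perplexity went up relative to the best seen
            self.bad += 1
            if self.bad >= self.patience:
                self.lr /= 2.0
                self.bad = 0
        else:
            self.best = val_ppl
        return self.lr

clipped = clip_gradients([np.array([2.0, -3.0, 0.5])], bound=1.0)
sched = HalvingSchedule(lr=1.0)
for ppl in [100.0, 90.0, 95.0, 96.0, 97.0]:   # three rises after the best -> halve
    lr = sched.step(ppl)
```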
To determine if our multi-modal language models capture knowledge that is different from a text-only language model, we evaluate each model twice. First, we compute the model perplexity on the test set using the sentences’ visual context vectors. Next, we compute model perplexity on test sentences by feeding in a null-vector to the multi-modal model as the visual context. If the model did truly pick up some semantic knowledge that is not exclusively dependent on the context vector, its perplexity in the second setting, while naturally worse than the first setting, should still outperform text-only baselines.
In Table 1, we report each model’s negative log likelihood (NLL) and per-word perplexity (PPL). PPL is calculated as:

$\mathrm{PPL} = \exp\!\left(-\dfrac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid h_t)\right)$
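Per-word perplexity is simply the exponentiated average negative log likelihood; as a sketch:

```python
import math

def perplexity(total_nll, n_tokens):
    """Per-word perplexity: exponentiated average negative log likelihood."""
    return math.exp(total_nll / n_tokens)

# A model that is uniform over a 5-word vocabulary has NLL = N * log(5),
# so its per-word perplexity is exactly 5.
ppl = perplexity(total_nll=3 * math.log(5), n_tokens=3)
```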
We observe that in all cases the multi-modal models outperform their respective text-only baselines. More importantly, the multi-modal models, when evaluated without the Inception-v3 representations on holdout samples, still perform better than the text-only baselines. The improvement in language generalization can be attributed to the visual context information provided during training, which enriches the models’ representations of word sequences with knowledge of actual objects and actions.
Figure 2 shows the validation perplexity of the δ-RNN on each language over the first 15 epochs of learning. We observe that the improvement in generalization afforded by the visual context persists throughout learning. Validation performance was also tracked for the various GRU and LSTM models, where the same trend was observed.
4.1 Model Analysis
We analyze the decoders of the text-only and multi-modal models. We examine the decoder parameter matrix $U$, the output weights directly involved in calculating the predictions of the underlying generative model. $U$ can be thought of as “transposed embeddings,” an idea that has also been exploited to introduce further regularization into the neural language model learning process (Press and Wolf, 2016; Inan et al., 2016). If we treat each row of this matrix as the learned embedding for a particular word (we assume column-major orientation in implementation), we can calculate its proximity to other embeddings using cosine similarity.
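This nearest-neighbor query over decoder rows can be sketched as follows (the toy vectors and vocabulary below are illustrative, not learned values):

```python
import numpy as np

def nearest_words(query, embeddings, vocab, top_n=3):
    """Rank vocab words by cosine similarity of their decoder rows to `query`'s row."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ E[vocab.index(query)]      # cosine similarity to the query row
    order = np.argsort(-sims)             # most similar first
    return [vocab[i] for i in order if vocab[i] != query][:top_n]

# Toy decoder rows: "beach" and "surfing" point roughly the same way as "ocean".
vocab = ["ocean", "beach", "surfing", "market"]
E = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.3], [-0.5, 1.0]])
neighbors = nearest_words("ocean", E, vocab, top_n=2)
```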
Table 2 shows the top ten words for several randomly selected query terms using the decoder parameter matrix. By observing the different sets of nearest neighbors produced by the δ-RNN and the multi-modal δ-RNN (MM-δ-RNN), we can see that the MM-δ-RNN appears to have learned to combine the information from the visual context with the token sequence in its representations. For example, for the query “ocean”, we see that while the δ-RNN does associate some relevant terms, such as “surfing” and “beach”, it also associates terms with marginal relevance to “ocean” such as “market” and “plays”. Conversely, nearly all of the terms the MM-δ-RNN associates with “ocean” are relevant to the query. The same is true for “kite” and “subway”. For “racket”, while the text-only baseline mostly associates the query with sports terms, especially sports equipment like “bat”, the MM-δ-RNN is able to relate the query to the correct sport, “tennis”.
4.2 Conditional Sampling
To see how visual context influences the language model, we sample from the conditional generative model. Beam search allows us to generate full sentences (Table 3); words were ranked based on model probabilities.
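Beam search keeps only the highest-scoring partial sentences at each step. A minimal sketch, with a toy distribution standing in for the trained conditional model (toy_lm and all probabilities are ours, for illustration):

```python
import math

def beam_search(step_fn, start, beam_size=3, max_len=4):
    """Keep the `beam_size` highest log-probability partial sentences per step.

    `step_fn(prefix)` returns {next_word: prob}; generation stops at "</s>".
    """
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == "</s>":              # finished beams carry over
                candidates.append((prefix, score))
                continue
            for word, p in step_fn(prefix).items():
                candidates.append((prefix + [word], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: -b[1])[:beam_size]
    return [" ".join(p[1:-1]) for p, _ in beams if p[-1] == "</s>"]

def toy_lm(prefix):
    """Toy bigram distribution standing in for the multi-modal language model."""
    table = {"<s>": {"a": 0.9, "the": 0.1},
             "a": {"dog": 0.6, "bowl": 0.4},
             "the": {"dog": 1.0},
             "dog": {"</s>": 1.0}, "bowl": {"</s>": 1.0}}
    return table[prefix[-1]]

sentences = beam_search(toy_lm, "<s>", beam_size=2)
```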
Sample captions generated under visual context (Table 3):
- a skateboarder and person in front of skyscrapers. a person with skateboarder on air. a person doing a trick with skateboarder. a person with camera with blue background.
- a food bowl on the table. a bowl full of food on the table. a green and red bowl on the table. a salad bowl with chicken.
- a dog on blue bed with blanket. a dog sleeps near wooden table. a dog sleeps on a bed. a dog on some blue blankets.
5 Discussion and Conclusions
Training multi-modal neural models with perceptual context improves them compared to training on language alone. Specifically, we find that augmenting a predictive language model with images that illustrate the sentences being learned enhances its next-word prediction ability. The performance improvement persists even in situations devoid of visual input, when the model is being used as a pure language model.
The near state-of-the-art language model, using BERT, reflects the case of human language acquisition less than do the other models, which were trained “ab initio” in a situated context. BERT is pre-trained on a very large corpus, but it still picked up a performance improvement when fine-tuned on the visual context and language, as compared to the corpus language signal alone. We do not expect this to be a ceiling for visual augmentation: in the world of training LMs, the MS COCO corpus is, of course, a small dataset.
Neural language models, as used here, are contenders as cognitive and psycholinguistic models of the non-symbolic, implicit aspects of language representation. There is a great deal of evidence that something like a predictive language model exists in the human mind. The surprisal of a word or phrase refers to the degree of mismatch between what a human listener expected to be said next and what is actually said, for example, when a garden path sentence forces the listener to abandon a partial, incremental parse (Ferreira and Henderson, 1991; Hale, 2001). In the garden path sentence “The horse raced past the barn fell”, the final word “fell” forces the reader to revise their initial interpretation of “raced” as the active verb Bever (1970). More generally, the idea of predictive coding holds that the mind forms expectations before perception occurs (see Clark, 2013, for a review). How these predictions are formed is unclear. Predictive language models trained with a generic neural architecture, without specific linguistic universals, are a reasonable candidate for a model of predictive coding in language. This does not imply neuropsychological realism of the low-level representations or learning algorithms, and we cannot advocate for a specific neural architecture as being most plausible. However, we can show that an architecture that predicts linguistic input well learns better when its input mimics that of a human language learner.
A cognitive model of language processing might distinguish between, on the one hand, symbolic language knowledge and processes that implement compositionality to produce semantics, and, on the other, implicit processes that leverage sequences and associations to produce expectations. With respect to acquiring the latter implicit, predictive model, we note that children are exposed to a rich sensory environment, one more detailed than the environment provided to our model here. If even static visual input alone improves language acquisition, what could a sensorily rich environment achieve? When a multi-modal learner is considered, then perhaps the language acquisition stimulus that has famously been characterized as rather poor (Chomsky, 1959; Berwick et al., 2013) is not so poor after all.
We would like to thank Tomas Mikolov, Emily Pitler, Saranya Venkatraman, and Zixin Tang for comments. Part of this work was funded by the National Science Foundation (BCS-1734304 to D. Reitter).
- Abend et al. (2017) Omri Abend, Tom Kwiatkowski, Nathaniel J. Smith, Sharon Goldwater, and Mark Steedman. 2017. Bootstrapping language acquisition. Cognition, 164:116 – 143.
- Alishahi et al. (2008) Afra Alishahi, Afsaneh Fazly, and Suzanne Stevenson. 2008. Fast mapping in word learning: What probabilities tell us. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 57–64. Association for Computational Linguistics.
- Baroni (2016) Marco Baroni. 2016. Grounding distributional semantics in the visual world. Language and Linguistics Compass, 10(1):3–13.
- Barsalou (1999) Lawrence W Barsalou. 1999. Perceptual symbol systems. Behavioral and Brain Sciences, 22(4):577–660.
- Barsalou (2008) Lawrence W Barsalou. 2008. Grounded cognition. Annual Review of Psychology, 59:617–645.
- Berwick et al. (2013) Robert C Berwick, Noam Chomsky, and Massimo Piattelli-Palmarini. 2013. Poverty of the stimulus stands: Why recent challenges fail. In Rich Languages From Poor Inputs, chapter 1, pages 19–42. Oxford University Press.
- Bever (1970) Thomas G Bever. 1970. The cognitive basis for linguistic structures. In Cognition and the development of language, pages 279–362.
- Biederman (1987) Irving Biederman. 1987. Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2):115.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Chomsky (1959) Noam Chomsky. 1959. A review of BF Skinner’s verbal behavior. Language, 35(1):26–58.
- Clark (2013) Andy Clark. 2013. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and brain sciences, 36(3):181–204.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Ferreira and Henderson (1991) Fernanda Ferreira and John M Henderson. 1991. Recovery from misanalyses of garden-path sentences. Journal of Memory and Language, 30(6):725–745.
- Frank et al. (2008) Michael C Frank, Noah D Goodman, and Joshua B Tenenbaum. 2008. A Bayesian framework for cross-situational word-learning. In Advances in neural information processing systems, pages 457–464.
- Frome et al. (2013) Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129.
- Greeno and Moore (1993) James G Greeno and Joyce L Moore. 1993. Situativity and symbols: Response to Vera and Simon. Cognitive Science, 17(1):49–59.
- Hale (2001) John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, pages 1–8, Pittsburgh, PA.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Inan et al. (2016) Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462.
- Johns and Jones (2012) Brendan T Johns and Michael N Jones. 2012. Perceptual inference through global lexical similarity. Topics in Cognitive Science, 4(1):103–120.
- Kievit-Kylar and Jones (2011) Brent Kievit-Kylar and Michael Jones. 2011. The semantic pictionary project. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, pages 2229–2234, Austin, TX. Cognitive Science Society.
- Kiros et al. (2014) Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.
- Lazaridou et al. (2014) Angeliki Lazaridou, Elia Bruni, and Marco Baroni. 2014. Is this a wampimuk? cross-modal mapping between distributional semantics and the visual world. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1403–1414.
- Lazaridou et al. (2015a) Angeliki Lazaridou, Georgiana Dinu, Adam Liska, and Marco Baroni. 2015a. From visual attributes to adjectives through decompositional distributional semantics. Transactions of the Association for Computational Linguistics, 3:183–196.
- Lazaridou et al. (2015b) Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015b. Combining language and vision with a multimodal skip-gram model. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 153–163, Denver, Colorado. Association for Computational Linguistics.
- Ororbia II et al. (2017) Alexander G. Ororbia II, Tomas Mikolov, and David Reitter. 2017. Learning simpler language models with the differential state framework. Neural Computation, 29(12):3327–3352.
- Press and Wolf (2016) Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859.
- Rao and Ballard (1999) Rajesh PN Rao and Dana H Ballard. 1999. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79.
- Roy and Reiter (2005) Deb Roy and Ehud Reiter. 2005. Connecting language to the world. Artificial Intelligence, 167(1-2):1–12.
- Socher et al. (2014) Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.
- Tanenhaus et al. (1995) MK Tanenhaus, MJ Spivey-Knowlton, KM Eberhard, and JC Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217):1632–1634.
- Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.