A key way for machines to exhibit intelligence is to perceive the world around them – and to communicate with humans in natural language about that world. To converse naturally with humans, machines must understand the natural things humans say about the world they live in, and respond in kind. This involves understanding what they perceive, e.g. the images they see, what those perceptions mean semantically for humans, and how human personality shapes the language and conversations derived from these observations.
In this work we take a step towards these goals by considering the setting of grounded dialogue based on a given image, a setting naturally engaging to humans (Hu et al., 2014). We collect a large set of human-human crowdworker conversations, with the aim of training a model to engage a human in a similar fashion. To that end, the crowdworkers are asked to converse while emulating given personality traits that were chosen in advance (e.g., optimistic, skeptical or frivolous, with 215 different possible choices). Our dataset, Image-Chat, consists of 202k diverse images with resulting personality-conditioned dialogues, yielding 401k utterances over the images. This gives a rich source to train models where the grounding on image and conditioning on personality can be controlled to produce engaging conversations with humans. The dataset is made publicly available in ParlAI (http://parl.ai).
We extend the TransResNet model of Shuster et al. (2018) to handle multimodal dialogue: it uses Transformer architectures for encoding dialogue history and responses, ResNet architectures for encoding images, and additional layers to encode personality. We propose ways to fuse these modalities together and perform a detailed study on our new dataset, including automatic metrics, ablations, and human evaluations of our models using crowdworkers. The results indicate it is possible to produce engaging grounded conversations in this setting, with our best model being preferred to human conversationalists 47.7% of the time.
2 Related Work
The majority of work in dialogue is not grounded in perception, e.g. much recent work explores sequence-to-sequence models or retrieval models for goal-directed (Henderson et al., 2014; Bordes et al., 2017) or chit-chat tasks (Vinyals and Le, 2015; Sordoni et al., 2015; Zhang et al., 2018). While these tasks are text-based only, many of the techniques developed can likely be transferred for use in multimodal systems, for example using state-of-the-art Transformer representations for text (Mazaré et al., 2018) as a sub-component.
In the area of language and vision, one of the most widely studied tasks is image captioning, which involves a single-turn utterance given an image. This typically means producing a single descriptive sentence about the input image, in contrast to the conversational utterances of dialogue. Popular datasets include COCO (Chen et al., 2015) and Flickr30k (Young et al., 2014). Again, a variety of sequence-to-sequence (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018) and retrieval models (Gu et al., 2017; Faghri et al., 2017; Nam et al., 2016) have been applied. These tasks measure the ability of models to understand the content of an image, but not to carry out an engaging conversation grounded in perception. Some works have extended image captioning from being purely factual towards more engaging captions by incorporating style and personality, while still being single-turn, e.g. (Chandrasekaran et al., 2017; Mathews et al., 2018, 2016; Gan et al., 2017). In particular, the work of Shuster et al. (2018) builds a large dataset involving personality-based image captions. Our work builds upon this dataset and extends it to multi-turn dialogue.
Figure 1: Example dialogues from the Image-Chat training set (images omitted). Each conversation is conditioned on a pair of personality traits for speakers A and B.

(A: Stylish, B: Fatalistic)
A: Riding a mechanical bull in a skirt is just my style.
B: You'd probably fall off and get hurt.
A: And everyone would be copying me for it! It'll be trendy!

(A: Fearful, B: Miserable)
A: I just heard something out there and I have no idea what it was.
B: It was probably a wolf coming to eat us because you talk too much.
A: I would never go camping in the woods for this very reason.

(A: Money-Minded, B: Glamorous)
A: You know money doesn't grow on trees.
B: I could see some high society ladies having their brunch overlooking this canal.
A: I could see them spending way too much on avocado toast here.
Visual question answering (Antol et al., 2015) and visual dialogue (Das et al., 2017) are another set of tasks which employ vision and language. They require the machine to answer factual questions about the contents of the image, either in single turn or dialogue form. They do not attempt to model natural conversation, but rather assess whether the machine can perform basic perception over the image via a series of questions.
There are some works which directly address dialogue grounded with vision. The work of Pasunuru and Bansal (2018) assesses the ability to execute dialogue given video of computer soccer games. The work of Huber et al. (2018) investigates the use of sentiment-based visual features and facial expressions for emotional image-based dialogue. Perhaps the most related work to ours is Mostafazadeh et al. (2017). Their work considers (visual context, textual context, question, response) tuples, and builds a testing dataset based on 4k eventful images. In contrast, we provide training, validation and testing sets for our task over 202k images, and consider a general set of images and dialogues, not just events and questions plus responses. Importantly, we also consider the role of personality in grounded dialogue to create engaging conversations.
The Image-Chat dataset is a large collection of (image, personality trait for speaker A, personality trait for speaker B, dialogue between A & B) tuples that we collected using crowdworkers, made available in ParlAI (http://parl.ai). Each dialogue consists of consecutive turns by speakers A and B, typically three turns: (i) an utterance from A, followed by (ii) an utterance from B, and then (iii) an utterance from A. No particular constraints are placed on the kinds of utterance; we only ask the speakers to use the provided personality trait, and to respond to the given image and dialogue history in an engaging way.
Our work builds upon the work of Shuster et al. (2018), where the authors built a dataset for image captioning, not dialogue. Effectively, we use much of the same setup as in that work, but extend it to two-party conversations.
Shuster et al. (2018) considered 215 possible personality traits which were constructed by selecting a subset from a curated list of 638 traits (http://ideonomy.mit.edu/essays/traits.html). In our work, we use this same set, but apply it to both speakers A and B, who will be assigned different traits for each given conversation. Those two speakers (i.e. those two personalities) will thus interact with each other. The traits are categorized into three classes: positive (e.g., sweet, happy, eloquent, humble, perceptive, witty), neutral (e.g., old-fashioned, skeptical, solemn, questioning) and negative (e.g., anxious, childish, critical, fickle, frivolous). Examples of traits that we did not use are allocentric, insouciant, flexible, earthy and invisible, due to the difficulty of their interpretation with respect to conversation about an image. It was emphasized in the data collection instructions that the personality trait describes a trait of the speaker, not properties of the content of the image they are discussing.
The images used in our task are randomly selected from the YFCC100M dataset (https://multimediacommons.wordpress.com/yfcc100m-core-dataset/; Thomee et al., 2016). We use the same image choices as in Shuster et al. (2018) to build our training, validation and test sets.
For each image, we pick at random two personality traits, one for each speaker (A and B), and collect the dialogue using crowdworkers. Rather than pairing two crowdworkers, we chose to collect the data sequentially: one crowdworker is given personality A and asked to speak on turn 1, followed by another crowdworker with personality B on turn 2, and a further crowdworker again with personality A on turn 3. This setup, which has also been employed elsewhere (Budzianowski et al., 2018), has the advantage of simplicity. Systems for pairing crowdworkers are more difficult to set up, as both workers have to be online and ready to begin at the same time, and each crowdworker has to wait for the other to finish, making our setup much more efficient. Note that the speaker in turn 3 is the same as in turn 1 of the dialogue, but the utterance is likely authored by a different crowdworker: workers are asked to continue the conversation as if they were the same speaker. We nevertheless found the resulting conversations to be natural. Some examples from the training set are given in Figure 1.
The overall dataset statistics are given in Table 1. This is a fairly large dialogue dataset compared to other existing publicly available datasets. For example, PersonaChat Zhang et al. (2018) (which is not grounded in images) consists of 162k utterances, while IGC Mostafazadeh et al. (2017) (grounded in images) consists of a 4k test set only, compared to over 400k utterances in Image-Chat split between train, valid and test.
Table 1: Image-Chat dataset statistics.

|                      | Train   | Valid  | Test   |
|----------------------|---------|--------|--------|
| Number of Images     | 186,782 | 5,000  | 9,997  |
| Number of Dialogues  | 186,782 | 5,000  | 9,997  |
| Number of Utterances | 355,862 | 15,000 | 29,991 |
| Tokens per Utterance | 12.3    | 12.4   | 12.4   |
To build models, we make use of state-of-the-art existing methods for image captioning conditioned on personality Shuster et al. (2018) and extend those methods to the case of grounded dialogue. It was previously shown that state-of-the-art retrieval models outperform state-of-the-art generation models on such image captioning tasks Shuster et al. (2018) and on open-domain dialogue tasks Zhang et al. (2018), therefore we concentrate on developing retrieval models here.
The methods we try consist of several components. We use three sub-networks for the three modalities of input: (i) an image encoder, (ii) a dialogue history encoder, and (iii) a personality encoder. These are fed into a combiner module that merges the three modalities. Finally, a response encoder encodes candidate responses, and each candidate is scored against the combined input representation. An overview of the system, called TransResNet, is shown in Figure 2.
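The pipeline above can be sketched schematically in PyTorch. The encoder internals here are placeholder linear layers rather than the actual ResNet and Transformer encoders, and all dimensions and names are illustrative:

```python
import torch
import torch.nn as nn

class TransResNetSketch(nn.Module):
    """Schematic TransResNet: three modality encoders, a sum combiner,
    and dot-product scoring against candidate response encodings.
    Encoder internals are stand-ins (single linear layers / an embedding),
    not the actual ResNet / Transformer encoders."""

    def __init__(self, img_feat_dim=2048, text_dim=300, n_personalities=215, dim=500):
        super().__init__()
        self.image_enc = nn.Linear(img_feat_dim, dim)              # stands in for image features + projection
        self.personality_enc = nn.Embedding(n_personalities, dim)  # trait embedding
        self.dialogue_enc = nn.Linear(text_dim, dim)               # stands in for the Transformer encoder
        self.response_enc = nn.Linear(text_dim, dim)               # candidate response encoder

    def score(self, img_feats, personality_ids, dialogue_vec, cand_vecs):
        # Combine the three modality representations (sum combiner) ...
        ctx = (self.image_enc(img_feats)
               + self.personality_enc(personality_ids)
               + self.dialogue_enc(dialogue_vec))                  # (B, dim)
        cands = self.response_enc(cand_vecs)                       # (N, dim)
        # ... and score every candidate by dot product with the context.
        return ctx @ cands.t()                                     # (B, N)

model = TransResNetSketch()
scores = model.score(torch.randn(2, 2048), torch.tensor([3, 7]),
                     torch.randn(2, 300), torch.randn(10, 300))
print(scores.shape)  # torch.Size([2, 10])
```

The dot-product scoring against a fixed candidate set is what keeps retrieval efficient at inference time: candidate encodings can be precomputed once and reused across contexts.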
4.1 Image Encoder
We build our models on top of pretrained image features, and compare the performance of two types of image encoders. The first is a 152-layer residual network described in He et al. (2015), trained on ImageNet (Russakovsky et al., 2014) to classify images among 1000 classes, which we refer to in the rest of the paper as ResNet152 features. We used the implementation provided in the torchvision project (Marcel and Rodriguez, 2010). The second is a ResNeXt model (Xie et al., 2016) trained on 3.5 billion Instagram pictures following the procedure described by Mahajan et al. (2018), which we refer to in the rest of the paper as ResNeXt-IG-3.5B; the authors provided us with the weights of their trained model. The representation of an image is the output of the chosen image encoder.
To condition on a given personality trait, we embed each trait into a 500-dimensional vector to obtain its representation.
The entire dialogue history is encoded into a fixed-size vector using a Transformer architecture (Vaswani et al., 2017), followed by a linear layer. Transformers have previously been compared to simpler models such as bag-of-words encoders and shown to outperform them on image captioning (Shuster et al., 2018) and dialogue tasks (Mazaré et al., 2018). We use a Transformer with 4 layers, 300 hidden units, and 6 attention heads.
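A minimal sketch of such an encoder using PyTorch built-ins, with the configuration above (4 layers, 300 hidden units, 6 heads); mean pooling over tokens is an assumed aggregation step, and tokenization/embedding details are simplified to random inputs:

```python
import torch
import torch.nn as nn

# Small Transformer encoder for the dialogue history, followed by a
# linear layer mapping to the shared 500-d representation space.
layer = nn.TransformerEncoderLayer(d_model=300, nhead=6, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)
project = nn.Linear(300, 500)

tokens = torch.randn(2, 20, 300)            # (batch, seq_len, embedding_dim)
hidden = encoder(tokens)                    # (2, 20, 300)
dialogue_vec = project(hidden.mean(dim=1))  # fixed-size (2, 500) representation
print(dialogue_vec.shape)
```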
We pretrain the entire encoder following the setup described in Mazaré et al. (2018): we train two encoders on a next-utterance retrieval task over a dataset of dialogues containing 1.7 billion pairs of utterances, where one encoder encodes the context and the other the candidates for the next utterance; their dot product indicates the degree of match, and they are trained with a negative log-likelihood loss using sampled negatives. We then initialize our system using the weights of the candidate encoder only, and train on our task.
Multimodal combiner module
We consider two possible combiner modules:
Multimodal sum combiner (MM-sum): Given an input image, personality trait and dialogue with representations r_I, r_P and r_D, together with a candidate response with representation r_C, the score of the final combination is computed as s(I, P, D, C) = (r_I + r_P + r_D) · r_C. This is one of the simplest forms of combination, but turns out to be a strong baseline.
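Assuming per-modality representations of a common dimension, the sum combiner's scoring rule takes only a few lines; the 500-d dimension and the candidate count below are illustrative:

```python
import numpy as np

def mm_sum_score(r_image, r_personality, r_dialogue, r_candidates):
    """Sum-combiner score: (r_I + r_P + r_D) . r_C for each candidate."""
    context = r_image + r_personality + r_dialogue  # (dim,)
    return r_candidates @ context                   # (n_candidates,)

rng = np.random.default_rng(0)
dim = 500
scores = mm_sum_score(rng.standard_normal(dim), rng.standard_normal(dim),
                      rng.standard_normal(dim), rng.standard_normal((100, dim)))
best = int(np.argmax(scores))  # index of the highest-scoring candidate
print(scores.shape, best)
```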
Multimodal attention combiner (MM-att): A more sophisticated approach is to use an attention mechanism to choose which modalities are most relevant for each example by stacking Transformers. We feed the three representation vectors r_I, r_P and r_D into a second Transformer network (4 attention heads, 2 layers, 500 hidden units) which performs self-attention over the three inputs. The three modalities are thus reweighted by the corresponding attention weights to give the final representation vector.
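A sketch of the attention combiner: the three modality vectors are treated as a length-3 sequence that a small Transformer self-attends over; summing the reweighted outputs into one context vector is an assumed aggregation step:

```python
import torch
import torch.nn as nn

# Self-attention over the three modality representations
# (2 layers, 4 heads, 500 hidden units, as in the MM-att description).
layer = nn.TransformerEncoderLayer(d_model=500, nhead=4, batch_first=True)
combiner = nn.TransformerEncoder(layer, num_layers=2)

r_image, r_pers, r_dial = (torch.randn(2, 500) for _ in range(3))
modalities = torch.stack([r_image, r_pers, r_dial], dim=1)  # (B, 3, 500)
context = combiner(modalities).sum(dim=1)                   # (B, 500)
print(context.shape)
```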
We employ the same Transformer architecture as in the dialogue encoder for encoding candidate responses. We tried two variants: either sharing or not sharing the weights with the input dialogue encoder.
Training and Inference
Given a tuple (I, P, D) and a set of candidates {c_1, ..., c_N}, at inference time the predicted utterance is the candidate c_i that maximizes the score s(I, P, D, c_i). At training time, we pass the scores of a set of candidates through a softmax and train to maximize the log-likelihood of the correct responses. We use mini-batches of 500 training examples; for each example, the gold responses of the other examples in the batch serve as negatives. Hyperparameters are chosen on the validation set.
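The training objective with in-batch negatives can be sketched as cross-entropy over a batch score matrix; the representations below are random stand-ins for the actual encoder outputs:

```python
import torch
import torch.nn.functional as F

# In-batch negatives: each of the B contexts is scored against all B gold
# responses; the gold response of example i sits on the diagonal, and the
# other B-1 gold responses act as negatives. Cross-entropy over each row
# implements the softmax / log-likelihood training described above.
B, dim = 500, 500
contexts = torch.randn(B, dim)   # combined (image + personality + dialogue) reps
responses = torch.randn(B, dim)  # gold response representations

scores = contexts @ responses.t()  # (B, B) score matrix
labels = torch.arange(B)           # correct response index is the diagonal
loss = F.cross_entropy(scores, labels)
print(float(loss))
```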
[Table 2: R@1/100 on Image-Chat for TransResNet variants (combiner module, text encoders, image encoder), broken down by dialogue turn (Turn 1, Turn 2, Turn 3) and averaged over all turns.]
Table 3: R@1/100 for TransResNet (MM-Sum) ablations, by dialogue turn and averaged over all turns.

| Model                                       | Turn 1 | Turn 2 | Turn 3 | All  |
|---------------------------------------------|--------|--------|--------|------|
| Dialogue History Only                       | 1.0    | 33.7   | 32.3   | 22.3 |
| Personality + Dialogue (no image)           | 17.9   | 45.4   | 43.1   | 35.4 |
| Image + Dialogue (no personality)           | 37.6   | 39.4   | 32.6   | 36.5 |
| Image + Personality (no dialogue)           | 54.0   | 41.1   | 35.2   | 43.4 |
| Personality + Dialogue + Image (full model) | 53.2   | 51.4   | 44.3   | 49.6 |
We test our architectures on the Image-Chat dataset using automatic metrics and human evaluations. We additionally compute automatic metrics for variants of our architectures in which we remove certain modalities to determine the importance of each of the model’s inputs.
5.1 Automatic Evaluation on Image-Chat
We compare various configurations of our full TransResNet model, and compute recall at 1 and 5 (R@1/100 and R@5/100) retrieval metrics, where for each sample there are 100 candidates to rank: 99 random candidates chosen from the test set, and the true label. We additionally show the results for a simple information retrieval baseline, in which the candidates are ranked according to their weighted word overlap to the input message.
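The R@k metric over 100 candidates can be computed directly from the candidate scores; the scores below are synthetic, for illustration only:

```python
import numpy as np

def recall_at_k(scores, true_index, k):
    """1 if the true candidate is among the top-k scored candidates, else 0."""
    topk = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return int(true_index in topk)

# 100 candidates: 99 random distractors plus the true label, as in the
# evaluation setup above.
rng = np.random.default_rng(1)
scores = rng.standard_normal(100)
scores[0] = scores.max() + 1.0  # force candidate 0 to rank first
print(recall_at_k(scores, true_index=0, k=1),
      recall_at_k(scores, true_index=0, k=5))
```

Averaging this indicator over all test examples yields the reported R@1/100 and R@5/100 numbers.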
The results are shown in Table 2. We report the average metrics for the total task, as well as the breakdown of performance on each turn of dialogue (turns 1, 2 and 3). The average metrics indicate that the ResNeXt-IG-3.5B image encoder features improve performance significantly across the whole task: we obtain 50.3% R@1 for our best ResNeXt-IG-3.5B model, versus only 40.6% for our best ResNet152 model. When broken down by turn, the ResNeXt-IG-3.5B features appear particularly important in the first round of dialogue, in which only the image and personality are considered, as the difference between our best models increases from 9.7% on the full task to 19.5% in the first turn. Our baseline multimodal sum combiner (MM-Sum) outperforms the more sophisticated self-attention combiner (MM-Att), with the latter scoring 49.3% on the full task. Using separate candidate and dialogue history text encoders also works better than sharing their weights.
In general, model performance diminishes as the dialogue progresses; our best model's performance decreases from 54.0% in the first round, to 51.9% in the second round, and finally 44.8% in the third round. We believe this reflects the nature and difficulty of the task at each round of dialogue. In the first turn, the model only needs to condition on the image and personality, and the utterance is specific to the image; as the conversation evolves, however, it becomes harder to select the correct response, because there are more plausible responses that are potentially less directly related to the image itself, while the R@1 metric measures the model's ability to select the exact response used in the actual conversation.
We additionally compare variants of our TransResNet (MM-Sum) model where we remove modalities (image, personality, and dialogue history) and report R@1/100 for each dialogue turn independently, and the average over all turns.
The results are shown in Table 3. We note a number of interesting trends.
Turn 1: In the first round of dialogue, the model produces an utterance given the image and personality only, as there is no dialogue history in this case. The model obtains 37.4% R@1 using only the image, and 18.3% using only the personality, indicating that in isolation the image is more informative than the personality. Using both together, however, helps considerably.
Turn 2: In the second round of dialogue, in which the model produces a response to a first utterance, the model performs comparably when using only the image or only the dialogue history (28.1% and 33.7%, respectively), while performing poorly with the personality alone (15.3%). Any combination of two modalities improves the results by a similar margin, with the personality + dialogue combination performing slightly better than the other two. Using all three modalities works best.
Turn 3: By the third turn of dialogue, the conversation history proves more important in isolation than either of the other two modalities, with the model obtaining 32.3% using only the dialogue history, versus 20.7% and 17.0% using just the image or personality, respectively. Conditioning on personality + dialogue is the most effective combination of two modalities, yielding 43.1%, compared to 32.6% for image + dialogue and 35.2% for image + personality. Again, using all three modalities gives a further boost, reaching 44.3%.
5.2 Human Evaluations on Image-Chat
We also test our best model in human evaluation studies, measuring the engagingness of its responses as judged by humans.
In the same fashion as Shuster et al. (2018), we use a set of 500 images from YFCC100M that are not present in Image-Chat to build a set of three-round dialogues pairing humans with our best model in conversation. We then conduct evaluations at each round of dialogue for each example in the evaluation set: a separate set of human annotators views static conversation turns and compares two possible utterances for the next turn, given the image, dialogue history, and relevant personality. In a blind test, we ask the annotators to choose the "more engaging" of the two possible utterances: one from a human, and the other from our model. The personality on which the utterance is conditioned is shown to the annotator (and is fixed to be the same for both candidate responses).
Human annotation vs. TransResNet model
We compare human-authored utterances to those produced by our model. The human conversations are collected in the same fashion as in Image-Chat. Each model output is conditioned on the human-authored dialogue history, where the candidates for retrieval are utterances from the Image-Chat training set. The model is given a separate candidate set corresponding to each round of dialogue, e.g., when responding to a turn-1 utterance, the model's choices are limited to responses to turn-1 utterances in the training set. Both human and model outputs are conditioned on the same personality.
The results are shown in Figure 3. Our best TransResNet MM-Sum model from the automatic evaluations performs quite strongly on all three turns of the dialogue task when compared to human-authored utterances. In turn 1, the model has a win rate of 49.4% (the difference is not statistically significant), in line with the captioning results of Shuster et al. (2018). When continuing a conversation, the model has a win rate of 45.6% in the second round of dialogue, and 48.2% in the third (both wins for humans being significant). Example predictions of our model for turns 1, 2 and 3 can be seen in Figures 4, 5 and 6.
This paper presents an approach for improving the way machines can conduct conversations that humans find engaging. Focusing on the (generic) case of chit-chatting about a given image, our work shows that an end-to-end trained model can produce grounded dialogues that humans prefer over dialogues with fellow humans almost half of the time (47.7%). This result is made possible by the creation of a new dataset, Image-Chat, and a generalization of the model introduced in Shuster et al. (2018) to the case of dialogue. The dataset is made publicly available in ParlAI (http://parl.ai), and we look forward to further improvements being made by the community.
Our work shows that we are close to having models that humans can relate to in chit-chat conversations, which could lay new ground for social dialogue agents. The next challenge will be to combine this engagingness with other aspects required of such agents, such as domain expertise or task proficiency.
We thank Laurens van der Maaten, Arthur Szlam, Y-Lan Boureau, Pierre-Emmanuel Mazaré, Martin Raison, Alex Lebrun, Emily Dinan, Alexander Miller and Jack Urbanek for advice and discussions.
- Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and vqa. CVPR.
- Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
- Bordes et al. (2017) Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In Proceedings of the International Conference on Learning Representations (ICLR).
- Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.
- Chandrasekaran et al. (2017) Arjun Chandrasekaran, Devi Parikh, and Mohit Bansal. 2017. Punny captions: Witty wordplay in image descriptions. arXiv preprint arXiv:1704.08224.
- Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
- Das et al. (2017) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2.
- Faghri et al. (2017) Fartash Faghri, David J. Fleet, Ryan Kiros, and Sanja Fidler. 2017. VSE++: improved visual-semantic embeddings. CoRR, abs/1707.05612.
- Gan et al. (2017) Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, and Li Deng. 2017. Stylenet: Generating attractive visual captions with styles. In Proc IEEE Conf on Computer Vision and Pattern Recognition, pages 3137–3146.
- Gu et al. (2017) Jiuxiang Gu, Jianfei Cai, Shafiq R. Joty, Li Niu, and Gang Wang. 2017. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. CoRR, abs/1711.06420.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR, abs/1512.03385.
- Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272.
- Hu et al. (2014) Yuheng Hu, Lydia Manikonda, and Subbarao Kambhampati. 2014. What we instagram: A first analysis of instagram photo content and user types. In Eighth International AAAI Conference on Weblogs and Social Media.
- Huber et al. (2018) Bernd Huber, Daniel McDuff, Chris Brockett, Michel Galley, and Bill Dolan. 2018. Emotional dialogue generation using image-grounded language models. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 277. ACM.
- Mahajan et al. (2018) Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. 2018. Exploring the limits of weakly supervised pretraining. CoRR, abs/1805.00932.
- Marcel and Rodriguez (2010) Sébastien Marcel and Yann Rodriguez. 2010. Torchvision the machine-vision package of torch. In Proceedings of the 18th ACM International Conference on Multimedia, MM ’10, pages 1485–1488, New York, NY, USA. ACM.
- Mathews et al. (2018) Alexander Mathews, Lexing Xie, and Xuming He. 2018. Semstyle: Learning to generate stylised image captions using unaligned text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8591–8600.
- Mathews et al. (2016) Alexander Patrick Mathews, Lexing Xie, and Xuming He. 2016. Senticap: Generating image descriptions with sentiments. In AAAI, pages 3574–3580.
- Mazaré et al. (2018) Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. ArXiv e-prints.
- Mostafazadeh et al. (2017) Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios P Spithourakis, and Lucy Vanderwende. 2017. Image-grounded conversations: Multimodal context for natural question and response generation. arXiv preprint arXiv:1701.08251.
- Nam et al. (2016) Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. 2016. Dual attention networks for multimodal reasoning and matching. CoRR, abs/1611.00471.
- Pasunuru and Bansal (2018) Ramakanth Pasunuru and Mohit Bansal. 2018. Game-based video-context dialogue. arXiv preprint arXiv:1809.04560.
- Russakovsky et al. (2014) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. 2014. Imagenet large scale visual recognition challenge. CoRR, abs/1409.0575.
- Shuster et al. (2018) Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. 2018. Engaging image captioning via personality. arXiv preprint arXiv:1810.10665.
- Sordoni et al. (2015) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.
- Thomee et al. (2016) Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. Yfcc100m: The new data in multimedia research. Commun. ACM, 59(2):64–73.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In ICML Deep Learning Workshop.
- Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164.
- Xie et al. (2016) Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2016. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057.
- Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
- Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.