With the development of instant messaging technology in recent decades, the medium of online conversation has expanded from plain text to a variety of visual modalities (e.g., images, GIF animations, short videos). Similar to communication through messenger tools (e.g., Facebook Messenger, WhatsApp, WeChat) in daily life, a good intelligent conversational agent should not only converse freely with plain text, but also be able to perceive and share the real physical world, which is an important pursuit of artificial intelligence (AI). Although some recent large-scale pre-trained text-only dialogue generation models, such as DialoGPT Zhang et al. (2020), Blender Roller et al. (2021), and Meena Adiwardana et al. (2020), have shown excellent performance, they still cannot rely exclusively on plain text to fully simulate the rich experience of visual perception.
Recently, various vision-language tasks have been introduced and have attracted widespread attention, such as visual question answering Ren et al. (2015); Gao et al. (2015); Lu et al. (2016); Anderson et al. (2018); Li et al. (2019a); Huang et al. (2020), image captioning Xu et al. (2015); Anderson et al. (2016); Ghanimifard and Dobnik (2019); Cornia et al. (2020), image-grounded dialogue Das et al. (2017); Yang et al. (2021); Agarwal et al. (2020); Qi et al. (2020); Chen et al. (2021); Liang et al. (2021), and photo sharing Zang et al. (2021). Specifically, photo sharing, which aims to select and share an image based on the textual context, is a challenging task: it requires models to understand a background story that is complemented by human imagination, rather than to locate related visual objects or explicitly mention the main visible content of an image as previous works do. Zang et al. (2021) propose a retrieval-based method to address this challenge. However, the performance of a retrieval-based method in a specific domain is limited by the size of the pre-constructed conversational history repository, especially for long-tail contexts that are not covered in the history, where the set of image responses of a retrieval system is also fixed. A better way to make an image reply appropriate for the context is to generate a new one accordingly. With the prosperity of pre-trained networks powered by deep learning, image generation models have achieved good flexibility and high-resolution quality Ramesh et al. (2021); Esser et al. (2021); Ding et al. (2021).
In this paper, we formulate a new problem: Multimodal Dialogue Response Generation (MDRG). As shown by the conversations in Figure 1, given the dialogue context, the model should decide whether to generate an informative text or a high-resolution image as the response. We argue that there are still hindrances to applying such a model in real scenarios, since (1) a sophisticated neural end-to-end architecture would simply overfit to the very few well-annotated training data (e.g., the few existing multimodal dialogues, about 8k in total); evidence for this is that when discussing contents out of the domain of the training data, its performance drops dramatically, as will be seen in our experiments; and (2) since human effort is expensive, it is difficult to collect enough training data for a new domain. Based on the above facts, we take a step further and extend the assumption of MDRG to a low-resource setting where only a few multimodal dialogues containing both texts and images are available.
To tackle the above challenges, we make the first attempt to incorporate text-to-image generation into text-only open domain dialogue generation. The key idea is to make the parameters that rely on multimodal dialogues small and independent by disentangling textual dialogue response generation and text-to-image generation, so that the major part of the generation model can be learned from text-only dialogues and <text description, image> pairs, which are much easier to obtain. Specifically, we present Divter, a novel conversational agent powered by large-scale visual world experiences. As shown in Figure 2, Divter is made up of two Transformer-based Vaswani et al. (2017a) components: a textual dialogue response generator and a text-to-image generator. The textual dialogue response generator takes the dialogue context as input, determines the modality of the following response, and generates either a textual response or a textual description of an image. The text-to-image generator takes the textual image description as a condition and generates a realistic, consistent, high-resolution image as the final response. The two components are independent of each other, and thus can be pre-trained using text-only dialogues and <text description, image> pairs respectively. The end-to-end Divter depends on multimodal dialogues constructed as <text dialogue context, text response / <text description, image>> tuples, but the joint learning and estimation of the two components only require a few training examples from the specific domain or task. By fine-tuning the pre-trained parameters, we can adapt the model to a new domain at only a small cost.
Contributions of this work are three-fold:
To the best of our knowledge, this is the first work on multimodal dialogue response generation. We explore the task under a low-resource setting where only a few multimodal dialogues are assumed to be available.
We present Divter, a novel conversational agent consisting of two flexible components, which can effectively understand dialogue context and accordingly generate informative text and high-resolution image responses.
Extensive experiments on PhotoChat Corpus Zang et al. (2021) indicate the effectiveness of Divter.
2 Related Work
2.1 Textual Dialogue Response Generation
End-to-end response generation for textual open domain dialogues is inspired by the successful application of neural sequence-to-sequence models to machine translation Sutskever et al. (2014). On top of the basic architecture Shang et al. (2015); Vinyals and Le (2015), the vanilla encoder-decoder method has been widely extended to address key challenges in open-domain dialogue systems, including improving the diversity of responses Li et al. (2016a); Zhao et al. (2017); Tao et al. (2018), modeling conversation contexts Serban et al. (2016); Xing et al. (2017); Zhang et al. (2019), controlling attributes of responses See et al. (2019); Zhou et al. (2018), biasing responses towards specific personas Li et al. (2016b); Zhang et al. (2018), incorporating extra knowledge into generation Dinan et al. (2019); Ghazvininejad et al. (2018); Kim et al. (2020), and building general pre-trained agents Adiwardana et al. (2020); Zhang et al. (2020); Roller et al. (2021). Different from previous works on open domain dialogue response generation that converse freely with plain text, our work lies in the research of multimodal response generation, where either an image response or a text response is generated for a given dialogue context.
2.2 Text-to-Image Generation
Text-to-image generation has been studied extensively. Mansimov et al. (2016) showed that the DRAW generative model Gregor et al. (2015) could generate images from natural language descriptions. Reed et al. (2016) proposed a generative adversarial network to improve image fidelity. Subsequent work has continued to optimize the generation architecture, for example with stacked generators Zhang et al. (2017), attentional networks Xu et al. (2018), and extra knowledge Li et al. (2019b). Nguyen et al. (2017) provided a unified probabilistic interpretation of related activation-maximization methods to produce high-quality images at higher resolutions. Separately, Cho et al. (2020) used uniform masking with a large range of masking ratios and aligned suitable pre-training datasets to suitable objectives. More recently, Ramesh et al. (2021) proposed a transformer-based method that autoregressively models the text and image tokens as a single stream of data; their model was pre-trained on a large-scale dataset of 250 million text-image pairs and has shown competitive performance with previous domain-specific models when evaluated in a zero-shot fashion. For our multimodal response generation scenario, we use the text description of an image to bridge the above textual dialogue generation and text-to-image generation models, where the text description is the output of the former and the input of the latter in a low-resource setting.
In the following sections, we elaborate our approach to learning a multimodal response generation model with multimodal dialogues $\mathcal{D}_S$, text-only dialogues $\mathcal{D}_C$, and <text description, image> pairs $\mathcal{D}_P$.
3 Problem Formalization
Suppose that we have a dataset $\mathcal{D}_S = \{(U_i, R_i)\}_{i=1}^{n}$, where $\forall i \in \{1, \ldots, n\}$, $U_i = \{u_{i,1}, \ldots, u_{i,l_i}\}$ is the context of a dialogue with $u_{i,j}$ the $j$-th utterance, and $R_i$ is the response regarding to $U_i$. Specifically, $R_i$ has two types: a text $r_i$, or a pair $(c_i, I_i)$ where $I_i$ is an image with $c_i$ its text description. In addition to $\mathcal{D}_S$, we further assume that there are $\mathcal{D}_C = \{(U_j, r_j)\}_{j=1}^{N}$ and $\mathcal{D}_P = \{(c_k, I_k)\}_{k=1}^{M}$, where $(U_j, r_j)$ is a (dialogue context, text response) pair and $(c_k, I_k)$ is a (text description, image) pair, with $N \gg n$ and $M \gg n$.
The goal is to learn a generation model $P(R \mid U; \theta)$ ($\theta$ denotes the parameters of the model) with $\mathcal{D} = \mathcal{D}_S \cup \mathcal{D}_C \cup \mathcal{D}_P$. Thus, given a new dialogue context $U$, one can generate a text response $r$ or an image response $I$ following $P(R \mid U; \theta)$.
4 Approach

Our idea is inspired by observations on the nature of real-world open domain multimodal dialogues: (1) even though a multimodal dialogue contains an image $I$ somewhere, the utterances in the dialogue are not always related to $I$; (2) the main content and semantics of an image can often be described by a synonymous text. Therefore, given a dialogue context, we postulate that the formation of a response can be decomposed into four uncorrelated actions: (i) determining whether to (a) generate a text response or (b) generate an image response; (ii) if (a), generating a text response; if (b), (iii) generating a text description and (iv) generating a synonymous image response with the text description as a condition. All the actions can be learned independently, which becomes the key to aiding the small $\mathcal{D}_S$ with the large $\mathcal{D}_C$ and $\mathcal{D}_P$.
Figure 2 illustrates the architecture of our model. The model is made up of two components: a textual dialogue response generator $\mathcal{G}$ and a text-to-image generator. In the rest of this section, we elaborate these two modules in detail.
4.1 Textual Dialogue Response Generator
The textual dialogue response generator $\mathcal{G}$ (for actions (i)-(iii)) is a sequence-to-sequence model based on the Transformer architecture Vaswani et al. (2017b); it consists of a 24-layer Transformer with a hidden size of 1024 and 16 attention heads. Specifically, given a dialogue context $U$ as the source, if the target is a text response $r = (r_1, \ldots, r_T)$ with $r_t$ the $t$-th word in the sequence, the generation loss is defined by

$$\mathcal{L}_{\mathcal{G}} = -\sum_{t=1}^{T} \log P(r_t \mid U, r_{<t}),$$
and if the target is a text image description $c = (c_1, \ldots, c_{T'})$, preceded by a special tag indicating that the next text sequence is a description, the generation loss is defined by

$$\mathcal{L}_{\mathcal{G}} = -\sum_{t=1}^{T'} \log P(c_t \mid U, c_{<t}).$$
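Both losses above are standard token-level negative log-likelihoods over the target sequence (a text response, or an image description preceded by a special tag). A minimal sketch of computing such a loss with a HuggingFace-style causal language model follows; the model and tokenizer are placeholders, and the [SEP] separator and [DST] description tag are assumed token names, not necessarily the ones used in the actual implementation.

```python
import torch
import torch.nn.functional as F

def generation_loss(model, tokenizer, context, target, is_description=False,
                    sep_token="[SEP]", desc_tag="[DST]"):
    """Token-level NLL of a target (text response or image description)
    conditioned on the dialogue context. `desc_tag` is a hypothetical
    special token signalling that the target is an image description."""
    prefix = sep_token.join(context) + sep_token
    if is_description:
        prefix += desc_tag  # the next sequence is a description, action (iii)
    src_ids = tokenizer.encode(prefix)
    tgt_ids = tokenizer.encode(target)
    input_ids = torch.tensor([src_ids + tgt_ids])
    logits = model(input_ids).logits                 # (1, seq_len, vocab_size)
    # only target positions contribute: position t predicts token t + 1
    tgt_logits = logits[0, len(src_ids) - 1 : -1]    # (len(tgt_ids), vocab_size)
    tgt_labels = torch.tensor(tgt_ids)
    return F.cross_entropy(tgt_logits, tgt_labels)
```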
Inference Given a new dialogue context $U$, when a generated image description occurs, it is fed into the following text-to-image generator and then reconstructed into an image as the final response.
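Putting the two components together at inference time, the routing is simple: the dialogue generator decodes a sequence; if the sequence is tagged as an image description, the description is handed to the text-to-image generator, otherwise the decoded text is returned directly. A sketch under the same hypothetical tag convention, with both generators left abstract:

```python
def respond(dialogue_generator, text_to_image_generator, context, desc_tag="[DST]"):
    """Return either a text response or an image response for a dialogue context.
    The `generate` methods and the `desc_tag` marker are illustrative placeholders."""
    decoded = dialogue_generator.generate(context)   # a decoded string
    if decoded.startswith(desc_tag):                 # modality decision, action (i)
        description = decoded[len(desc_tag):].strip()
        image = text_to_image_generator.generate(description)  # action (iv)
        return {"modality": "image", "description": description, "image": image}
    return {"modality": "text", "text": decoded}     # action (ii)
```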
4.2 Text-to-Image Generator
The text-to-image generator (for action (iv)) consists of an image representation module $\mathcal{V}$ and a text-to-image transformer module $\mathcal{F}$.
4.2.1 Image Representation Modeling
Inspired by the success of VQGAN Esser et al. (2021) and DALL·E Ramesh et al. (2021), to utilize the highly expressive transformer architecture for image generation, we need to express the constituents of an image in the form of a sequence.
Following Esser et al. (2021), we use a learned discrete codebook $\mathcal{Z} = \{z_k\}_{k=1}^{K} \subset \mathbb{R}^{n_z}$ to represent an image, such that any image $I \in \mathbb{R}^{H \times W \times 3}$ with height $H$ and width $W$ can be represented by a spatial collection of codebook entries $z_{\mathbf{q}} \in \mathbb{R}^{h \times w \times n_z}$, where $n_z$ is the dimensionality of the codes. By this means, the equivalent representation of an image is a sequence of $h \cdot w$ indices which specify the respective entries in the learned, discrete codebook $\mathcal{Z}$. We then take a convolutional model consisting of an encoder $E$ and a decoder $G$, trained together, to learn to represent images with codes from $\mathcal{Z}$. The function of $\mathcal{Z}$ is similar to that of word embeddings Mikolov et al. (2013); Pennington et al. (2014); Peters et al. (2018) in natural language processing (NLP); this helps the model to align the cross-modal representations between text and visual regions.
Specifically, given an image $I$, we use the encoder $E$ to produce its spatial representation $\hat{z} = E(I) \in \mathbb{R}^{h \times w \times n_z}$; then, under the action of an element-wise quantization $\mathbf{q}(\cdot)$, each spatial code $\hat{z}_{ij}$ is mapped to its closest codebook entry $z_k$:

$$z_{\mathbf{q}} = \mathbf{q}(\hat{z}) = \left( \arg\min_{z_k \in \mathcal{Z}} \| \hat{z}_{ij} - z_k \| \right)_{ij} \in \mathbb{R}^{h \times w \times n_z}.$$
Then, the reconstruction $\hat{I}$ can be approximated by

$$\hat{I} = G(z_{\mathbf{q}}).$$
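The nearest-neighbour quantization step can be sketched as follows with a toy codebook; shapes and names are illustrative, not the exact VQGAN implementation:

```python
import torch

def quantize(z_hat, codebook):
    """Map each spatial code to its closest codebook entry.
    z_hat:    (h, w, n_z) encoder output E(I)
    codebook: (K, n_z) learned discrete codebook Z
    returns:  (h, w) index map s and (h, w, n_z) quantized codes z_q"""
    h, w, n_z = z_hat.shape
    flat = z_hat.reshape(-1, n_z)             # (h*w, n_z)
    dists = torch.cdist(flat, codebook)       # (h*w, K) pairwise L2 distances
    indices = dists.argmin(dim=1)             # closest entry per spatial code
    z_q = codebook[indices].reshape(h, w, n_z)
    return indices.reshape(h, w), z_q

# toy usage: a 16x16 grid of 256-d codes and a codebook with 1024 entries
codebook = torch.randn(1024, 256)
z_hat = torch.randn(16, 16, 256)
s, z_q = quantize(z_hat, codebook)  # s is the index sequence fed to the transformer
```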
We model $E$, $G$, and $\mathcal{Z}$ together as an image representation module $\mathcal{V}$; the learning details of $\mathcal{V}$ can be found in Esser et al. (2021), who also adopt a sliding-window approach to generate high-resolution (e.g., 256 × 256 pixels) images.
4.2.2 Text-to-Image Transformer
The text-to-image transformer $\mathcal{F}$ is also a sequence-to-sequence model based on the Transformer architecture; it consists of a 24-layer Transformer with a hidden size of 1024 and 16 attention heads. Specifically, consider an image $I$ and its text description $c$. With $E$ and $\mathcal{Z}$ available, we can represent $I$ in terms of the codebook indices of its embeddings: the quantized embedding of $I$ is $z_{\mathbf{q}} = \mathbf{q}(E(I))$ given by Eq. (5), which is equivalent to a sequence $s$ of indices from the codebook $\mathcal{Z}$. Thus we can recover $z_{\mathbf{q}}$ by mapping the indices in $s$ back to their corresponding codebook entries. We then concatenate $c$ and $s$ into a single stream

$$x = [c\,;\,s],$$

and train an autoregressive transformer to model the joint distribution over the text and image tokens; the generation loss is defined by

$$\mathcal{L}_{\mathcal{F}} = -\sum_{t=1}^{|x|} \log P(x_t \mid x_{<t}).$$
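A sketch of how the concatenated text-and-image-token stream could be scored autoregressively; the transformer is left abstract, and the joint-vocabulary layout (image indices shifted past the text vocabulary) is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def text_to_image_loss(transformer, text_ids, image_indices, text_vocab_size):
    """NLL of the single stream x = [c ; s], with codebook indices mapped into a
    joint vocabulary after the text tokens (an assumed layout)."""
    image_ids = image_indices.flatten() + text_vocab_size   # shift into joint vocab
    x = torch.cat([text_ids, image_ids]).unsqueeze(0)       # (1, |c| + h*w)
    logits = transformer(x)                                  # (1, seq_len, joint_vocab)
    # each position predicts the next token of the stream
    return F.cross_entropy(logits[0, :-1], x[0, 1:])
```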
Inference Given a new text description $c$, we leverage the text-to-image generator to reconstruct its synonymous image. More precisely, the text-to-image transformer $\mathcal{F}$ first takes $c$ as input and generates its image representation, i.e., the index sequence $s$; the indices are then mapped to their closest codebook entries in $\mathcal{Z}$ to obtain $z_{\mathbf{q}}$; finally, the image decoder $G$ reconstructs $z_{\mathbf{q}}$ into an image $\hat{I}$.
4.3 Learning Details
Let us denote $\theta_{\mathcal{G}}$, $\theta_{\mathcal{V}}$, and $\theta_{\mathcal{F}}$ as the parameters of the textual dialogue response generator $\mathcal{G}$, the image representation module $\mathcal{V}$, and the text-to-image transformer module $\mathcal{F}$, respectively. In the pre-training stage, we use text-only dialogues $\mathcal{D}_C$ to estimate $\theta_{\mathcal{G}}$, use ImageNet Deng et al. (2009) to estimate $\theta_{\mathcal{V}}$, and use <text description, image> pairs $\mathcal{D}_P$ to estimate $\theta_{\mathcal{F}}$. Then, in the fine-tuning stage for the MDRG scenario, we fix $\theta_{\mathcal{V}}$ and jointly fine-tune $\theta_{\mathcal{G}}$ and $\theta_{\mathcal{F}}$ with $\mathcal{D}_S$, so the final objective of our multimodal response generation model is to minimize the integrated loss

$$\mathcal{L} = \mathcal{L}_{\mathcal{G}} + \lambda \, \mathcal{L}_{\mathcal{F}},$$
where $\lambda$ is a hyper-parameter.
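A sketch of one joint fine-tuning step on a multimodal dialogue example, with the image representation module kept frozen and λ = 0.2 as reported in Section 5; the module interfaces (`loss`, `encode`) are hypothetical placeholders:

```python
import torch

def fine_tune_step(batch, dialogue_generator, text_to_image_transformer,
                   image_module, optimizer, lam=0.2):
    """One joint fine-tuning step: V stays frozen, only G and F are updated."""
    for p in image_module.parameters():
        p.requires_grad = False                      # theta_V is fixed

    context, response = batch["context"], batch["response"]
    if response["type"] == "text":
        loss = dialogue_generator.loss(context, response["text"])
    else:  # image response: description loss plus text-to-image loss
        desc, image = response["description"], response["image"]
        loss_g = dialogue_generator.loss(context, desc, is_description=True)
        with torch.no_grad():
            indices, _ = image_module.encode(image)  # codebook indices of the image
        loss_f = text_to_image_transformer.loss(desc, indices)
        loss = loss_g + lam * loss_f                 # the integrated objective above

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```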
Remarks. In this work, we mainly focus on integrating text and image response generation, but our proposed approach actually provides a recipe for a general solution to low-resource multimodal dialogue response generation, in which the target modality (if not text) could be GIFs, videos, or speech sounds, etc. To do that, one only needs to modify the text-to-image generator to make it compatible with the specific modality type, and then pre-train a new text-to-<target modality> generator.
5 Experiments

5.1 Dataset

To evaluate the performance of Divter, we conduct comprehensive experiments on the PhotoChat dataset released by Zang et al. (2021), a multi-turn multimodal conversation dataset consisting of 10,917 unique images and 12,286 dialogues, each of which is paired with a user image that is shared during the conversation, and each image is paired with its text description. The dialogues fall into four categories: people, food, animals, and daily products. The dataset has been split into 10,286 train, 1,000 dev, and 1,000 test dialogues.
5.2 Evaluation Metrics
We conduct evaluation with both automatic metrics and human judgements. For automatic evaluation, we focus on four aspects: (1) Image Response Intent Prediction, whose goal is to predict whether an image should be produced in the next turn for a given context; (2) Text Description Generation; (3) Image Generation Quality; (4) Text Response Generation. For (1), we follow Zang et al. (2021), who formulate the problem as a binary classification task, and use F1 as the metric; for (2) and (4), we use PPL, BLEU (Papineni et al., 2002), Rouge (Lin, 2004), and F1; for (3), we follow Ramesh et al. (2021) and use Fréchet Inception Distance (FID) and Inception Score (IS).
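For illustration, the image-quality metrics can be computed with the torchmetrics package as in the minimal sketch below; the image tensors here are random placeholders, and this is not the authors' evaluation code:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# stand-in batches of ground-truth and generated images, uint8, shape (N, 3, H, W)
real = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()
inception.update(fake)
is_mean, is_std = inception.compute()
print("IS:", is_mean.item(), "+/-", is_std.item())
```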
Table 1 (excerpt): ablation results of Divter variants on PhotoChat. Numbers in bold indicate that the improvement over the best baseline is statistically significant (t-test, p-value < 0.01).

| Models | Intent (F1) | Text Description Generation (PPL / BLEU / Rouge / F1) | Image Generation (FID / IS) | Text Response Generation (PPL / BLEU / Rouge / F1) |
|---|---|---|---|---|
| Divter (w/o textual dialogue pre-training) | 47.3 | 122.56 / 1.99 / 0.23 / 2.60 | 29.78 / 15.5 ± 0.5 | 153.62 / 4.82 / 0.53 / 3.83 |
| Divter (w/o text-to-image pre-training) | 55.9 | 5.23 / 15.01 / 11.20 / 15.63 | 262.09 / 4.9 ± 0.7 | 63.76 / 6.28 / 1.51 / 5.40 |
| Divter (w/o both pre-training) | 47.1 | 128.87 / 1.75 / 0.21 / 2.38 | 254.31 / 5.2 ± 0.6 | 163.85 / 4.53 / 0.48 / 3.55 |
| Divter (w/o joint learning) | 55.6 | 5.20 / 15.00 / 11.36 / 15.73 | 29.04 / 15.4 ± 0.6 | 59.21 / 6.47 / 1.58 / 5.63 |
Table 2 (excerpt): human evaluation results.

| Models | Context Coherence | Fluency | Image Quality | Background Consistency | Kappa |
|---|---|---|---|---|---|
| Divter (w/o pre-train) | 0.94 | 1.56 | 0.61 | 0.35 | 0.66 |
For human evaluation, we randomly sample 200 dialogue contexts and the corresponding generated responses (100 text responses and 100 image responses) from PhotoChat for Divter and its variant without any pre-training. Three human annotators are asked to score the response quality on a scale of {0, 1, 2} from four aspects: (1) Context Coherence: whether the text response is coherent with the context and guides the following utterances; (2) Fluency: whether the text response is natural and fluent; (3) Image Quality: the quality (including definition and integrity) of the image response; (4) Background Consistency of Image: for each dialogue, we select the top-8 generated/retrieved images and ask the annotators to decide whether the group is consistent with the dialogue background; a qualitative assessment is also shown in Table 5. We report the average scores over the three annotators; a higher score means better quality. The agreement among the annotators is measured via Fleiss' Kappa Fleiss (1971).
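As an illustration, agreement over such three-annotator ratings could be computed with statsmodels; the rating matrix below is made up, and this is only a sketch, not the authors' analysis script:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = rated items, columns = the three annotators, values = scores in {0, 1, 2}
ratings = np.array([
    [2, 2, 1],
    [1, 1, 1],
    [0, 1, 0],
    [2, 2, 2],
])

# convert per-rater scores into per-category counts, then compute Fleiss' Kappa
counts, _ = aggregate_raters(ratings)
print("Fleiss' Kappa:", fleiss_kappa(counts))
```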
5.3 Implementation Details
For the textual dialogue response generator $\mathcal{G}$, we use DialoGPT Zhang et al. (2020) as the pre-trained model initialization, which has been trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017. In the fine-tuning stage, we concatenate the context turns with the token [SEP] into a single sequence; we adopt the Adam optimizer with an initial learning rate of 1e-5 and a batch size of 256, and the training on PhotoChat is conducted on 16 Nvidia Tesla V100 32G GPU cards. We use beam search (beam size 5) to decode the text sequence.
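A minimal sketch of decoding with a DialoGPT-style generator via the HuggingFace transformers library, mirroring the beam-search setting above; the public DialoGPT-large checkpoint is used here instead of the fine-tuned model, and the context turns are joined with the tokenizer's EOS token since DialoGPT has no [SEP] by default:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

# concatenate the context turns into a single sequence with a separator token
context = ["Do you like ice cream?", "I love it, especially in summer!"]
input_ids = tokenizer(tokenizer.eos_token.join(context) + tokenizer.eos_token,
                      return_tensors="pt").input_ids

# beam search decoding with beam size 5, as in the setup above
output = model.generate(input_ids, max_new_tokens=40, num_beams=5,
                        pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```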
For the image representation module $\mathcal{V}$, we adopt the pre-trained VQGAN model Esser et al. (2021), which has been trained on 10M ImageNet images; the model enables image generation beyond a resolution of 256 × 256 pixels.
For the text-to-image transformer $\mathcal{F}$, we randomly select 5M (categorical text description, image) pairs from ImageNet and 10M (text description, image) pairs from YFCC100M Thomee et al. (2016) as training data; we set the text description length to 32 and the codebook-indices length to 256 (16×16), and concatenate them with [SEP] into a sequence of length 289. We then pre-train the model for 3.5 million steps with a batch size of 256 accumulated on 16 Nvidia Tesla V100 32G GPU cards. In the fine-tuning stage, we train on PhotoChat for 50,000 steps. In the inference stage, we use CLIP Radford et al. (2021) to rerank the 256 generated samples: CLIP assigns a score based on how well each image matches the description, and we select the best one as the response.
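The CLIP reranking step can be sketched with the HuggingFace CLIP implementation; the checkpoint name and the solid-colour candidate images are illustrative, not the paper's exact setup:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank(description, candidate_images):
    """Return the candidate image that CLIP scores as best matching the description."""
    inputs = processor(text=[description], images=candidate_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # one score per image
    return candidate_images[scores.argmax().item()]

# usage: pick the best of several candidate images for one description
candidates = [Image.new("RGB", (256, 256), c) for c in ("red", "green", "blue")]
best = rerank("a photo of a green meadow", candidates)
```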
In the joint learning scenario, we first train $\mathcal{G}$ for 48,000 steps, then jointly train $\mathcal{G}$ and $\mathcal{F}$ for 2,000 steps. The $\lambda$ in Eq. 11 is 0.2. We discard sentences with the prefix "The photo has your *" in descriptions. Early stopping on validation is adopted as a regularization strategy. All the hyper-parameters are determined by grid search.
5.4 Baselines

The following two pre-trained models are selected as baselines for the "Image Response Intent Prediction" task in Section 5.2:
BERT-base BERT Devlin et al. (2019) is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
T5-3B T5 Raffel et al. (2020) is a pretrained model which has achieved remarkable performance in many NLP tasks.
We adopt the above models as baselines and use the results reported by Zang et al. (2021).
5.5 Evaluation Results
As shown in Table 1, Divter not only achieves comparable performance with the state-of-the-art retrieval-based image response intent prediction model, but also achieves remarkable performance on all the generation metrics. This indicates that Divter can accurately judge the timing of generating an image response for a given dialogue context, produce text responses that are coherent with the context, and generate high-quality image responses. The significant performance gap between Divter and the models without pre-training indicates the superiority of our proposed learning strategy. Table 2 reports the results of human evaluation; Divter also significantly outperforms the baselines on most of the aspects.
5.6 Ablation Study
We conduct extensive ablation experiments over different variants to better understand their relative importance to the MDRG task. As shown in Table 1, all the variants lead to worse performance on most of the metrics. For a more intuitive comparison, qualitative results are also shown in Table 3. In particular, both the quantitative and qualitative results of the ablation study validate that: (1) pre-training is crucial to low-resource multimodal dialogue response generation, since removing any component from pre-training causes a performance drop when the training data is small; (2) in terms of impact on image generation performance, pre-training the text-to-image generator matters more than pre-training the textual dialogue response generator, while in terms of impact on text generation performance the opposite holds; (3) joint learning also contributes to Divter, indicating that the integrated learning of textual context and visual image through the shared description benefits both modalities more than learning either one in isolation.
5.7 Case Study
To further investigate the quality of the multimodal responses generated by Divter, we show two examples from the PhotoChat test data in Table 4. The given context of the first is about "ice cream", and the second is about "honey bee". As we can see, Divter can not only generate a realistic high-resolution image that is coherent with the background, but also generate informative text responses grounded on the image. Moreover, the quality of the generated images is comparable to that of the real-world ground truths, which demonstrates the practicability of Divter.
Benefits over retrieval-based methods To further investigate and compare the generalization capability of Divter and the retrieval-based method, we also sample the top-8 generated images from Divter and the equivalent retrieved images from the SCAN model Lee et al. (2018). As shown in Table 5, on the one hand, the diversity and richness of the generated images are desirable; on the other hand, the retrieved results often suffer from poor consistency with the dialogue background. For example, in the second case the dialogue is talking about "coffee", but the retrieved images contain uncorrelated objects such as "milk", "cake", "dog" and "snack". In the third example, all the retrieval results are mistaken, since there are few "curtain" images in the training and retrieval space. This demonstrates that the performance of the retrieval-based method is extremely limited in a specific domain by the size of the pre-constructed conversational history repository, especially in the low-resource scenario, while our proposed generation-based method shows better generalization capability in tackling the low-resource challenge.
6 Conclusion

In this paper, we explore multimodal dialogue response generation under a low-resource setting. To overcome the challenges of the new task and insufficient training data, we propose Divter, a neural conversational agent that incorporates text-to-image generation into text-only dialogue response generation, in which most parameters no longer rely on multimodal training data and can instead be estimated from large-scale textual open domain dialogues and <text description, image> pairs. Extensive experiments demonstrate that Divter achieves state-of-the-art results in both automatic and human evaluation. In the future, we will explore more efficient methods to inject more modalities into response generation.
- Adiwardana et al. (2020) Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a human-like open-domain chatbot.
- Agarwal et al. (2020) Shubham Agarwal, Trung Bui, Joon-Young Lee, Ioannis Konstas, and Verena Rieser. 2020. History for visual dialog: Do we really need it? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8182–8197, Online. Association for Computational Linguistics.
- Anderson et al. (2016) Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In ECCV.
- Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
- Chen et al. (2021) Feilong Chen, Fandong Meng, Xiuyi Chen, Peng Li, and Jie Zhou. 2021. Multimodal incremental transformer with visual grounding for visual dialogue generation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 436–446, Online. Association for Computational Linguistics.
- Cho et al. (2020) Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Aniruddha Kembhavi. 2020. X-lxmert: Paint, caption and answer questions with multi-modal transformers. In EMNLP.
- Cornia et al. (2020) Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2020. Meshed-Memory Transformer for Image Captioning. In CVPR.
- Das et al. (2017) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, page 326–335.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Dinan et al. (2019) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.
- Ding et al. (2021) Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. 2021. Cogview: Mastering text-to-image generation via transformers.
- Esser et al. (2021) Patrick Esser, Robin Rombach, and Bjorn Ommer. 2021. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12873–12883.
- Esser et al. (2020) Patrick Esser, Robin Rombach, and Björn Ommer. 2020. Taming transformers for high-resolution image synthesis.
- Fleiss (1971) Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76:378–382.
- Gao et al. (2015) Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? dataset and methods for multilingual image question answering. In NIPS, page 2296–2304.
- Ghanimifard and Dobnik (2019) Mehdi Ghanimifard and Simon Dobnik. 2019. What goes into a word: generating image descriptions with top-down spatial knowledge. In Proceedings of the 12th International Conference on Natural Language Generation, pages 540–551, Tokyo, Japan. Association for Computational Linguistics.
- Ghazvininejad et al. (2018) Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. Proceedings of the AAAI Conference on Artificial Intelligence, 32.
- Gregor et al. (2015) Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. 2015. Draw: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1462–1471, Lille, France. PMLR.
- Huang et al. (2020) Qingbao Huang, Jielong Wei, Yi Cai, Changmeng Zheng, Junying Chen, Ho-fung Leung, and Qing Li. 2020. Aligned dual channel graph convolutional network for visual question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7166–7176, Online. Association for Computational Linguistics.
- Kim et al. (2020) Byeongchang Kim, Jaewoo Ahn, and Gunhee Kim. 2020. Sequential latent knowledge selection for knowledge-grounded dialogue. In International Conference on Learning Representations.
- Lee et al. (2018) Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. arXiv preprint arXiv:1803.08024.
- Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.
- Li et al. (2016b) Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany. Association for Computational Linguistics.
- Li et al. (2019a) Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019a. Relation-aware graph attention network for visual question answering. ICCV.
- Li et al. (2019b) Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. 2019b. Object-driven text-to-image synthesis via adversarial training.
- Liang et al. (2021) Zujie Liang, Huang Hu, Can Xu, Chongyang Tao, Xiubo Geng, Yining Chen, Fan Liang, and Daxin Jiang. 2021. Maria: A visual experience powered conversational agent. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5596–5611, Online. Association for Computational Linguistics.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Lu et al. (2016) Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In NIPS, page 289–297.
- Mansimov et al. (2016) Elman Mansimov, Emilio Parisotto, Jimmy Ba, and Ruslan Salakhutdinov. 2016. Generating images from captions with attention. In ICLR.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.
- Nguyen et al. (2017) Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. 2017. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
- Qi et al. (2020) Jiaxin Qi, Yulei Niu, Jianqiang Huang, and Hanwang Zhang. 2020. Two causal principles for improving visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
- Ramesh et al. (2021) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR.
- Reed et al. (2016) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1060–1069, New York, New York, USA. PMLR.
- Ren et al. (2015) Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Exploring models and data for image question answering. In NIPS, page 2953–2961.
- Roller et al. (2021) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.
- See et al. (2019) Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? how controllable attributes affect human judgments.
- Serban et al. (2016) Iulian Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. Proceedings of the AAAI Conference on Artificial Intelligence, 30.
- Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1577–1586, Beijing, China. Association for Computational Linguistics.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, page 3104–3112.
- Tao et al. (2018) Chongyang Tao, Shen Gao, Mingyue Shang, Wei Wu, Dongyan Zhao, and Rui Yan. 2018. Get the point of my utterance! learning towards effective responses with multi-head attention mechanism. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4418–4424. International Joint Conferences on Artificial Intelligence Organization.
- Thomee et al. (2016) Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73.
- Vaswani et al. (2017a) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Vaswani et al. (2017b) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model.
- Xing et al. (2017) Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, and Wei-Ying Ma. 2017. Hierarchical recurrent attention network for response generation.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2048–2057, Lille, France. PMLR.
- Xu et al. (2018) Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. Attngan: Fine-grained text to image generation with attentional generative adversarial networks.
- Yang et al. (2021) Ze Yang, Wei Wu, Huang Hu, Can Xu, Wei Wang, and Zhoujun Li. 2021. Open domain dialogue generation with latent images. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16):14239–14247.
- Zang et al. (2021) Xiaoxue Zang, Lijuan Liu, Maria Wang, Yang Song, Hao Zhang, and Jindong Chen. 2021. PhotoChat: A human-human dialogue dataset with photo sharing behavior for joint image-text modeling. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6142–6152, Online. Association for Computational Linguistics.
- Zhang et al. (2019) Hainan Zhang, Yanyan Lan, Liang Pang, Jiafeng Guo, and Xueqi Cheng. 2019. ReCoSa: Detecting the relevant contexts with self-attention for multi-turn dialogue generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3721–3730, Florence, Italy. Association for Computational Linguistics.
- Zhang et al. (2017) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. 2017. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV.
- Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
- Zhang et al. (2020) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In ACL, system demonstration.
- Zhao et al. (2017) Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664, Vancouver, Canada. Association for Computational Linguistics.
- Zhou et al. (2018) Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional chatting machine: Emotional conversation generation with internal and external memory.