Open Domain Dialogue Generation with Latent Images

04/04/2020 ∙ by Ze Yang, et al. ∙ Beihang University ∙ Microsoft

We consider grounding open domain dialogues with images. Existing work assumes that both an image and a textual context are available, but image-grounded dialogues by nature are more difficult to obtain than textual dialogues. Thus, we propose learning a response generation model with both image-grounded dialogues and textual dialogues by assuming that there is a latent variable in a textual dialogue that represents the image, and trying to recover the latent image through text-to-image generation techniques. The likelihood of the two types of dialogues is then formulated by a response generator and an image reconstructor that are learned within a conditional variational auto-encoding framework. Empirical studies are conducted in both image-grounded conversation and text-based conversation. In the first scenario, image-grounded dialogues, especially under a low-resource setting, can be effectively augmented by textual dialogues with latent images; while in the second scenario, latent images can enrich the content of responses and at the same time keep them relevant to contexts.




1 Introduction

Open domain dialogue generation, due to the successful application in socialbots such as Microsoft XiaoIce Shum et al. (2018) and in virtual assistants such as Amazon Alexa Ram et al. (2018), is emerging as a prominent research direction in conversational AI. Benefiting from the advances of neural sequence modeling Sutskever et al. (2014); Vaswani et al. (2017), existing work has achieved promising results on response quality Zhang et al. (2019, 2018a); Xu et al. (2019), but often makes use of only textual contexts in response generation. Human conversations, on the other hand, could be grounded by more than one kind of perception. In addition to what they read (e.g., text), people also respond according to what they hear (e.g., voice or music) and what they see (e.g., images or videos). Hence, there is a clear trend in the research of open domain dialogues that text-based single-modal conversation is moving to perceptually-grounded multi-modal conversation Mostafazadeh et al. (2017); Chu et al. (2018); Le et al. (2019); Hori et al. (2019).

Figure 1: An example of image-grounded dialogues.

We consider grounding open domain dialogues with images. Existing work Shuster et al. (2018); Huber et al. (2018) formalizes the problem as response generation (or selection) with both a given image and several turns of conversation history as contexts, and focuses on benchmarking a combination of state-of-the-art neural structures in image modeling and dialogue modeling with either crowd-sourced data Shuster et al. (2018) or selected data from social media Huber et al. (2018). Figure 1 shows an example of image-grounded dialogues used in the existing work Shuster et al. (2018). While these works provide test beds for future studies, the scale of the data (e.g., a few hundred thousand triples) could impede further progress, because collecting such data by human effort is expensive and exhausting, and because one has to give up a large proportion of conversations in social media as they are naturally formed without images (e.g., more than % conversations in Twitter as indicated in Morris et al. (2016)). Motivated by this, we propose leveraging both multi-modal data (i.e., image-context-response triples) and large-scale single-modal data (i.e., textual dialogues) for image-grounded response generation. The key idea is that we assume that there is a latent image behind a textual conversation, and try to recover the visual background from the text. Advantages of our approach are two-fold: (1) for image-grounded conversation where an image is given first, textual dialogues with latent visual variables can augment the multi-modal data, and help alleviate the data sparsity issue; and (2) for text-based conversation where only textual dialogues are available, the latent variables can provide visual signals for response generation, and help suppress “safe responses” Li et al. (2015).

Our model consists of a response generator and an image reconstructor. The former synthesizes a response with both an image representation and a textual context representation as conditions, and is shared between the case where the image is explicitly available and the case where it is a latent variable; the latter infers the latent image for a textual context. Challenges then include how to define the two models and how to effectively learn them from both multi-modal and single-modal data. Encouraged by the recent progress on text-to-image synthesis, we define the image reconstructor within an attentional generative adversarial network framework Xu et al. (2018); Qiao et al. (2019) where GAN-based image generation starts from a text representation and a random noise, and grows from a small scale to a large scale by letting sub-regions in each scale attend to words of the text. The response generator is defined within an encoder-decoder framework where attentive modules in the decoder involve both attention to the textual context and attention to sub-regions of the image. Considering that an inferred image could contain noise, and words in an open domain response may not relate to the image all the time, we design a gate in response decoding to control the contribution from the visual signals in prediction of each word. The two models are jointly learned from both multi-modal data and single-modal data within a conditional variational auto-encoding (CVAE) framework, where through pre-training the image reconstructor on the multi-modal data and fixing it in learning, we can circumvent the intractable KL term in the evidence lower bound and approximate the bound with the random noise in the image generation in a way similar to the reparameterization trick Kingma and Welling (2013). By this means, the learning approach not only unifies dialogue generation and image generation, but also extends the commonly used CVAE from plain and uninterpretable variables to structured and visually-explainable variables.

We test the proposed approach in both image-grounded conversation and text-based conversation. For the first scenario, we exploit the image-chat data published in Shuster et al. (2018), and check if the model learned using both multi-modal and single-modal data can improve upon the state-of-the-art model learned solely from the multi-modal data, especially when the multi-modal data is small in scale. For the second scenario, we leverage the Reddit Conversation Corpus published by Dziri et al. (2018), and examine if latent images can provide useful signals for response generation. Evaluation results indicate that the proposed model can significantly outperform state-of-the-art models in terms of response quality in Scenario I and response informativeness in Scenario II.

Our contributions are three-fold: (1) proposal of image-grounded dialogue generation with both multi-modal and single-modal data; (2) unifying text-to-image generation and image-grounded dialogue generation within a conditional variational auto-encoding framework; and (3) empirical verification of the effectiveness of the proposed approach in both image-grounded conversation and text-based conversation.

2 Related Work

End-to-end open domain dialogue generation is inspired by machine translation, where the vanilla sequence-to-sequence with attention architecture was applied to the task Shang et al. (2015); Vinyals and Le (2015) with promising results. The model has since been widely extended to handle the “safe response” problem Li et al. (2015); to model conversation contexts Serban et al. (2016, 2017); and to incorporate various types of knowledge for personalized Zhang et al. (2018b), emotional Zhou et al. (2018a), document-grounded Zhou et al. (2018b), and multi-modal Shuster et al. (2018) conversation. This work falls in the research of multi-modal open domain conversation in which a response is generated according to both a textual context and an image. The difference is that, by recovering the hidden image behind a textual dialogue, we leverage both multi-modal data and single-modal data to handle the data sparsity problem in image-grounded response generation, which existing studies have not addressed.

Our work belongs to the interdisciplinary research of vision and language among various tasks such as image captioning Vinyals et al. (2015), visual question answering Antol et al. (2015), visual dialog Das et al. (2017), etc. Different from visual dialog which is a task of image-grounded multi-turn question-answering, the task in question is how to reply to both an image context and a textual context with proper responses in open chat. Our approach is built upon the advances of text-to-image generation techniques which are dominated by generative adversarial networks Reed et al. (2016); Zhang et al. (2017); Xu et al. (2018); Qiao et al. (2019). Rather than synthesizing an image from a caption, we consider recovering the image from a dialogue, which is encouraged by the promising results in the recent study on improving text-to-image generation by enriching caption with dialogs Sharma et al. (2018).

3 Approach

Figure 2: Architecture of our model.

We formalize the setting upon which we study the problem of image-grounded response generation, and then we elaborate the proposed models and the learning approach.

3.1 Problem Formalization

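Suppose that we have an image-grounded dialogue set $\mathcal{D}_I = \{(I_i, C_i, Y_i)\}_{i=1}^{n}$, where the $i$-th triple $(I_i, C_i, Y_i)$ consists of an image $I_i$, a textual context $C_i = (u_{i,1}, \ldots, u_{i,l})$ with $u_{i,j}$ the $j$-th utterance, and a response $Y_i$. Besides, we further assume that there is a textual dialogue set $\mathcal{D}_T = \{(C_i, Y_i)\}_{i=1}^{N}$, where $C_i$ and $Y_i$ refer to a context and a response respectively, and $N > n$. The goal is to learn two probability distributions $P(Y|I,C)$ and $P(Y|C)$ with both $\mathcal{D}_I$ and $\mathcal{D}_T$, and thus given a new pair $(I, C)$ in image-grounded conversation and a new context $C$ in text-based conversation, one can generate responses according to $P(Y|I,C)$ and $P(Y|C)$ respectively. $\forall (C, Y) \in \mathcal{D}_T$, we assume that there is a latent variable $z$ that represents the image grounding $C$. Then, $P(Y|C)$ is factorized as $\int P(Y|z,C)\,P(z|C)\,dz$. Here, by encoding an explicit image $I$ and a latent image $z$ in the same way, we can define $P(Y|I,C)$ and $P(Y|z,C)$ with one model. Thus, in the later part, we use $P(Y|I,C)$ and $P(Y|z,C)$ interchangeably.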

3.2 Learning Objective

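We learn $P(Y|I,C)$ and $P(Y|C)$ by maximizing the log-likelihood of $\mathcal{D}_I$ and $\mathcal{D}_T$ which is given by

$$\mathcal{J} = \underbrace{\sum_{(I,C,Y)\in\mathcal{D}_I} \log P(Y|I,C)}_{\mathcal{J}_I} + \underbrace{\sum_{(C,Y)\in\mathcal{D}_T} \log P(Y|C)}_{\mathcal{J}_T} \qquad (1)$$

While $\mathcal{J}_I$ can be directly optimized through stochastic gradient descent, the problem is $\mathcal{J}_T$, since $P(Y|C) = \int P(Y|z,C)\,P(z|C)\,dz$ is often intractable. Thus, we employ the conditional variational auto-encoding (CVAE) framework Sohn et al. (2015), and obtain the evidence lower bound (ELBO) of $\mathcal{J}_T$ as:

$$\mathcal{L}_T = \sum_{(C,Y)\in\mathcal{D}_T} \Big( -KL\big[Q(z|C,Y)\,\|\,P(z|C)\big] + \mathbb{E}_{z\sim Q(z|C,Y)}\big[\log P(Y|z,C)\big] \Big) \le \mathcal{J}_T \qquad (2)$$

In Equation (2), $KL[\cdot\|\cdot]$ refers to Kullback-Leibler divergence, and $Q(z|C,Y)$ is the posterior distribution of image generation. In CVAE, $\mathbb{E}_{z\sim Q(z|C,Y)}[\log P(Y|z,C)]$ is often approximated by sampling with $Q(z|C,Y)$ reparameterized using a deterministic function $g(C,Y,\epsilon)$ to reduce variance Kingma and Welling (2013). Formally, $\mathbb{E}_{z\sim Q(z|C,Y)}[\log P(Y|z,C)]$ can be approximated as

$$\frac{1}{L}\sum_{i=1}^{L} \log P\big(Y \,\big|\, g(C,Y,\epsilon_i), C\big), \quad \epsilon_i \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \qquad (3)$$

where $\mathcal{N}(\mathbf{0},\mathbf{I})$ denotes a normal distribution. Since the latent variable $z$ represents an image, $g(C,Y,\epsilon)$ can be understood as reconstructing an image from $(C,Y)$ (with a random noise $\epsilon$). Without loss of generality, we define $g(\mathcal{T},\epsilon)$ as an image reconstructor based on text $\mathcal{T}$. When $\mathcal{T}=(C,Y)$, $Q(z|C,Y)$ is defined by $\int_{\epsilon\in\Omega(z)} f(\epsilon)\,d\epsilon$, where $\Omega(z)=\{\epsilon \mid g(\mathcal{T},\epsilon)=z\}$ and $f(\epsilon)$ is the density of $\mathcal{N}(\mathbf{0},\mathbf{I})$; when $\mathcal{T}=C$, $P(z|C)=\int_{\epsilon\in\Omega'(z)} f(\epsilon)\,d\epsilon$, where $\Omega'(z)=\{\epsilon \mid g(C,\epsilon)=z\}$.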
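The sampling-based approximation of the expected log-likelihood can be illustrated with a small sketch. Everything below is a toy stand-in rather than the paper's model: `toy_reconstruct` plays the role of the deterministic image reconstructor applied to text plus Gaussian noise, and `toy_log_lik` plays the role of the response log-likelihood given the reconstructed latent image. Both functions, their names, and the noise dimension are assumptions made only for illustration.

```python
import math
import random


def mc_expected_log_lik(log_lik, reconstruct, text, num_samples=4, noise_dim=8, seed=0):
    """Average log-likelihood over latent images z = reconstruct(text, eps), eps ~ N(0, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Draw standard Gaussian noise; all randomness lives in eps.
        eps = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        # The reconstructor is deterministic given (text, eps).
        z = reconstruct(text, eps)
        total += log_lik(z)
    return total / num_samples


def toy_reconstruct(text, eps):
    # Hypothetical reconstructor: shifts the noise by a text-dependent offset.
    shift = len(text) * 0.01
    return [e + shift for e in eps]


def toy_log_lik(z):
    # Hypothetical log-likelihood: standard Gaussian log-density of z.
    return -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2 * math.pi)


estimate = mc_expected_log_lik(toy_log_lik, toy_reconstruct, "hello there", num_samples=16)
```

Because the noise is the only source of randomness, a fixed seed makes the estimate reproducible, and averaging over more samples reduces its variance.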

3.3 Image Reconstructor

The image reconstructor $g(\mathcal{T},\epsilon)$ generates an image from text $\mathcal{T}$ and a Gaussian random noise $\epsilon$, and thus can be naturally modeled with generative adversarial networks Goodfellow et al. (2014) that represent the state-of-the-art technique in text-to-image (T2I) generation. Inspired by the recent progress on T2I generation, we define $g(\mathcal{T},\epsilon)$ within the attentional generative adversarial network framework Xu et al. (2018); Qiao et al. (2019).

The left part of Figure 2 illustrates the architecture of $g(\mathcal{T},\epsilon)$. As a premise, $\mathcal{T}=(w_1,\ldots,w_l)$ is first transformed to $H_{\mathcal{T}}=(h_{\mathcal{T},1},\ldots,h_{\mathcal{T},l})$ by a bidirectional recurrent neural network with gated recurrent units (BiGRUs) Cho et al. (2014), where $h_{\mathcal{T},i}$ is the hidden representation of the $i$-th token $w_i$, and $l$ is the length of $\mathcal{T}$. Then, $h_{\mathcal{T},l}$ is converted into a conditioning vector $h_{ca}$ by the conditioning augmentation algorithm Zhang et al. (2017) as input of image generation.

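We then construct a stacked attentional generative network that allows multi-stage refinement in image generation. The network consists of $m$ attentional visual refiners $\{F_0,\ldots,F_{m-1}\}$ and $m$ corresponding image generators $\{G_0,\ldots,G_{m-1}\}$ that generate an image for $\mathcal{T}$ from a small scale to a large scale. The generation process can be formulated as

$$f_0 = F_0([\epsilon, h_{ca}]), \qquad f_i = F_i\big([f_{i-1}, \mathrm{Attn}_i(H_{\mathcal{T}}, f_{i-1})]\big),\ i\in\{1,\ldots,m-1\}, \qquad \hat{I}_i = G_i(f_i),\ i\in\{0,\ldots,m-1\} \qquad (4)$$

where $[\cdot,\cdot]$ represents a concatenation operation, $f_i$ denotes the image feature matrix of $N_i$ sub-regions which is then fed into $G_i$ to generate an image $\hat{I}_i$, and $\mathrm{Attn}_i$ is an attention module that encourages the sub-region feature to focus on certain words during generation. Specifically, $\forall i\in\{1,\ldots,m-1\}$, $\mathrm{Attn}_i(H_{\mathcal{T}}, f_{i-1})$ is given by

$$(U_i H_{\mathcal{T}})\,\mathrm{softmax}\big(f_{i-1}^{\top}(U_i H_{\mathcal{T}})\big)^{\top}$$

where $U_i$ is a parameter that maps $H_{\mathcal{T}}$ to the semantic space of $f_{i-1}$. $F_0$ consists of a fully connected layer and three upsampling layers, and $\forall i\in\{1,\ldots,m-1\}$, $F_i$ consists of two residual blocks followed by an upsampling layer. $\forall i\in\{0,\ldots,m-1\}$, generator $G_i$ is composed of a convolutional layer with activation.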
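As a rough illustration of the attention module described above, the following sketch lets each image sub-region compute softmax weights over word features and return the weighted sum of those features. It assumes the word features have already been projected into the sub-region feature space, uses dot-product scores, and stores features as plain Python lists; the function names are hypothetical and the code is not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def region_word_attention(regions, words):
    """For each region feature, return an attention-weighted sum of word features.

    regions: list of N vectors of dimension d (image sub-region features)
    words:   list of l vectors of dimension d (word features, already projected)
    """
    contexts = []
    for r in regions:
        # Dot-product score between this region and every word.
        scores = [sum(a * b for a, b in zip(r, w)) for w in words]
        weights = softmax(scores)
        # Word-context vector for this region.
        ctx = [sum(wt * w[k] for wt, w in zip(weights, words)) for k in range(len(r))]
        contexts.append(ctx)
    return contexts


words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
contexts = region_word_attention(regions, words)
```

In this toy input the first region aligns almost entirely with the first word, while the all-zero region attends to both words equally.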

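The image reconstructor $g(\mathcal{T},\epsilon)$ is pre-trained with $\mathcal{D}_I$ by concatenating textual context $C$ and response $Y$ as $\mathcal{T}$. The objective of learning is given by

$$\mathcal{L}_G = \sum_{i=0}^{m-1} \mathcal{L}_{G_i}$$

where $\mathcal{L}_{G_i}$ is the adversarial loss of $G_i$ which is defined as

$$\mathcal{L}_{G_i} = -\tfrac{1}{2}\,\mathbb{E}_{\hat{I}_i\sim P_{G_i}}\big[\log D_i(\hat{I}_i)\big] - \tfrac{1}{2}\,\mathbb{E}_{\hat{I}_i\sim P_{G_i}}\big[\log D_i(\hat{I}_i, h_{\mathcal{T},l})\big] \qquad (6)$$

In Equation (6), $D_i$ is the discriminator corresponding to the generator $G_i$. The first term is the realism adversarial loss by which $G_i$ tries to fool $D_i$ with a generated image, and the second term is the text-image semantic consistency adversarial loss which determines if the generated image is consistent with the text condition. $G_i$ is alternately trained with $D_i$ under an objective given by

$$\mathcal{L}_{D_i} = -\tfrac{1}{2}\,\mathbb{E}_{I_i\sim P_{data_i}}\big[\log D_i(I_i)\big] - \tfrac{1}{2}\,\mathbb{E}_{\hat{I}_i\sim P_{G_i}}\big[\log(1 - D_i(\hat{I}_i))\big] - \tfrac{1}{2}\,\mathbb{E}_{I_i\sim P_{data_i}}\big[\log D_i(I_i, h_{\mathcal{T},l})\big] - \tfrac{1}{2}\,\mathbb{E}_{\hat{I}_i\sim P_{G_i}}\big[\log(1 - D_i(\hat{I}_i, h_{\mathcal{T},l}))\big] \qquad (7)$$

where $I_i$ is a real image re-scaled to adapt to $D_i$. Note that we do not include the DAMSM loss Xu et al. (2018) and the STREAM loss Qiao et al. (2019) in the objective, since we find that they increase the cost of learning but do not make much difference in response generation.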

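After pre-training, instead of fine-tuning $g(\mathcal{T},\epsilon)$ by optimizing $\mathcal{L}_T$, we fix the parameters of the model, that is, the parameters of $Q(z|C,Y)$ and $P(z|C)$ are also fixed. Thus $KL[Q(z|C,Y)\,\|\,P(z|C)]$ becomes a constant and the learning objective now could be rewritten as

$$\mathcal{J} = \mathcal{J}_I + \sum_{(C,Y)\in\mathcal{D}_T} \frac{1}{L}\sum_{i=1}^{L} \log P\big(Y \,\big|\, C, g(\mathcal{T},\epsilon_i)\big), \quad \mathcal{T}=(C,Y),\ \epsilon_i\sim\mathcal{N}(\mathbf{0},\mathbf{I}) \qquad (8)$$

Fixing $g(\mathcal{T},\epsilon)$ may make the ELBO of $\mathcal{J}_T$ even looser, but it lets us circumvent the intractable KL term when $g(\mathcal{T},\epsilon)$ is defined by a complicated non-linear function. In experiments, we find that a well pre-trained $g(\mathcal{T},\epsilon)$ can already infer reasonable images for contexts, and thus aids response generation in both image-grounded conversation and text-based conversation. It is interesting to note that since $g(\mathcal{T},\epsilon)$ is learned with GAN, the learning approach defined by Equation (8) in general falls in a (conditional) adversarial auto-encoding framework Makhzani et al. (2015).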

3.4 Response Generator

The right part of Figure 2 shows the architecture of the response generator ($P(Y|I,C)$). The model consists of a context encoder, an image encoder, and a response decoder. The context encoder flattens the context $C$ by concatenating the utterances as a sequence of words (length $=l$), and transforms $C$ into a feature matrix $H_C$ through a BiGRU shared with the text encoder of the image reconstructor. The image encoder is a convolutional neural network (CNN) built upon the Inception-V3 model pre-trained on ImageNet. We rescale an image (either $I$ or $\hat{I}_{m-1}$) to be pixels, and then feed the image to the encoder to extract region features $\hat{H}_I$ with $N_I$ columns, where $N_I$ is the number of sub-regions. $\hat{H}_I$ is finally mapped to the space of $H_C$ by $H_I = W_I \hat{H}_I$ with $W_I$ parameters. Parameters from the Inception-V3 model are fixed during learning.

The decoder predicts a response word by word through attending to both the context feature matrix $H_C$ and the image feature matrix $H_I$. At step $t$, the hidden state of the decoder is calculated by

$$h_t = \mathrm{GRU}(h_{t-1}, e_{y_{t-1}}) \qquad (9)$$

where $e_{y_{t-1}}$ is the embedding of the word generated at step $t-1$, and $h_{t-1}$ refers to the hidden state at step $t-1$ with $h_0 = h_{C,l}$. Then, when predicting the $t$-th word of the response, the decoder calculates the probability $P(y_t|y_{1:t-1},C,I)$ by

$$P(y_t|y_{1:t-1},C,I) = \mathrm{softmax}\big(W_o[h_t, c_t] + b\big) \qquad (10)$$

where $W_o$ and $b$ are parameters, $|V|$ is the size of the response vocabulary, and $c_t$ is a multi-modal context vector defined as

$$c_t = c_{C,t} + g\cdot c_{I,t}, \qquad g = \sigma\big(W_g[h_t, c_{C,t}, c_{I,t}]\big) \qquad (11)$$

In Equation (11), $c_{C,t}$ and $c_{I,t}$ are obtained via attention over $H_C$ and $H_I$ respectively, $g\in(0,1)$ is a gate that dynamically controls the contribution of $c_{I,t}$ in response generation, $\sigma(\cdot)$ is the sigmoid function, and $W_g$ is a parameter. Here, the usage of the gate is motivated by the considerations that (1) open domain dialogues are not always related to the image (e.g., the second turn in Figure 1); (2) even though the semantics of a response is grounded on the image (e.g., the first turn in Figure 1), a large proportion of the response can still be irrelevant to the visual content (e.g., “she needs to adjust”); and (3) an inferred image $z$ could be noisy. In these cases, a small $g$ can block noisy signals given by $c_{I,t}$. The $c_{C,t}$ is defined by

$$c_{C,t} = \sum_{i=1}^{l} \alpha_{t,i}\, h_{C,i} \qquad (12)$$

where $h_{C,i}$ is the $i$-th column of matrix $H_C$, and $\alpha_{t,i}$ is the attention weight that $h_t$ assigns to $h_{C,i}$. $c_{I,t}$ is defined in a similar way with $H_C$ replaced by $H_I$.
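The gating idea can be sketched in a few lines. The code below assumes that the gate is a sigmoid over a linear function of the concatenated decoder state, textual context vector, and image context vector, and that the gated image context is added to the textual context; `gated_context` and its argument names are hypothetical and serve only to illustrate the mechanism.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_context(h_t, c_text, c_img, w_gate):
    """Return (c_t, g) with c_t = c_text + g * c_img and g = sigmoid(w_gate . [h_t; c_text; c_img])."""
    features = h_t + c_text + c_img  # list concatenation of the three vectors
    g = sigmoid(sum(w * f for w, f in zip(w_gate, features)))
    c_t = [ct + g * ci for ct, ci in zip(c_text, c_img)]
    return c_t, g


h_t = [1.0, 1.0]
c_text = [0.2, 0.4]
c_img = [5.0, 5.0]

# A strongly negative gate logit closes the gate, so the image signal is blocked.
closed, g_closed = gated_context(h_t, c_text, c_img, [-20.0, -20.0, 0.0, 0.0, 0.0, 0.0])

# Zero weights give g = 0.5, so half of the image context is mixed in.
half, g_half = gated_context(h_t, c_text, c_img, [0.0] * 6)
```

With the gate nearly closed, the multi-modal context vector is essentially the textual context vector, which is the behavior wanted when the inferred image is noisy or irrelevant to the current word.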

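Let $Y=(y_1,\ldots,y_o)$, then the probability $P(Y|I,C)$ can be formulated as

$$P(Y|I,C) = P(y_1|I,C)\prod_{t=2}^{o} P(y_t|y_{1:t-1},I,C) \qquad (13)$$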
4 Experiments

We test our model on two tasks: image-grounded conversation and text-based conversation. The first task requires a model to generate a response based on a textual context and a given image; while in the second task, a response is synthesized only based on the textual context.

4.1 Experimental Setup


For image-grounded conversation, we choose the Image-Chat data published in Shuster et al. (2018). The dataset is made up of high quality image-grounded open domain dialogues collected from crowd-workers. Each dialogue consists of three turns at most based on a given image and two personalities. Since we focus on image-grounded conversation, the personality information in the data is discarded. For the scenario of text-based conversation, we use the Reddit Conversation Corpus published by Dziri et al. (2018) which contains more than 15M dialogues and each dialogue has at least utterances. We keep most frequent words in the two datasets as a vocabulary for the text encoder and the response decoder. Other words are replaced by unk. To reduce noise in the Reddit data, we remove dialogues in which more than % words in the response are unks, and dialogues with a response shorter than words. After pre-processing, we randomly sample 1M/20K/20K dialogues as the training/validation/test set of the Reddit data, and make sure that there is no overlap among the three sets. Statistics of the two datasets are shown in Table 1.

Evaluation Metrics.

We compare different models with both automatic metrics and human judgement. In automatic evaluation, we report perplexity (PPL) of ground-truth responses in test and measure quality of generated responses on both relevance and informativeness. In terms of relevance, besides BLEU-1 Papineni et al. (2002) and Rouge-L Lin (2004), we follow Serban et al. (2017) and employ Embedding Average (Average), Embedding Extrema (Extrema), and Embedding Greedy (Greedy) as metrics. All the metrics are computed by scripts of a public NLG evaluation project. In terms of informativeness, we follow Li et al. (2015) and use Distinct-1 (Dist-1) and Distinct-2 (Dist-2) as metrics which are calculated as the ratios of distinct unigrams and bigrams in responses. In human evaluation, for each task, we randomly sample examples from the test set and recruit well-educated native speakers as annotators to label the responses generated by each model. The annotators are required to judge the quality of each response from aspects including fluency, relevance, and richness, and assign a score from {0, 1, 2} on each aspect, which means bad, fair, and good, respectively. In image-grounded conversation, relevance is judged in terms of both context and image. Each response receives scores on each aspect, and average scores over annotators and responses are used as measures. Agreement among the annotators is measured by Fleiss’ kappa Fleiss and Cohen (1973).
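For concreteness, the Distinct-n ratio can be computed as in the sketch below. It assumes responses are already tokenized and that n-grams are pooled over the whole set of responses before taking the ratio; whether the evaluation scripts used in the paper pool at corpus level or average per response is not stated here, so treat that choice as an assumption.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to total n-grams over a list of tokenized responses."""
    total = 0
    seen = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0


responses = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
dist1 = distinct_n(responses, 1)  # 5 distinct unigrams out of 8
dist2 = distinct_n(responses, 2)  # 4 distinct bigrams out of 6
```

Generic, repetitive responses share most of their n-grams, so they push both ratios down, which is why the metric is used as a proxy for informativeness.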

4.2 Baselines

The following models are selected as baselines in image-grounded conversation: (1) T&I: a multi-modal emotional response generation model proposed in Huber et al. (2018), which conducts response generation based on textual features, image features, and emotional features. In our implementation, we only take account of the textual features and the image features; (2) ImgRG: an ablation of the proposed model where only the response generator is learned with image-grounded dialogues by optimizing the image-grounded term in Equation (1); and (3) T&I (w/ T) and ImgRG (w/ T): variants of baselines (1) and (2) which are trained with the Image-Chat data and the Reddit data through patching a dummy image for each textual dialogue (the RGB values of all pixels are set as (128,128,128)).

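| Dataset | Statistic | Train | Valid | Test |
|---|---|---|---|---|
| ImageChat | # Images | 186,782 | 5,000 | 9,997 |
| ImageChat | # Dialogues | 186,782 | 5,000 | 9,997 |
| ImageChat | # Utterances | 355,862 | 15,000 | 29,991 |
| Reddit | # Dialogues | 1,000,000 | 20,000 | 20,000 |
| Reddit | # Utterances | 3,304,440 | 67,908 | 66,028 |

Table 1: Statistics of the two datasets.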

Baselines for text-based conversation include (1) Seq2Seq: sequence to sequence with attention Bahdanau et al. (2015); (2) HRED: the hierarchical recurrent encoder-decoder model proposed in Serban et al. (2016); (3) VHRED: an extension of HRED that introduces latent variables into generation Serban et al. (2017); and (4) ReCoSa: a hierarchical transformer-based model that exhibits state-of-the-art performance on benchmarks of text-based conversation Zhang et al. (2019). Note that to make the comparison fair, we train the baselines with both the Reddit training set and the text data in the Image-Chat training set.

We denote our model as ImgVAE, in which the image reconstructor is pre-trained with the Image-Chat training set, and the response generator is then trained with both the Image-Chat training set and the Reddit training set. Note that in Image-Chat, all models perform response generation with the ground-truth images in the test data.

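| Task | Model | PPL | BLEU-1 | Rouge-L | Average | Extrema | Greedy | Dist-1 | Dist-2 |
|---|---|---|---|---|---|---|---|---|---|
| Image-grounded | T&I | 51.52 | 9.13 | 13.3 | 82.28 | 46.56 | 64.85 | 0.12 | 0.32 |
| Image-grounded | ImgRG | 51.93 | 12.50 | 14.42 | 85.45 | 49.93 | 67.28 | 0.55 | 1.95 |
| Image-grounded | T&I (w/ T) | 45.75 | 11.91 | 12.89 | 79.46 | 49.15 | 67.21 | 0.21 | 0.47 |
| Image-grounded | ImgRG (w/ T) | 46.19 | 13.61 | 14.72 | 84.65 | 50.73 | 67.97 | 0.88 | 3.06 |
| Image-grounded | ImgVAE | 41.94 | 16.07 | 15.98 | 85.81 | 49.59 | 67.44 | 1.68 | 7.22 |
| Image-grounded | ImgVAE (w/o gate) | 43.41 | 15.45 | 15.08 | 85.18 | 49.41 | 67.11 | 1.35 | 5.95 |
| Text-based | Seq2Seq | 77.27 | 12.21 | 10.81 | 78.38 | 40.06 | 62.64 | 0.53 | 1.96 |
| Text-based | HRED | 84.02 | 11.68 | 11.29 | 75.54 | 37.49 | 60.41 | 0.89 | 3.21 |
| Text-based | VHRED | 78.01 | 12.22 | 11.82 | 75.57 | 39.24 | 62.07 | 0.87 | 3.49 |
| Text-based | ReCoSa | 71.75 | 12.75 | 11.75 | 79.84 | 42.29 | 63.02 | 0.66 | 3.83 |
| Text-based | ImgVAE | 72.06 | 12.58 | 12.05 | 79.95 | 42.38 | 63.55 | 1.52 | 6.34 |
| Text-based | ImgVAE (w/o gate) | 72.54 | 12.56 | 11.37 | 79.66 | 42.03 | 63.63 | 1.12 | 4.63 |

Table 2: Evaluation results on automatic metrics. Numbers in bold indicate the best performing model on the corresponding metrics.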

4.3 Implementation Details

In both tasks, , , , and are set as , , , and respectively. The image reconstructor has attentional visual refiners (i.e. ), and the number of image sub-regions and are set as and respectively. The dimension of and the dimension of the augmented conditioning vector are set as . We learn all models using the Adam algorithm Kingma and Ba (2015) and the learning rates for the image reconstructor and the response generator are set as and respectively. We terminate training when perplexity on the validation sets does not drop in 3 consecutive epochs. To stabilize adversarial training of the image reconstructor and avoid text representations being biased to image reconstruction, we pre-train the text encoder with seq2seq on the Reddit and textual part of Image-Chat training data, and fix the parameters in the learning of our model.
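The stopping rule (terminate when validation perplexity has not dropped for 3 consecutive epochs) can be sketched as a small helper. The function name and the choice to compare against the best perplexity seen before the last `patience` epochs are assumptions; the text does not specify how ties or the reference value are handled.

```python
def should_stop(val_ppl_history, patience=3):
    """Return True when validation perplexity has not improved for `patience` consecutive epochs."""
    if len(val_ppl_history) <= patience:
        return False
    # Best perplexity reached before the most recent `patience` epochs.
    best_before = min(val_ppl_history[:-patience])
    # Stop only if none of the recent epochs beats that value.
    return all(p >= best_before for p in val_ppl_history[-patience:])


stalled = should_stop([50.0, 45.0, 44.0, 44.5, 44.2, 44.1])
improving = should_stop([50.0, 45.0, 44.0, 43.0, 44.0, 45.0])
```

In the first history the last three epochs never beat 44.0, so training would stop; in the second, the drop to 43.0 resets the patience window.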

4.4 Evaluation Results

Table 2 reports evaluation results on automatic metrics. In image-grounded conversation, ImgVAE significantly outperforms all baseline models on most metrics. Particularly, ImgVAE outperforms T&I and ImgRG even after their training is augmented with the Reddit data. The results indicate the effectiveness of the proposed approach on leveraging both multi-modal data and single-modal data for image-grounded dialogue generation. In text-based conversation, ImgVAE achieves comparable performance with the state-of-the-art deep transformer structure (i.e., ReCoSa) in terms of response relevance and PPL, but improves upon informativeness of responses by large margins. This is because latent images, when properly controlled by the gate in the response generator, can enhance the appearance of informative content in responses, as will be further verified by human annotations and the analysis in Discussions.

Table 3 reports human evaluation results. Basically, all models in both tasks can generate fluent and grammatical responses for most test input. In image-grounded conversation, ImgVAE outperforms all baselines in terms of context-relevance, image-relevance, and richness, which is consistent with the automatic evaluation results. In text-based conversation, ImgVAE significantly improves upon richness, which further demonstrates the effect of latent images. All kappa values exceed or are close to 0.6, indicating substantial agreement among the annotators. For reference, we show two cases in supplementary material.

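| Task | Models | Fluency | Relevance (Text) | Relevance (Image) | Richness | Kappa |
|---|---|---|---|---|---|---|
| Image-grounded | T&I | 1.89 | 0.82 | 0.78 | 0.74 | 0.57 |
| Image-grounded | ImgRG | 1.82 | 0.86 | 0.85 | 0.80 | 0.60 |
| Image-grounded | T&I (w/ T) | 1.90 | 1.16 | 0.92 | 0.97 | 0.62 |
| Image-grounded | ImgRG (w/ T) | 1.86 | 1.23 | 1.04 | 1.08 | 0.58 |
| Image-grounded | ImgVAE | 1.91 | 1.42 | 1.29 | 1.38 | 0.65 |
| Text-based | Seq2Seq | 1.87 | 1.21 | - | 0.92 | 0.62 |
| Text-based | HRED | 1.88 | 1.12 | - | 0.78 | 0.70 |
| Text-based | VHRED | 1.66 | 1.05 | - | 1.10 | 0.61 |
| Text-based | ReCoSa | 1.87 | 1.32 | - | 1.12 | 0.63 |
| Text-based | ImgVAE | 1.86 | 1.48 | - | 1.47 | 0.63 |

Table 3: Human evaluation results.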

4.5 Discussions

In addition to the comparison with baselines, we are also curious about Q1: what is the performance of ImgVAE when image-grounded dialogues for training become more and more scarce? Q2: what content in responses is enriched by the latent images in text-based conversation? and Q3: what is the effect of the gate in the response generator in text-based dialogue generation?

Answer to Q1:

Figure 3 illustrates the performance of ImgVAE and the baselines in terms of PPL and Rouge-L when the training size of Image-Chat is gradually halved. Note that in training, the size of the Reddit data is kept unchanged. We can see that when the multi-modal training resource becomes more and more scarce, all baseline models suffer from a dramatic performance drop. Particularly, since T&I and ImgRG rely solely on the Image-Chat data, their performance drops faster than the others. This is because the baseline models, although some of them have been augmented with the textual dialogues in a trivial way, tend to overfit the small training data, and then generalize badly on the test set. On the other hand, benefiting from the large scale textual dialogues with latent images, ImgVAE exhibits robust performance in test with respect to the shrinkage of the training size of Image-Chat, and the advantage over the baselines becomes bigger and bigger with the reduction of image-grounded dialogues. The results demonstrate the efficacy of the proposed method against data sparsity in low-resource image-grounded dialogue generation.

Figure 3: Performance of the models on small multi-modal training data. (a) PPL; (b) Rouge-L.

Answer to Q2:

we define two new metrics in text-based conversation:

$$\text{topic} = \frac{1}{|\mathcal{D}_r^t|}\sum_{(c,r)\in\mathcal{D}_r^t} \frac{|\tau(r_g)|}{l(r_g)}, \qquad \text{novel-topic} = \frac{1}{|\mathcal{D}_r^t|}\sum_{(c,r)\in\mathcal{D}_r^t} \frac{|\tau(r_g) - \tau(c)|}{l(r_g)}$$

where $\mathcal{D}_r^t$ refers to the test set of the Reddit data, $(c,r)$ is a context-response pair, $r_g$ is a response generated according to $c$, $\tau(x)$ returns a set of topical words in sentence $x$, $|\cdot|$ measures the size of a set, and $l(x)$ returns the length of $x$. We refer to nouns and verbs as topical words, and recognize the POS tag of a word in a sentence with the NLTK POS Tagger (tags in question include: NN, NNS, NNP, NNPS, VB, VBD, VBG, VBN, VBP, VBZ). topic measures the average proportion of topical words in generated responses, while novel-topic further excludes topical words appearing in contexts. Table 4 gives the results on the two metrics. We can see that the latent images significantly enhance the ratio of topical words and the ratio of extra topical words in responses. Seq2Seq has a high topic score but the lowest novel-topic score, because it tends to copy words from contexts in response synthesis. topic and novel-topic for human responses in the test set are and respectively. This means that even though ImgVAE can enhance the appearance of informative content, it is still not as good as humans at bringing in new content and thus extending conversation, which could be a future direction for informative response generation.
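The two topical ratios can be sketched as follows. To stay self-contained, the code takes sentences that are already POS-tagged as lists of (word, tag) pairs instead of calling a tagger; the lowercasing of words before set comparison and the function names are assumptions made for illustration.

```python
# Penn Treebank tags treated as topical (nouns and verbs).
TOPICAL_TAGS = {"NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def topical_words(tagged):
    """Set of lowercased nouns and verbs in a POS-tagged sentence."""
    return {word.lower() for word, tag in tagged if tag in TOPICAL_TAGS}


def topic_scores(pairs):
    """pairs: list of (tagged_context, tagged_generated_response).

    Returns (ratio of topical words, ratio of topical words not present in the context),
    each averaged over all pairs and normalized by response length.
    """
    topic = 0.0
    novel = 0.0
    for context, response in pairs:
        in_response = topical_words(response)
        topic += len(in_response) / len(response)
        novel += len(in_response - topical_words(context)) / len(response)
    return topic / len(pairs), novel / len(pairs)


context = [("I", "PRP"), ("love", "VBP"), ("dogs", "NNS")]
response = [("Dogs", "NNS"), ("chase", "VBP"), ("cats", "NNS"), ("daily", "RB")]
topic, novel_topic = topic_scores([(context, response)])
```

In the toy pair, three of the four response words are topical, and two of them ("chase", "cats") do not occur in the context, so a response that merely copied context words would score high on the first ratio but low on the second.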

| Models | Seq2Seq | HRED | VHRED | ReCoSa | ImgVAE |
|---|---|---|---|---|---|
| topic | 0.406 | 0.332 | 0.317 | 0.349 | 0.428 |
| novel-topic | 0.239 | 0.249 | 0.248 | 0.264 | 0.278 |

Table 4: Results on topical metrics.

Answer to Q3:

first of all, the quantitative evaluation in Table 2 indicates that removing the gate from the response generator (i.e., ImgVAE (w/o gate)) in general causes a performance drop on both tasks. Secondly, we find that when the semantics of a textual context becomes complicated (e.g., with more nouns and verbs, and thus the topics become diverse in open chat), it is usually too challenging to recover a quality image from the context. Then, the gate shrinks, making the effect of the latent image (e.g., the image context vector in Equation (11)) fade in generation. Figure 4(a) illustrates the distribution of average gate values of responses, where the x-axis represents bins of test dialogues according to the number of topical words the contexts contain (only 0.02% of the full 15M Reddit data do not contain a topical word; in the randomly sampled test data, the only 4 contexts without topical words are excluded from the analysis). Numbers below the bins indicate how many test dialogues fall in the corresponding bins. We observe a clear drop when the number of topical words in contexts increases. Another explanation is that a context with rich content can already provide enough information for response generation, and thus the latent image becomes marginal. Finally, Figure 4(b) compares gates for topical words and gates for stop words (179 in total, obtained with the NLTK toolkit) on test contexts with no more than 5 topical words. We find that stop words are less grounded by the latent images than topical words, even though the latent images are relatively useful on these examples.

Figure 4: Analysis on the effect of the gate.

5 Conclusions

We consider multi-modal response generation with both image-grounded dialogues and textual dialogues by recovering the latent image behind a textual dialogue with an image reconstructor. The reconstructor is jointly learned with a response generator within a conditional variational auto-encoding framework. Evaluation results indicate the efficacy of the proposed approach in both image-grounded conversation and text-based conversation.


  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) VQA: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, Cited by: §4.2.
  • K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111. External Links: Document, Link Cited by: §3.3.
  • H. Chu, D. Li, and S. Fidler (2018) A face-to-face neural conversation model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7113–7121. Cited by: §1.
  • A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra (2017) Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 326–335. Cited by: §2.
  • N. Dziri, E. Kamalloo, K. W. Mathewson, and O. Zaiane (2018) Augmenting neural response generation with context-aware topical attention. arXiv preprint arXiv:1811.01063. Cited by: §1, §4.1.
  • J. L. Fleiss and J. Cohen (1973) The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and psychological measurement 33 (3), pp. 613–619. Cited by: §4.1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §3.3.
  • C. Hori, H. Alamri, J. Wang, G. Wichern, T. Hori, A. Cherian, T. K. Marks, V. Cartillier, R. G. Lopes, A. Das, et al. (2019) End-to-end audio visual scene-aware dialog using multimodal attention-based video features. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2352–2356. Cited by: §1.
  • B. Huber, D. McDuff, C. Brockett, M. Galley, and B. Dolan (2018) Emotional dialogue generation using image-grounded language models. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 277. Cited by: §1, §4.2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.3.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §3.2.
  • H. Le, D. Sahoo, N. Chen, and S. Hoi (2019) Multimodal transformer networks for end-to-end video-grounded dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5612–5623. Cited by: §1.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2015) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119. Cited by: §1, §2, §4.1.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out. External Links: Link Cited by: §4.1.
  • A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: §3.3.
  • M. R. Morris, A. Zolyomi, C. Yao, S. Bahram, J. P. Bigham, and S. K. Kane (2016) With most of it being pictures now, i rarely use it: understanding twitter’s evolving accessibility to blind users. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 5506–5516. Cited by: §1.
  • N. Mostafazadeh, C. Brockett, B. Dolan, M. Galley, J. Gao, G. Spithourakis, and L. Vanderwende (2017) Image-grounded conversations: multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 462–472. Cited by: §1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. External Links: Link Cited by: §4.1.
  • T. Qiao, J. Zhang, D. Xu, and D. Tao (2019) MirrorGAN: learning text-to-image generation by redescription. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1505–1514. Cited by: §1, §2, §3.3, §3.3.
  • A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Nagar, et al. (2018) Conversational ai: the science behind the alexa prize. arXiv preprint arXiv:1801.03604. Cited by: §1.
  • S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396. Cited by: §2.
  • I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Vol. 16, pp. 3776–3784. Cited by: §2, §4.2.
  • I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. C. Courville, and Y. Bengio (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pp. 3295–3301. Cited by: §2, §4.1, §4.2.
  • L. Shang, Z. Lu, and H. Li (2015) Neural responding machine for short-text conversation. In ACL, pp. 1577–1586. Cited by: §2.
  • S. Sharma, D. Suhubdy, V. Michalski, S. E. Kahou, and Y. Bengio (2018) Chatpainter: improving text to image generation using dialogue. arXiv preprint arXiv:1802.08216. Cited by: §2.
  • H. Shum, X. He, and D. Li (2018) From eliza to xiaoice: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 10–26. External Links: Document, Link Cited by: §1.
  • K. Shuster, S. Humeau, A. Bordes, and J. Weston (2018) Engaging image chat: modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945. Cited by: §1, §1, §2, §4.1.
  • K. Sohn, X. Yan, and H. Lee (2015) Learning structured output representation using deep conditional generative models. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, Cambridge, MA, USA, pp. 3483–3491. External Links: Link Cited by: §3.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 3104–3112. External Links: Link Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §1.
  • O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §2.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §2.
  • C. Xu, W. Wu, C. Tao, H. Hu, M. Schuerman, and Y. Wang (2019) Neural response generation with meta-words. arXiv preprint arXiv:1906.06050. Cited by: §1.
  • T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324. Cited by: §1, §2, §3.3, §3.3.
  • H. Zhang, Y. Lan, L. Pang, J. Guo, and X. Cheng (2019) ReCoSa: detecting the relevant contexts with self-attention for multi-turn dialogue generation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 3721–3730. Cited by: §1, §4.2.
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915. Cited by: §2, §3.3.
  • R. Zhang, J. Guo, Y. Fan, Y. Lan, J. Xu, and X. Cheng (2018a) Learning to control the specificity in neural response generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1108–1117. Cited by: §1.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018b) Personalizing dialogue agents: i have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2204–2213. Cited by: §2.
  • H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu (2018a) Emotional chatting machine: emotional conversation generation with internal and external memory. In AAAI, pp. 730–738. Cited by: §2.
  • K. Zhou, S. Prabhumoye, and A. W. Black (2018b) A dataset for document grounded conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 708–713. Cited by: §2.

Appendix A Supplementary Material

Table 5 shows an example from the test set of Image-Chat. The response from ImgVAE is not only coherent with the textual context but also well grounded in the content of the image. In contrast, responses from the baseline models are generally poor in content and built from generic patterns such as “i don’t ” and “i ’m not sure ”. Table 6 gives an example from the test set of Reddit. Responses from the baseline models are either generic patterns (e.g., “i don’t know what you ’re talking about.” from Seq2Seq, and “i ’m not sure if you ’re joking or not” from VHRED) or irrelevant to the context (e.g., “what about pineapple?” from ReCoSa, and “i ’ve never had a bbq sauce.” from HRED). For ImgVAE, the left and the right are the latent images recovered by and respectively. Even though the quality of the latent images is limited (note that generating high-quality photo-realistic images is not the focus of this work), they still provide useful visual signals and encourage relevant content words such as “meat” to appear in the response.

A: I would love to play and chase these kids around with those toys.
B: I love toys like this using them makes me so happy and chilled out.
T&I: i do n’t like it .
ImgRG: i would love to be there .
T&I (w/ T): i ’m not sure if that ’s true .
ImgRG (w/ T): i love toys !
ImgVAE: it would be so much fun to take a picture with them .
Table 5: Case study for image-grounded conversation.

A: what toppings do you put on your pizza?
B: i ’m cool with any meat really. bacon, sausage, ham.
Seq2Seq: i don’t know what you ’re talking about .
HRED: i ’ve never had a bbq sauce .
VHRED: i ’m not sure if you ’re joking or not .
ReCoSa: what about pineapple ?
ImgVAE: i ’m a meat lover .
Table 6: Case study for text-based conversation.
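
The qualitative observation about Table 6 can be made a little more concrete with a small script. The sketch below is only an illustration and not part of our evaluation protocol: the stopword list, the tokenization, and the helper names `content_words` and `shared_content_words` are assumptions introduced here. It checks which responses in Table 6 share a content word with the dialogue context.

```python
# Illustrative sketch only: the stopword list and the overlap measure are
# assumptions made for this example, not the metrics used in the paper.
STOPWORDS = {
    "i", "a", "an", "the", "you", "your", "do", "on", "if", "or", "not",
    "with", "any", "what", "about", "know", "re", "m", "ve", "don", "t",
    "never", "had", "put", "really", "sure", "cool", "talking", "joking",
}


def content_words(text):
    """Lowercase the text, replace punctuation with spaces, drop stopwords."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return {tok for tok in cleaned.split() if tok not in STOPWORDS}


def shared_content_words(context, response):
    """Content words that appear in both the context and the response."""
    return content_words(context) & content_words(response)


# Context and responses copied from Table 6.
context = ("what toppings do you put on your pizza? "
           "i 'm cool with any meat really. bacon, sausage, ham.")
responses = {
    "Seq2Seq": "i don't know what you 're talking about .",
    "HRED": "i 've never had a bbq sauce .",
    "VHRED": "i 'm not sure if you 're joking or not .",
    "ReCoSa": "what about pineapple ?",
    "ImgVAE": "i 'm a meat lover .",
}

for model, response in responses.items():
    print(model, sorted(shared_content_words(context, response)))
```

Under these assumptions, only the ImgVAE response shares a content word (“meat”) with the context; the baseline responses share none. Lexical overlap is a rough proxy for relevance, so this should be read as a sanity check on the case study rather than as a measurement.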