Multimodal Dialogue Response Generation

by   Qingfeng Sun, et al.

Responding with an image has been recognized as an important capability for an intelligent conversational agent. Yet existing works focus only on multimodal dialogue models that depend on retrieval-based methods, neglecting generation methods. To fill this gap, we first present a multimodal dialogue generation model which takes the dialogue history as input and generates a textual sequence or an image as response. Learning such a model often requires multimodal dialogues containing both texts and images, which are difficult to obtain. Motivated by this challenge in practice, we consider multimodal dialogue generation under the natural assumption that only limited training examples are available. In such a low-resource setting, we devise a novel conversational agent, Divter, which isolates the parameters that depend on multimodal dialogues from the rest of the generation model. By this means, the major part of the model can be learned from a large number of text-only dialogues and text-image pairs respectively, and then the whole set of parameters can be well fitted using the limited training examples. Extensive experiments demonstrate that our method achieves state-of-the-art results in both automatic and human evaluation, and can generate informative text and high-resolution image responses.






1 Introduction

Figure 1: An example of a conversation between a human and an agent. Both the text and the image responses in green boxes are generated by a multimodal dialogue response generation model.

With the development of instant messaging technology in recent decades, the medium of online conversation has expanded from pure text to a variety of visual modalities (e.g., images, gif animations, short videos). As when communicating through messenger tools (e.g., Facebook, WhatsApp, WeChat), a good intelligent conversational agent should not only converse freely with plain text, but also have the ability to perceive and share the real physical world, which is an important pursuit of artificial intelligence (AI). Although some recent large-scale pre-trained text-only dialogue generation models, such as DialoGPT Zhang et al. (2020), Blender Roller et al. (2021), and Meena Adiwardana et al. (2020), have shown excellent performance, they still cannot rely exclusively on plain text to fully simulate the rich experience of visual perception.

Recently, various vision-language tasks have been introduced and have attracted widespread attention, such as visual question answering Ren et al. (2015); Gao et al. (2015); Lu et al. (2016); Anderson et al. (2018); Li et al. (2019a); Huang et al. (2020), image captioning Xu et al. (2015); Anderson et al. (2016); Ghanimifard and Dobnik (2019); Cornia et al. (2020), image-grounded dialogue Das et al. (2017); Yang et al. (2021); Agarwal et al. (2020); Qi et al. (2020); Chen et al. (2021); Liang et al. (2021), and photo sharing Zang et al. (2021). Specifically, photo sharing, which aims to select and share an image based on the textual context, is a challenging task that requires models to understand a background story completed by human imagination, rather than to locate related visual objects or explicitly mention the main visible content of an image, as previous works do. Zang et al. (2021) propose a retrieval-based method to resolve this challenge. However, the performance of retrieval-based methods in specific domains is limited by the size of the pre-constructed conversational history repository, especially for long-tail contexts not covered in the history; the set of image responses of a retrieval system is also fixed. A better way to produce an image reply appropriate to the context is to generate a new one accordingly. With the prosperity of pre-trained networks powered by deep learning, image generation models have developed good flexibility and high-resolution quality Ramesh et al. (2021); Esser et al. (2021); Ding et al. (2021).

In this paper, we formulate a new problem: Multimodal Dialogue Response Generation (MDRG). As in the conversation shown in Figure 1, given the dialogue context, the model should decide whether to generate an informative text or a high-resolution image as response. We argue that there are still hindrances to applying MDRG in real scenarios, since (1) a sophisticated neural end-to-end architecture will simply overfit to the very few well-annotated training examples (e.g., the few existing corpora of about 8k multimodal dialogues); as evidence, when discussing contents outside the domain of the training data, its performance drops dramatically, as will be seen in our experiments; and (2) as human effort is expensive, it is difficult to collect enough training data for a new domain. Based on the above facts, we take a step further and extend the assumption of MDRG to a low-resource setting where only a few multimodal dialogues containing both texts and images are available.

To tackle the above challenges, we make the first attempt to incorporate text-to-image generation into text-only open-domain dialogue generation. The key idea is to make the parameters that rely on multimodal dialogues small and independent by disentangling textual dialogue response generation and text-to-image generation, so that the major part of the generation model can be learned from text-only dialogues and (text description, image) pairs, which are much easier to obtain. Specifically, we present Divter, a novel conversational agent powered by large-scale visual-world experiences. As shown in Figure 2, Divter is made up of two Transformer-based Vaswani et al. (2017a) components: a textual dialogue response generator and a text-to-image generator. The textual dialogue response generator takes the dialogue context as input, determines the modality of the following response, and generates either a textual response or a textual description of an image. The text-to-image generator takes the textual image description as a condition and generates a realistic and consistent high-resolution image as the final response. The two components are independent of each other, and thus can be pre-trained using text-only dialogues and (text description, image) pairs respectively. The end-to-end Divter depends on multimodal dialogues constructed as (text dialogue context, text response / (text description, image)) tuples, but the joint learning and estimation of the two components require only a few training examples from the specific domain or task. By fine-tuning the pre-trained parameters, we can adapt the model to a new domain at little cost.

Contributions of this work are three-fold:

  • To the best of our knowledge, this is the first work on multimodal dialogue response generation. We explore the task under a low-resource setting where only a few multimodal dialogues are assumed to be available.

  • We present Divter, a novel conversational agent consisting of two flexible components, which can effectively understand dialogue context and accordingly generate informative text and high-resolution image responses.

  • Extensive experiments on PhotoChat Corpus Zang et al. (2021) indicate the effectiveness of Divter.

Figure 2: The overview of our multimodal dialogue response generation model. The Textual Dialogue Response Generator takes the dialogue context as input and generates a text response or a textual image description (e.g., “a brown and beautiful kitten lies on the bed.”). The “DST” tag means that the next text sequence is a description. With the description as a condition, the Text-to-Image Transformer generates an image code, which the Image Decoder reconstructs into a realistic and consistent high-resolution image as the dialogue response.

2 Related Work

2.1 Textual Dialogue Response Generation

End-to-end response generation for textual open-domain dialogues is inspired by the successful application of neural sequence-to-sequence models to machine translation Sutskever et al. (2014). On top of the basic architecture Shang et al. (2015); Vinyals and Le (2015), the vanilla encoder-decoder method has been widely extended to address key challenges in open-domain dialogue systems, including improving the diversity of responses Li et al. (2016a); Zhao et al. (2017); Tao et al. (2018), modeling conversation contexts Serban et al. (2016); Xing et al. (2017); Zhang et al. (2019), controlling attributes of responses See et al. (2019); Zhou et al. (2018), biasing responses toward specific personas Li et al. (2016b); Zhang et al. (2018), incorporating extra knowledge into generation Dinan et al. (2019); Ghazvininejad et al. (2018); Kim et al. (2020), and building general pre-trained agents Adiwardana et al. (2020); Zhang et al. (2020); Roller et al. (2021). Different from previous works on open-domain dialogue response generation that converse freely with plain text, our work lies in the research of multimodal response generation, where either an image response or a text response is generated for a given dialogue context.

2.2 Text-to-Image Generation

In the research of text-to-image generation, various approaches have been extensively studied. Mansimov et al. (2016) showed that the DRAW generative model Gregor et al. (2015) could generate images from natural language descriptions. Reed et al. (2016) proposed a generative adversarial network to improve image fidelity. Subsequent methods continued to optimize the generation architecture, e.g., with stacked generators Zhang et al. (2017), attentional networks Xu et al. (2018), and extra knowledge Li et al. (2019b). Nguyen et al. (2017) provided a unified probabilistic interpretation of related activation-maximization methods to produce high-quality images at higher resolutions. Separately, Cho et al. (2020) use uniform masking with a large range of masking ratios and align the right pre-training datasets to the right objectives. More recently, Ramesh et al. (2021) proposed a transformer-based method that autoregressively models the text and image tokens as a single stream of data; their model was pre-trained on a large-scale dataset of 250 million text-image pairs and has shown competitive performance with previous domain-specific models when evaluated in a zero-shot fashion. For our multimodal response generation scenario, we use the text description of an image to bridge the textual dialogue generation and text-to-image generation models, where the description is the output of the former and the input of the latter, in a low-resource setting.

In the following sections, we elaborate our approach to learning a multimodal response generation model with multimodal dialogues, text-only dialogues, and (text description, image) pairs.

3 Problem Formalization

Suppose that we have a multimodal dialogue dataset $\mathcal{D}_S=\{(U_i,R_i)\}_{i=1}^{n}$, where $U_i$ is the context of the $i$-th dialogue (a sequence of utterances) and $R_i$ is the response regarding $U_i$. Specifically, $R_i$ has two types: a text $r_i$, or a pair $(c_i, v_i)$, where $v_i$ is an image and $c_i$ is its text description. In addition to $\mathcal{D}_S$, we further assume access to a text-only dialogue dataset $\mathcal{D}_C$ whose elements are (dialogue context, response) pairs, and a dataset $\mathcal{D}_P$ of (text description, image) pairs, where $|\mathcal{D}_C| \gg |\mathcal{D}_S|$ and $|\mathcal{D}_P| \gg |\mathcal{D}_S|$.

The goal is to learn a generation model $P(R \mid U; \theta)$ (where $\theta$ denotes the parameters of the model) with $\mathcal{D}_S$, $\mathcal{D}_C$, and $\mathcal{D}_P$. Thus, given a new dialogue context $U$, one can generate a text response or an image response following $P(R \mid U; \theta)$.

4 Approach

Our idea is inspired by an observation on the nature of real-world open-domain multimodal dialogues: (1) although a multimodal dialogue contains an image somewhere, the utterances in the dialogue are not always related to that image; (2) the main content and semantics of an image can often be described by a synonymous text. Therefore, given a dialogue context, we postulate that the formation of a response can be decomposed into four uncorrelated actions: (i) determining whether to generate a text response or an image response; (ii) if text, generating the text response; if image, (iii) generating a text description of the image; and (iv) generating the synonymous image response with the text description as a condition. All the actions can be learned independently, which is the key to aiding the small multimodal dialogue dataset with the large text-only dialogue and (text description, image) datasets.
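The four-action decomposition above can be sketched as a simple routing function. This is only an illustration: the component names, the callback signatures, and the exact tag format are placeholders, not the paper's actual implementation.

```python
DST = "[DST]"  # illustrative tag marking that the next sequence is an image description

def respond(context, textual_generator, image_generator):
    """Route a dialogue context through the two-component pipeline.

    textual_generator: context -> a plain text response, or DST followed by
                       an image description (actions (i)-(iii)).
    image_generator:   description -> image (action (iv)).
    """
    out = textual_generator(context)
    if out.startswith(DST):                   # action (i): modality decision
        description = out[len(DST):].strip()  # action (iii): textual description
        return ("image", image_generator(description))
    return ("text", out)                      # action (ii): plain text reply
```

Because the two callbacks never see each other's internals, each can be trained on its own large dataset, which is precisely what enables the low-resource setting.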

Figure 2 illustrates the architecture of our model, which is made up of two components: a textual dialogue response generator and a text-to-image generator. In the rest of this section, we elaborate on these two modules in detail.

4.1 Textual Dialogue Response Generator

The textual dialogue response generator (for actions (i)-(iii)) is a sequence-to-sequence model based on the Transformer architecture Vaswani et al. (2017b); it consists of a 24-layer Transformer with a hidden size of 1024 and 16 heads. Specifically, given a dialogue context $U$ as source, if the target is a text response $r = (r_1, \dots, r_T)$ with $r_t$ the $t$-th word in the sequence, the generation loss is defined by

$$\mathcal{L}_r = -\sum_{t=1}^{T} \log p(r_t \mid U, r_{<t}),$$

and if the target is a textual image description $c = (c_1, \dots, c_T)$ prefixed with a tag DST, which means that the next text sequence is a description, the generation loss is defined by

$$\mathcal{L}_c = -\sum_{t=1}^{T} \log p(c_t \mid U, c_{<t}).$$
Inference   Given a new dialogue context, when a generated description occurs, it is fed into the following text-to-image generator and reconstructed into an image as the final response.
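The losses above are the standard autoregressive negative log-likelihood, which a minimal sketch makes concrete. Here the per-token probabilities are assumed to have already been produced by the Transformer; computing them is the model's job and out of scope for this illustration.

```python
import math

def generation_loss(target_token_probs):
    """Negative log-likelihood: L = -sum_t log p(r_t | U, r_<t).

    `target_token_probs` stands for the probabilities the model assigns to
    the gold target tokens, one per position in the target sequence.
    """
    return -sum(math.log(p) for p in target_token_probs)
```

A perfectly confident model (all probabilities 1.0) yields zero loss; lower probabilities on the gold tokens increase it.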

4.2 Text-to-Image Generator

The text-to-image generator (for action (iv)) consists of an image representation module and a text-to-image transformer module.

4.2.1 Image Representation Modeling

Inspired by the success of DALL-E Ramesh et al. (2021) and VQGAN Esser et al. (2021), to utilize the highly expressive transformer architecture for image generation, we need to express the constituents of an image in the form of a sequence.

Inheriting from Esser et al. (2021), we use a learned discrete codebook $\mathcal{Z} = \{z_k\}_{k=1}^{K} \subset \mathbb{R}^{d}$ to represent an image, such that any image $v$ of height $H$ and width $W$ can be represented by a spatial collection of codebook entries $z_q \in \mathbb{R}^{h \times w \times d}$, where $d$ is the dimensionality of the codes. By this means, the equivalent representation of an image is a sequence of indices which specify the respective entries in the learned, discrete codebook $\mathcal{Z}$. Then, we take a convolutional model consisting of an encoder $E$ and a decoder $G$, trained together to learn to represent images with codes from $\mathcal{Z}$. The function of $\mathcal{Z}$ is similar to that of word embedding representations Mikolov et al. (2013); Pennington et al. (2014); Peters et al. (2018) in natural language processing (NLP). This helps the model to align cross-modal representations between text and visual regions.

Specifically, given an image $v$, we use the encoder $E$ to produce its spatial representation $\hat{z} = E(v) \in \mathbb{R}^{h \times w \times d}$; then, under the element-wise quantization $\mathbf{q}(\cdot)$, each spatial code $\hat{z}_{ij}$ is mapped to its closest codebook entry $z_k$:

$$z_q = \mathbf{q}(\hat{z}) := \Big( \arg\min_{z_k \in \mathcal{Z}} \| \hat{z}_{ij} - z_k \| \Big)_{ij} \in \mathbb{R}^{h \times w \times d}.$$

Then, the reconstruction can be approximated by

$$\hat{v} = G(z_q) = G\big(\mathbf{q}(E(v))\big).$$

We model $E$, $\mathcal{Z}$, and $G$ together as the image representation module; the learning details can be found in Esser et al. (2021), who also adopt a sliding-window approach to generate high-resolution (e.g., 256 × 256 pixels) images.
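The element-wise quantization step can be sketched in a few lines. In a real VQGAN the codebook is learned jointly with the encoder and decoder; here it is just a given list of vectors, purely for illustration.

```python
def quantize(spatial_codes, codebook):
    """Map each spatial code to the index of its nearest codebook entry
    (Euclidean distance), i.e. the element-wise quantization q(.)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(code, codebook[k]))
            for code in spatial_codes]
```

The returned index sequence is exactly the discrete representation that the text-to-image transformer of the next subsection models autoregressively.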

4.2.2 Text-to-Image Transformer

The text-to-image transformer is also a sequence-to-sequence model based on the Transformer architecture; it consists of a 24-layer Transformer with a hidden size of 1024 and 16 attention heads. Specifically, consider an image $v$ and its text description $c$. With $E$ and $\mathcal{Z}$ available, we can represent $v$ in terms of the codebook indices of its embeddings: the quantized embedding $z_q = \mathbf{q}(E(v))$, given by the quantization step above, is equivalent to a sequence of indices

$$s = (s_1, \dots, s_{h \cdot w}), \qquad s_{ij} = k \ \text{ such that } \ (z_q)_{ij} = z_k,$$

from the codebook $\mathcal{Z}$; $z_q$ can thus be recovered by mapping the indices back to their corresponding codebook entries. Then we concatenate the tokenized description $c$ and the index sequence $s$ into a single stream

$$x = [c; s]$$

and train an autoregressive transformer to model the joint distribution over the text and image tokens, with the generation loss defined by

$$\mathcal{L}_{T} = -\sum_{i} \log p(x_i \mid x_{<i}).$$
Inference   Given a new text description $c$, we leverage the text-to-image generator to reconstruct a synonymous image $v$. More precisely, the text-to-image transformer first takes $c$ as input and autoregressively generates the image index sequence; the indices are then mapped to their corresponding codebook entries in $\mathcal{Z}$, and the image decoder $G$ reconstructs them into an image $v$.
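This inference pipeline can be sketched end to end with the transformer, codebook, and decoder stubbed out as placeholders (the real components are large neural networks; the function names here are assumptions for illustration only).

```python
def text_to_image(description_tokens, transformer, codebook, image_decoder):
    """Condition on the description, sample image-code indices, look the
    indices up in the codebook, and decode the entries back into pixels."""
    stream = list(description_tokens)         # condition c
    indices = transformer(stream)             # autoregressively generated code indices
    entries = [codebook[i] for i in indices]  # indices -> codebook entries z_q
    return image_decoder(entries)             # reconstruct the image from z_q
```

The only interface between the two halves of the pipeline is the index sequence, which is what makes the image representation module freely swappable.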

4.3 Learning Details

Let us denote by $\theta_D$, $\theta_V$, and $\theta_T$ the parameters of the textual dialogue response generator, the image representation module, and the text-to-image transformer module, respectively. In the pre-training stage, we use textual dialogues $\mathcal{D}_C$ to estimate $\theta_D$, use ImageNet Deng et al. (2009) to estimate $\theta_V$, and use (text description, image) pairs $\mathcal{D}_P$ to estimate $\theta_T$. Then, in the fine-tuning stage for the MDRG scenario, we fix $\theta_V$ and jointly fine-tune $\theta_D$ and $\theta_T$ with the multimodal dialogues $\mathcal{D}_S$; thus, the final objective of our multimodal response generation model is to minimize the integrated loss

$$\mathcal{L} = \mathcal{L}_{D} + \lambda\,\mathcal{L}_{T},$$

where $\mathcal{L}_D$ is the textual generation loss, $\mathcal{L}_T$ is the text-to-image generation loss, and $\lambda$ is a hyperparameter.
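Assuming the integrated loss takes the weighted-sum form described above (the component losses arrive as scalars; the argument names are illustrative), a minimal sketch:

```python
def integrated_loss(loss_text_generator, loss_text_to_image, lam=0.2):
    """Weighted sum of the two component losses; `lam` plays the role of the
    hyperparameter in the integrated objective (0.2 in the experiments)."""
    return loss_text_generator + lam * loss_text_to_image
```

With a small `lam`, gradients from the scarce multimodal examples mostly adjust the textual generator, while the pre-trained text-to-image side is perturbed only gently.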

Remarks.   In this work, we mainly focus on integrating text and image response generation, but our proposed approach actually provides a recipe for a general solution to low-resource multimodal dialogue response generation in which the target modality (if not text) could be gifs, videos, speech sounds, etc. To do that, one only needs to modify the text-to-image generator to be compatible with the specific modality type, and pre-train a new text-to-<target modality> generator.

5 Experiments

5.1 Dataset

To evaluate the performance of Divter, we conduct comprehensive experiments on the PhotoChat dataset released by Zang et al. (2021), a multi-turn multimodal conversation dataset consisting of 10,917 unique images and 12,286 dialogues, each of which is paired with a user image shared during the conversation; each image is paired with its text description. The dialogues fall into four categories: people, food, animals, and daily products. The dataset is split into 10,286 train, 1,000 dev, and 1,000 test dialogues.

5.2 Evaluation Metrics

We conduct evaluation with both automatic metrics and human judgements. For automatic evaluation, we focus on four aspects: (1) Image Response Intent Prediction, whose goal is to predict whether an image should be produced in the next turn given the context; (2) Text Description Generation; (3) Image Generation Quality; (4) Text Response Generation. For (1), we follow Zang et al. (2021), who formulate the problem as a binary classification task, and use F1 as the metric; for (2) and (4), we use PPL, BLEU (Papineni et al., 2002), Rouge (Lin, 2004), and F1; for (3), we follow Ramesh et al. (2021) and use Fréchet Inception Distance (FID) and Inception Score (IS).
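As a small illustration of metric (1), binary F1 over intent predictions can be computed as follows (a sketch; the evaluation code used in the paper may differ):

```python
def binary_f1(preds, golds):
    """F1 for binary labels: harmonic mean of precision and recall,
    with 1 = 'an image should be produced in the next turn'."""
    tp = sum(1 for p, g in zip(preds, golds) if p and g)
    fp = sum(1 for p, g in zip(preds, golds) if p and not g)
    fn = sum(1 for p, g in zip(preds, golds) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```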

| Models | Intent F1 | Desc. PPL | Desc. B-1 | Desc. B-2 | Desc. Rouge | FID | IS | Resp. PPL | Resp. B-1 | Resp. B-2 | Resp. Rouge |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT-base | 53.2 | – | – | – | – | – | – | – | – | – | – |
| T5-3B | 58.9 | – | – | – | – | – | – | – | – | – | – |
| Divter | 56.2 | 5.12 | 15.08 | 11.42 | 15.81 | 29.16 | 15.8 ± 0.6 | 59.63 | 6.52 | 1.66 | 5.69 |
| Divter (w/o textual generator pre-train) | 47.3 | 122.56 | 1.99 | 0.23 | 2.60 | 29.78 | 15.5 ± 0.5 | 153.62 | 4.82 | 0.53 | 3.83 |
| Divter (w/o text-to-image pre-train) | 55.9 | 5.23 | 15.01 | 11.20 | 15.63 | 262.09 | 4.9 ± 0.7 | 63.76 | 6.28 | 1.51 | 5.40 |
| Divter (w/o both pre-train) | 47.1 | 128.87 | 1.75 | 0.21 | 2.38 | 254.31 | 5.2 ± 0.6 | 163.85 | 4.53 | 0.48 | 3.55 |
| Divter (w/o joint learning) | 55.6 | 5.20 | 15.00 | 11.36 | 15.73 | 29.04 | 15.4 ± 0.6 | 59.21 | 6.47 | 1.58 | 5.63 |

Table 1: Automatic evaluation results of Divter and baselines on the test set; columns are grouped as Intent, Text Description Generation, Image Generation, and Text Response Generation. (w/o joint learning) means fine-tuning the two generators separately rather than using the integrated loss of Section 4.3. Numbers in bold mean that the improvement over the best baseline is statistically significant (t-test with p-value < 0.01).
| Models | Context Coherence | Fluency | Image Quality | Background Consistency | Kappa |
|---|---|---|---|---|---|
| SCAN (Retrieval) | – | – | 1.95 | 0.96 | 0.65 |
| Divter (w/o pre-train) | 0.94 | 1.56 | 0.61 | 0.35 | 0.66 |
| Divter | 1.59 | 1.95 | 1.83 | 1.61 | 0.63 |

Table 2: Human evaluation results.
Table 3: Qualitative assessment of Divter and its variants for image response generation with the same textual dialogue context as input, on the PhotoChat test set. 1st column: Divter. 2nd and 3rd columns: Divter variants with the pre-training of one component removed.

For human evaluation, we randomly sample 200 dialogue contexts and the corresponding generated responses (100 text responses and 100 image responses) from PhotoChat, for Divter and its variants without any pre-training. Three human annotators are asked to score response quality on a scale of {0, 1, 2} from four aspects: (1) Context Coherence: whether the text response is coherent with the context and guides the following utterances; (2) Fluency: whether the text response is natural and fluent; (3) Image Quality: the quality (including definition and integrity) of the image response; (4) Background Consistency of the Image: for each dialogue, we select the top-8 generated/retrieved images and ask the annotators to decide whether the group is consistent with the dialogue background; a qualitative assessment is also shown in Table 5. We report the average scores over the three annotators; higher scores are better. Agreement among the annotators is measured via Fleiss' Kappa Fleiss (1971).

5.3 Implementation Details

For the textual dialogue response generator, we use DialoGPT Zhang et al. (2020) as the pre-trained model initialization; it has been trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017. In the fine-tuning stage, we concatenate the context turns with the token [SEP] into a single sequence; we adopt the Adam optimizer with an initial learning rate of 1e-5 and a batch size of 256; training on PhotoChat is conducted on 16 Nvidia Tesla V100 32G GPU cards. We use beam search (size = 5) to decode the text sequence.

For the image representation module, we inherit the model released by Esser et al. (2021), which has been trained on 10M ImageNet images; the model enables image generation beyond a resolution of 256 × 256 pixels.

For the text-to-image transformer, we randomly select 5M (categorical text description, image) pairs from ImageNet and 10M (text description, image) pairs from YFCC100M Thomee et al. (2016) as training data. We set the text description length to 32 and the codebook-index length to 256 (16 × 16), and concatenate them with [SEP] into a sequence of length 289. We then pre-train the model for 3.5 million steps with a batch size of 256, accumulated on 16 Nvidia Tesla V100 32G GPU cards. In the fine-tuning stage, we train on PhotoChat for 50,000 steps. In the inference stage, we use CLIP Radford et al. (2021) to rerank the 256 generated samples: CLIP assigns a score based on how well the image matches the description, and we select the best one as the response.
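The rerank-and-select step reduces to a scored argmax over the candidates. In this sketch, `score_fn` is a stand-in for the real CLIP image-text similarity, which is not reimplemented here:

```python
def rerank_best(candidate_images, description, score_fn):
    """Return the candidate whose match with the description scores highest;
    `score_fn(image, description)` plays the role of CLIP similarity."""
    return max(candidate_images, key=lambda image: score_fn(image, description))
```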

In the joint learning scenario, we first train the components separately for 48,000 steps, then jointly train the textual dialogue response generator and the text-to-image transformer for 2,000 steps. The hyperparameter $\lambda$ of the integrated loss (Section 4.3) is 0.2. We discard sentences with the prefix “The photo has your *” in descriptions. Early stopping on validation is adopted as a regularization strategy. All the hyperparameters are determined by grid search.

Example 1:
A: OMG…the new ice cream shop is amazing. ……
A: I had the twist chocolate and vanilla but it was so fresh tasiting. like you just made it.
B: I call it the malado gilato.
A: Sam wouldn’t let me have another lick bc he thought I’d eat it all.
D: That sounds interesting.
D: Yes, could you please share it with me?
D: Objects in the photo: Chocolate Ice cream, Dairy, Drink.
D: (image)
D: Wow! The ice cream looks so delicious.
D: Sure, it tastes pretty good.

Example 2:
A: Have you been out in nature lately?
B: Yes. ……
A: I’m sitting at home now looking through some old photographs.
B: I see. than.
A: Would you like to see one of my favorites? It’s a cool shot of a honey bee near a beautiful flower.
D: Objects in the photo: Honey bee, Insect, Animal, Flower.
D: (image)
D: It is a nice picture. Thank you for sharing.
D: Haha, just enjoy the beautiful scenery.
D: Yeah, definitely.
Table 4: Examples from the PhotoChat test set. In each example, the turns prefixed with “A”/“B” are the given context; the blue text is the text description generated by Divter; the left image and the red responses are generated by Divter; the right image is the ground-truth image.

5.4 Baselines

The following two pre-trained models are selected as baselines for the “Image Response Intent Prediction” task of Section 5.2:

BERT-base BERT Devlin et al. (2019) is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

T5-3B T5 Raffel et al. (2020) is a pretrained model which has achieved remarkable performance in many NLP tasks.

We adopt the above models as baselines and use the results reported by Zang et al. (2021).

SCAN, proposed by Lee et al. (2018), is a full cross-attention model that captures the fine-grained interplay between image regions and text tokens to infer image-text similarity; SCAN achieves state-of-the-art performance on the “Image Retrieval” task on PhotoChat.

5.5 Evaluation Results

As shown in Table 1, our Divter not only achieves performance comparable to the state-of-the-art retrieval-based image response intent prediction models, but also achieves remarkable performance on all the generation metrics. This indicates that Divter can accurately judge when to generate an image response for a given dialogue context, produce text responses coherent with the context, and generate high-quality image responses. The significant performance gap between Divter and the models without pre-training indicates the superiority of our proposed learning strategy. Table 2 reports the results of human evaluation; Divter also significantly outperforms the baselines on most aspects.

Generated / retrieved images (images omitted; recoverable captions below):

  • Generation (generated description: “objects in the photo: animal, dog, carnivore, grassland”) vs. Retrieval (should contain “dog”).
  • Generation (generated description: “objects in the photo: coffee cup, drink, bottle, mug, tea.”) vs. Retrieval (should contain “coffee cup”).
  • Generation (generated description: “objects in the photo: curtain.”) vs. Retrieval (should contain “curtain”).

Table 5: Examples of the images generated by Divter and the images retrieved by SCAN.

5.6 Ablation Study

We conduct extensive ablation experiments over different variants to better understand their relative importance to the MDRG task. As shown in Table 1, all the variants lead to worse performance on most of the metrics. For a more intuitive comparison, qualitative assessment results are also shown in Table 3. Both the quantitative and qualitative results of the ablation study validate that: (1) pre-training is crucial to low-resource multimodal dialogue response generation, since removing any component from pre-training causes a performance drop when the training data is small; (2) removing the text-to-image generator's pre-training has the greater impact on image generation performance, while removing the textual dialogue response generator's pre-training has the greater impact on text generation performance; (3) joint learning also contributes to Divter, indicating that the integrated learning of textual context and visual image through the description benefits both components more than training either one alone.

5.7 Case Study

To further investigate the quality of the multimodal responses generated by Divter, we show two examples from the PhotoChat test data in Table 4. The context of the first is about “ice cream”, and that of the second is about a “honey bee”. As we can see, Divter can not only generate a realistic high-resolution image that is coherent with the background, but also generate informative text responses grounded on the image. Moreover, the quality of the generated images is comparable to that of the real-world ground truths, which demonstrates the practicability of Divter.

5.8 Discussions

Benefits over retrieval-based methods   To further compare the generalization capability of Divter and the retrieval-based method, we sample the top-8 generated images from Divter and the equivalent retrieved images from the SCAN model. As shown in Table 5, on the one hand, the diversity and richness of the generated images are desirable; on the other hand, the retrieved results often suffer from inconsistency with the dialogue background. For example, in the second case, the dialogue is about “coffee”, but the retrieved images contain uncorrelated objects such as “milk”, “cake”, “dog”, and “snack”. In the third example, all the retrieval results are mistaken, since there is little “curtain” content in the training and retrieval space. This demonstrates that the performance of retrieval-based methods is extremely limited in specific domains by the size of the pre-constructed conversational history repository, especially in the low-resource scenario. Our proposed generation-based method shows better generalization capability to tackle the low-resource challenge.

6 Conclusion

In this paper, we explore multimodal dialogue response generation under a low-resource setting. To overcome the challenges of the new task and insufficient training data, we propose Divter, a neural conversational agent which incorporates text-to-image generation into text-only dialogue response generation, in which most parameters no longer rely on the multimodal training data and can instead be estimated from large-scale textual open-domain dialogues and (text description, image) pairs. Extensive experiments demonstrate that Divter achieves state-of-the-art results in automatic and human evaluation. In the future, we will explore more efficient methods to inject more modalities into response generation.