OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts

09/27/2021 ∙ by Shuhe Wang, et al.

To better simulate real human conversation, models need to generate dialogue utterances based not only on preceding textual contexts but also on visual contexts. However, as multi-modal dialogue learning develops, dataset scale has gradually become a bottleneck. In this report, we release OpenViDial 2.0, a larger-scale open-domain multi-modal dialogue dataset than the previous version, OpenViDial 1.0. OpenViDial 2.0 contains a total of 5.6 million dialogue turns extracted from movies and TV series from different sources, and each dialogue turn is paired with its corresponding visual context. We hope this large-scale dataset can help facilitate future research on open-domain multi-modal dialog generation, e.g., multi-modal pretraining for dialogue generation.




1 Introduction

Developing open-domain dialogue agents is of growing interest (Li et al., 2017; Ghazvininejad et al., 2017; Zhou et al., 2017; Gao et al., 2018; Asghar et al., 2018; Han et al., 2020a; Zhou et al., 2020). Existing methods for developing effective open-domain dialogue agents mostly follow a two-step pipeline: (1) collecting a large-scale dataset containing a massive number of dialog turns from real conversations, and (2) training a neural model to generate high-quality responses given the previous dialogue contexts (Li et al., 2016b, a; Zhang et al., 2018; Huang et al., 2020).

Since most methods are data-driven, a large-scale, high-quality open-domain dialogue dataset is arguably the first thing to consider before designing a model. Meng et al. (2020) released the OpenViDial dataset, which contains a total of 1.1 million dialogue turns, with utterances paired with visual contexts. Recent work has leveraged the OpenViDial dataset to build effective multi-modal dialog models (Wang et al., 2021), demonstrating that learning multi-modal features gives rise to higher response quality.

In this report, we collect additional data and extend OpenViDial, releasing OpenViDial 2.0, a much larger-scale open-domain dialogue dataset with visual contexts. As in the prior version, OpenViDial 1.0 (Meng et al., 2020), the dialogue turns and visual contexts in OpenViDial 2.0 are extracted from movies and TV series, and each dialogue turn is paired with the corresponding visual context in which it takes place. OpenViDial 2.0 contains a total of 5.6 million dialogue turns along with 5.6 million visual contexts stored as images, about five times the size of OpenViDial 1.0 (1.1 million turns). We hope this large-scale dataset can help facilitate future research on open-domain multi-modal dialog generation, e.g., multi-modal pretraining for dialogue generation.

2 Related Work

2.1 Open Domain Dialog Datasets

Textual Dialog Datasets

Because open-domain dialog generation has been studied for many years, there are various open-domain dialog datasets that consist only of textual information. For simulating movie conversations, there are the OpenSubtitles dataset (Tiedemann, 2009; Lison and Tiedemann, 2016) and the Cornell Movie-Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011). The OpenSubtitles dataset is a large-scale dataset containing a total of 3.35G sentence fragments extracted from the OpenSubtitles website, while the Cornell Movie-Dialogs Corpus contains a collection of movie conversations extracted from raw movie scripts. For simulating social conversations, there are PersonaChat (Zhang et al., 2018) and the Twitter Triple Corpus (Sordoni et al., 2015). The Twitter Triple Corpus consists of 4,232 Twitter conversation triples selected from 33K candidate triples by human raters. Other datasets, such as the Ubuntu Dialog Corpus (Lowe et al., 2015) and EmpatheticDialogues (Rashkin et al., 2018), are also commonly used for textual open-domain dialog generation.

Visual Dialog Datasets

A number of datasets containing visual features have been developed since the Visual Dialog task was first introduced by Das et al. (2017a), in which a model is required to answer questions given a dialog history and the image itself as context. For this task, Das et al. (2017a) released the VisDial v0.9 and v1.0 datasets, which contain 120K images from MSCOCO, each associated with 10 rounds of question-answer dialog. Other datasets, such as the GuessWhat?! dataset (de Vries et al., 2017), the CLEVR-Dialog dataset (Kottur et al., 2019), the MNIST-Dialog dataset (Seo et al., 2017) and the Audio Visual Scene-Aware Dialog (AVSD) dataset (Hori et al., 2018; Alamri et al., 2019), focus more on answering questions about an image or video than on dialogue generation with visual contexts. The OpenViDial dataset (Meng et al., 2020) was released to fill this gap: it contains 1.1M dialogue turns, each paired with the corresponding visual context in which it takes place. Models therefore need to learn to generate dialogue utterances based not only on preceding textual contexts but also on visual contexts.

2.2 Dialog Generation

Open Domain Dialog Generation

Open-domain dialog generation simulates real human conversation and is a long-standing task in NLP (Weizenbaum, 1966; Colby, 1975; Wallace, 2009). Currently, most research on open-domain dialog generation is based on the sequence-to-sequence architecture (Vinyals and Le, 2015; Li et al., 2015; Dodge et al., 2016; Serban et al., 2016; Zhao et al., 2017; Xie et al., 2017; Lee et al., 2019; Ghandeharioun et al., 2019; Li, 2020; Han et al., 2020b; Zhang et al., 2019; Roller et al., 2020). Whether a model can generate diverse (Xu et al., 2018; Baheti et al., 2018), coherent (Li et al., 2016b, 2017; Tian et al., 2017; Bosselut et al., 2018; Adiwardana et al., 2020), informative (Shao et al., 2017; Lewis et al., 2017; Ghazvininejad et al., 2017; Young et al., 2017; Zhao et al., 2019) and knowledge-fused (Hua et al., 2020; Zhao et al., 2020; He et al., 2020) responses has become the standard by which dialog generation models are evaluated. However, most of the research described above is developed on text only, and the development of multi-modal dialog generation has been relatively slow because of the lack of large-scale datasets.

Visual Dialog Generation

Most existing works apply attention mechanisms to model the interplay between text and visual contexts (Lu et al., 2017; Kottur et al., 2018; Jiang and Bansal, 2019; Yang et al., 2019; Guo et al., 2019; Niu et al., 2019; Kang et al., 2019; Park et al., 2020; Jiang et al., 2020b). Other techniques, such as reinforcement learning (Das et al., 2017b; Wu et al., 2018), variational auto-encoders (Massiceti et al., 2018) and graph networks (Zheng et al., 2019; Jiang et al., 2020a), have also been applied to the visual dialog task. More recently, based on the OpenViDial dataset (Meng et al., 2020), Wang et al. (2021) proposed three attention-based models (Vaswani et al., 2017) to generate dialogue utterances given the preceding text-visual contexts, and further proposed to model text-visual dependency to improve dialogue quality. This was an initial step toward text-visual open-domain dialogue generation, as opposed to answering questions based on an image.

Statistics                     | OpenViDial 1.0 | OpenViDial 2.0
Number of turns                | 1.1M           | 5.6M
Number of images               | 1.1M           | 5.6M
Vocab size before BPE          | 70K            | 278K
Vocab size after BPE           | 30K            | 30K
Average length of each episode | 14             | 48
Average length of each turn    | 7.6            | 8.3
Table 1: Detailed statistics for OpenViDial 2.0 and a comparison to OpenViDial 1.0.

Split | OpenViDial 1.0 | OpenViDial 2.0
Train | 1M             | 4.6M
Dev   | 50K            | 0.5M
Test  | 50K            | 0.5M
Table 2: Training, dev and test splits.

3 Constructing OpenViDial 2.0

In this section, we describe the details of constructing OpenViDial 2.0. We first collect a raw dataset consisting of about 800 English movies and TV series with an average length of 2.5 hours per video. Each video has a corresponding external English subtitle file in which each line is a string containing the subtitle text and its time interval. None of the videos has embedded subtitles.

The full process of constructing OpenViDial 2.0 can be divided into three steps: (1) segmenting each video into frames; (2) pairing frames with subtitle text from the corresponding subtitle file; (3) splitting these (image, text) pairs into dialog turns. The OpenCV toolkit (Bradski, 2000) is used to segment each video into images frame by frame, and we discard the first and the last 10 minutes of each video because intro sequences are common in movies and TV series. To pair images with textual subtitles for each video, we first read the video's subtitle file row by row and obtain the time interval as well as the subtitle text. Then, we extract the group of images that falls within the time interval and randomly choose one image from the group as the visual context paired with the subtitle text, forming a paired (image, text) dialog turn.
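As a rough illustration of the pairing step described above, the following Python sketch maps each subtitle interval to one randomly chosen frame index and drops subtitles that fall in the first or last 10 minutes. It is not the authors' released code. The SubRip-style timing format, the function names (parse_interval, pick_frame, pair_subtitles, read_frame) and the fps and duration parameters are assumptions made for this example; only the use of OpenCV, the 10-minute margin and the random choice of one frame per interval come from the description above.

```python
import random
import re

# Assumption: timing lines look like "00:12:03,500 --> 00:12:05,250" (SubRip style).
TIME_RE = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)


def parse_interval(line):
    """Return (start_sec, end_sec) for a subtitle timing line, or None."""
    m = TIME_RE.search(line)
    if m is None:
        return None
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in m.groups())
    start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000.0
    end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000.0
    return start, end


def pick_frame(start, end, fps, duration, margin=600.0, rng=random):
    """Pick one random frame index inside [start, end].

    Returns None when the interval touches the discarded first or last
    `margin` seconds (600 s = 10 minutes) of the video.
    """
    if start < margin or end > duration - margin:
        return None
    first = int(start * fps)
    last = max(first, int(end * fps))
    return rng.randint(first, last)


def pair_subtitles(entries, fps, duration, seed=0):
    """entries: iterable of (timing_line, text). Returns [(frame_index, text)]."""
    rng = random.Random(seed)
    pairs = []
    for timing, text in entries:
        interval = parse_interval(timing)
        if interval is None:
            continue  # skip lines without a recognizable time interval
        frame = pick_frame(*interval, fps=fps, duration=duration, rng=rng)
        if frame is not None:
            pairs.append((frame, text))
    return pairs


def read_frame(video_path, frame_index):
    """Read a single frame with OpenCV (needs the opencv-python package)."""
    import cv2  # imported lazily so the pairing logic runs without OpenCV

    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
    ok, image = cap.read()
    cap.release()
    return image if ok else None
```

Working with frame indices rather than decoded images keeps the pairing logic cheap; only the one selected frame per subtitle has to be decoded and stored.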

We construct a final dataset of 5.6M dialog turns, where each turn consists of a sequence of words and an image. The image size is one of (1) 1280×720, (2) 1920×1080, or (3) 2048×1080, depending on the video source. We employ the BPE tokenizer (Sennrich et al., 2016) to preprocess the text. A detailed comparison with OpenViDial 1.0 is shown in Table 1. The splits for training, dev and test are shown in Table 2.

Dataset                                                       | Genre              | Multi-Modal? | # Sentences | # Images
OpenSubtitles 2016 (Lison and Tiedemann, 2016)                | Plain-text Dialog  | No           | 337M        | -
Cornell Movie-Dialogs (Danescu-Niculescu-Mizil and Lee, 2011) | Plain-text Dialog  | No           | 0.3M        | -
VisDial v1.0 (Das et al., 2017a)                              | VQA                | Yes          | 2.4M        | 120K
Guess-What?! (de Vries et al., 2017)                          | VQA                | Yes          | 0.8M        | 66K
AVSD (Alamri et al., 2019)                                    | VQA                | Yes          | 152K        | -
OpenViDial 1.0 (Meng et al., 2020)                            | Visual+Text Dialog | Yes          | 1.1M        | 1.1M
OpenViDial 2.0                                                | Visual+Text Dialog | Yes          | 5.6M        | 5.6M
Table 3: A comparison of different datasets. VQA: Visual Question Answering.

In Table 3, we compare OpenViDial 2.0 with existing widely used dialog datasets. Both OpenViDial 1.0 and OpenViDial 2.0 focus on multi-modal dialog generation, whereas VisDial, Guess-What?! and AVSD focus more on VQA. Compared with OpenViDial 1.0, OpenViDial 2.0 is much larger in scale, about 5 times as big.

System | Model  | BLEU | Dis-1  | Dis-2  | Dis-3  | Dis-4
NV     | w/o MI | 1.95 | 0.0037 | 0.0302 | 0.0929 | 0.1711
NV     | w/ MI  | 1.96 | 0.0039 | 0.0311 | 0.0953 | 0.1630
CV     | w/o MI | 1.97 | 0.0041 | 0.0353 | 0.0999 | 0.1726
CV     | w/ MI  | 1.98 | 0.0047 | 0.0392 | 0.1093 | 0.1774
FV     | w/o MI | 1.99 | 0.0056 | 0.0431 | 0.1250 | 0.2215
FV     | w/ MI  | 2.00 | 0.0060 | 0.0460 | 0.1321 | 0.2311
Table 4: Automatic evaluation results for BLEU and Diversity (Dis-1 to Dis-4).

To evaluate OpenViDial 2.0, we run experiments on it using the multi-modal dialog models proposed in Wang et al. (2021).

3.1 Vanilla Visual Dialog Models

According to the granularity of the visual features, which ranges from none through coarse-grained image features to fine-grained object features, Wang et al. (2021) proposed three vanilla visual dialog models: (1) the NoVisual (NV) model, (2) the CoarseVisual (CV) model and (3) the FineVisual (FV) model.


The NV model is a general uni-modal dialog generation model, which learns to generate responses using only dialog texts, without visual information. A standard Transformer architecture (Vaswani et al., 2017) is used as the backbone for the NV model. For each dialog turn, all the preceding dialog texts are packed into one long sequence with a special token as the delimiter. This sequence is then embedded with positional encodings, including a sentence-level positional encoding and a token-level positional encoding. Finally, it is fed to the Transformer as input.
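The input packing for the NV model can be sketched as follows. This is a minimal illustration, not the implementation of Wang et al. (2021). The delimiter string "<sep>", the function name pack_context, and the convention that token-level positions restart at zero in every turn are all assumptions; the text above specifies only that a special delimiter token and two levels of positional encoding are used.

```python
SEP = "<sep>"  # hypothetical delimiter token; the source does not name it


def pack_context(turns):
    """Flatten tokenized turns into one sequence with two position ids.

    Returns (tokens, sentence_pos, token_pos), three lists of equal length.
    sentence_pos is the index of the turn a token belongs to; token_pos is
    the token's offset inside that turn (an assumed convention).
    """
    tokens, sentence_pos, token_pos = [], [], []
    for s_idx, turn in enumerate(turns):
        pieces = list(turn)
        if s_idx < len(turns) - 1:
            pieces.append(SEP)  # delimiter closes every turn except the last
        for t_idx, tok in enumerate(pieces):
            tokens.append(tok)
            sentence_pos.append(s_idx)
            token_pos.append(t_idx)
    return tokens, sentence_pos, token_pos


tokens, sentence_pos, token_pos = pack_context(
    [["how", "are", "you"], ["fine", "thanks"]]
)
# tokens       -> ["how", "are", "you", "<sep>", "fine", "thanks"]
# sentence_pos -> [0, 0, 0, 0, 1, 1]
# token_pos    -> [0, 1, 2, 3, 0, 1]
```

In a full model, each of the three lists would index its own embedding table, and the three embeddings would be summed before entering the Transformer.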


In contrast to the NV model, the CV model injects coarse-level visual information into dialog generation. For each dialog turn, it uses a ResNet-50 model (He et al., 2016) pre-trained on ImageNet (Krizhevsky et al., 2012) to extract a high-dimensional feature for each image as the visual information. The image feature is then added to its corresponding text representation, forming the text-visual feature. Positional encodings are also used to encode position information. The concatenated long text-visual sequence is fed into the Transformer model.


Extracting visual information from a coarse view might be insufficient to model fine-grained visual elements in images, such as facial expressions, body gestures and physical motions. The FV model thus uses Faster R-CNN (Ren et al., 2015) pre-trained on Visual Genome (Krishna et al., 2017) to extract fine-grained visual features. Unlike the CV model, the FV model directly concatenates the set of extracted fine-grained visual features with the dialog texts into one long sequence. In addition to the sentence-level and token-level positional embeddings, there is an additional positional embedding for visual features.

3.2 Visual-Text Mutual Dependency

Although each response is generated according to the preceding textual and visual contexts, there is no guarantee on whether or how much the visual contexts are used. To strengthen the connection between the generated response and its visual contexts, Wang et al. (2021) proposed to model the mutual information (MI) between visual contexts and text features. For simplicity, we use the term visual feature to refer to both the coarse-grained feature and the fine-grained feature. To build the connection between visual contexts and textual utterances, a light discriminative network is trained. Its sole task is to judge the degree of connection between a given visual feature and a textual feature. At each inference step, both the CV and FV models are required to generate an N-best list of responses, each with its probability as the forward probability, rather than only the best response. Each response in the N-best list, along with the preceding visual feature, is then fed into the trained discriminative network to obtain the backward probability. Finally, the forward probability and the backward probability are combined to rerank the N-best list. For more details, please refer to Wang et al. (2021).
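The reranking step can be sketched as below. The text above does not state how the forward and backward probabilities are combined, so the log-linear interpolation with a weight parameter is purely an assumption for illustration, as are the function name rerank and the backward_prob callable that stands in for the trained discriminative network.

```python
import math


def rerank(candidates, backward_prob, weight=0.5):
    """Rerank an N-best list by combining forward and backward scores.

    candidates: list of (response, forward_probability) pairs.
    backward_prob: callable mapping a response to a probability in (0, 1];
        a stand-in for the discriminative network's output.
    weight: share given to the backward score (assumed interpolation).
    """
    scored = []
    for response, forward_p in candidates:
        backward_p = backward_prob(response)
        score = (1.0 - weight) * math.log(forward_p) + weight * math.log(backward_p)
        scored.append((score, response))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [response for _, response in scored]


# Toy example: the generic reply is likelier under the generator, but the
# stand-in discriminator judges the second reply closer to the image.
n_best = [("i do not know", 0.5), ("look at that red car", 0.3)]
toy_backward = {"i do not know": 0.1, "look at that red car": 0.9}.get
best = rerank(n_best, toy_backward)[0]
```

With weight set to 0 the ordering reduces to the generator's own ranking, which makes it easy to check how much the backward score changes the output.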

3.3 Results

Following Wang et al. (2021), we report the results in terms of the following automatic evaluation metrics:

  • BLEU: The BLEU score is a common automatic evaluation metric for many NLP tasks (Papineni et al., 2002; Sordoni et al., 2015), which scores the n-gram overlap between the generated sequences and the reference sequences. For our experiments we report the BLEU-4 score.

  • Diversity: Diversity is commonly reported for dialogue generation (Li et al., 2015); it scores the number of distinct n-grams in the generated responses. The values for n = 1 to 4 appear as Dis-1 through Dis-4 in Table 4.
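For concreteness, the diversity metric can be computed along the following lines. This is a sketch under assumptions: the function name distinct_n is ours, and dividing the distinct n-gram count by the total number of n-grams is one common normalization; the exact normalization used for the reported numbers is not specified in this report.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over tokenized responses.

    responses: list of token lists. Returns 0.0 when no n-gram exists.
    """
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


generated = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
dis_1 = distinct_n(generated, 1)  # 5 unique unigrams out of 8
dis_2 = distinct_n(generated, 2)  # 4 unique bigrams out of 6
```

Because the counts are pooled over all generated responses, a model that repeats the same generic reply scores low even if each individual reply contains no repeated n-grams.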

Results are shown in Table 4. Since OpenViDial 2.0 is much larger than OpenViDial 1.0, we use only the top 5 objects for the FineVisual model, compared with the top 20 objects on OpenViDial 1.0; this is the main reason why FV does not perform significantly better than CV and NV.

4 Conclusion

In this report, we release OpenViDial 2.0, a larger-scale open-domain multi-modal dialogue dataset with visual contexts, updated from the previous version 1.0. OpenViDial 2.0 contains a total of 5.6 million dialogue turns extracted from movies and TV series from different sources, and is about five times the size of version 1.0. We hope this large-scale dataset can help facilitate future research on open-domain multi-modal dialog generation. OpenViDial 2.0 is available at