DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue

by Hung Le, et al.

A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem involving complex multimodal and temporal inputs, and studying them independently is hard with existing datasets. Existing benchmarks do not have enough annotations to help analyze dialogue systems and understand their linguistic and visual reasoning capability and limitations in isolation. These benchmarks are also not explicitly designed to minimize biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present a diagnostic dataset that can test a range of reasoning abilities on videos and dialogues. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning each question requires, including cross-turn video interval tracking and dialogue object tracking. We use our dataset to analyze several dialogue system approaches, providing interesting insights into their abilities and limitations. In total, the dataset contains 10 instances of 10-round dialogues for each of ∼11k synthetic videos, resulting in more than 100k dialogues and 1M question-answer pairs. Our code and dataset will be made public.



