BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

10/20/2020
by Hung Le, et al.

Video-grounded dialogues are very challenging due to (i) the complexity of videos which contain both spatial and temporal variations, and (ii) the complexity of user utterances which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus on superficial temporal-level visual cues, but neglect more fine-grained spatial signals from videos. To address this drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues. Specifically, our approach not only exploits both spatial and temporal-level information, but also learns dynamic information diffusion between the two feature spaces through spatial-to-temporal and temporal-to-spatial reasoning. The bidirectional strategy aims to tackle the evolving semantics of user queries in the dialogue setting. The retrieved visual cues are used as contextual information to construct relevant responses to the users. Our empirical results and comprehensive qualitative analysis show that BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark. We also adapt our BiST models to the Video QA setting, and substantially outperform prior approaches on the TGIF-QA benchmark.
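
The bidirectional idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the video is already encoded as a grid of feature vectors indexed by temporal step and spatial region, assumes the user query is a single vector, and swaps in plain scaled dot-product attention pooling for whatever reasoning modules BiST actually uses. The names `attend` and `bidirectional_pool` are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors, query):
    # Query-guided attention pooling (an assumption, not the paper's module):
    # score each vector by its scaled dot product with the query, then
    # return the softmax-weighted average of the vectors.
    scale = math.sqrt(len(query))
    scores = [sum(v * q for v, q in zip(vec, query)) / scale for vec in vectors]
    weights = softmax(scores)
    return [sum(w * vec[k] for w, vec in zip(weights, vectors))
            for k in range(len(query))]

def bidirectional_pool(video, query):
    # video[t][s] is the feature vector of spatial region s at temporal step t.
    num_steps, num_regions = len(video), len(video[0])

    # Spatial-to-temporal: pool over regions within each step, then over steps.
    per_step = [attend(video[t], query) for t in range(num_steps)]
    spatial_to_temporal = attend(per_step, query)

    # Temporal-to-spatial: pool over steps for each region, then over regions.
    per_region = [attend([video[t][s] for t in range(num_steps)], query)
                  for s in range(num_regions)]
    temporal_to_spatial = attend(per_region, query)

    # Concatenate both directions into one query-dependent visual context.
    return spatial_to_temporal + temporal_to_spatial
```

Each direction produces a query-dependent summary of the video; the two differ only in which axis is pooled first, so a query about "where" and a query about "when" can weight the grid differently. In a dialogue setting the query vector would change at every turn, and the concatenated result would serve as the visual context for response generation.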

research
06/27/2020

Video-Grounded Dialogues with Pretrained Generation Language Models

Pre-trained language models have shown remarkable success in improving v...
research
01/17/2020

Spatio-Temporal Ranked-Attention Networks for Video Captioning

Generating video descriptions automatically is a challenging task that i...
research
05/04/2023

ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos

Building benchmarks to systemically analyze different capabilities of vi...
research
01/01/2021

DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue

A video-grounded dialogue system is required to understand both dialogue...
research
06/16/2021

C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues

Video-grounded dialogue systems aim to integrate video understanding and...
research
03/24/2021

Structured Co-reference Graph Attention for Video-grounded Dialogue

A video-grounded dialogue system referred to as the Structured Co-refere...
research
06/26/2023

FunQA: Towards Surprising Video Comprehension

Surprising videos, e.g., funny clips, creative performances, or visual i...
