From FiLM to Video: Multi-turn Question Answering with Multi-modal Context

12/17/2018
by Dat Tien Nguyen et al.

Understanding audio-visual content and holding an informative conversation about it have both been challenging areas for intelligent systems. The Audio Visual Scene-aware Dialog (AVSD) challenge, organized as a track of the Dialog System Technology Challenge 7 (DSTC7), proposes a combined task in which a system has to answer questions about a video, given the video itself and a dialogue of previous question-answer pairs. For this task we propose a hierarchical encoder-decoder model that computes a multi-modal embedding of the dialogue context. It first embeds the dialogue history using two LSTMs. We extract video and audio frames at regular intervals and compute semantic features using pre-trained I3D and VGGish models, respectively. Before summarizing both modalities into fixed-length vectors using LSTMs, we use FiLM blocks to condition them on the embeddings of the current question, which allows us to reduce the dimensionality considerably. Finally, we use an LSTM decoder that we train with scheduled sampling and evaluate using beam search. Compared to the modality-fusing baseline model released by the AVSD challenge organizers, our model achieves relative improvements of more than 16%, scoring 0.36 BLEU-4, and more than 33% ...
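To make the question-conditioned encoding concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of a FiLM block that scales and shifts per-frame video features using parameters predicted from the question embedding, followed by an LSTM that summarizes the modulated sequence into a fixed-length vector. All module names, layer sizes, and the 2048-dimensional I3D-like feature size are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch, assuming PyTorch; dimensions and names are illustrative only.
import torch
import torch.nn as nn


class FiLMBlock(nn.Module):
    """Feature-wise Linear Modulation: scale and shift features with
    per-channel parameters predicted from a conditioning vector."""

    def __init__(self, feature_dim: int, cond_dim: int):
        super().__init__()
        # Predict gamma (scale) and beta (shift) from the question embedding.
        self.film = nn.Linear(cond_dim, 2 * feature_dim)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim), cond: (batch, cond_dim)
        gamma, beta = self.film(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * features + beta.unsqueeze(1)


class ConditionedSummarizer(nn.Module):
    """Condition frame-level features on the current question with FiLM,
    then summarize the sequence into a fixed-length vector with an LSTM."""

    def __init__(self, feature_dim: int, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.film = FiLMBlock(feature_dim, cond_dim)
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)

    def forward(self, frames: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        modulated = self.film(frames, question)
        _, (h_n, _) = self.lstm(modulated)
        return h_n[-1]  # (batch, hidden_dim) summary of one modality


# Toy usage: 8 I3D-like video frames of dimension 2048, question embedding of 256.
video_frames = torch.randn(4, 8, 2048)
question_emb = torch.randn(4, 256)
summary = ConditionedSummarizer(2048, 256, 512)(video_frames, question_emb)
print(summary.shape)  # torch.Size([4, 512])
```

In the model described above, one such question-conditioned summarizer would be applied per modality (I3D video features and VGGish audio features), and the resulting fixed-length vectors combined with the dialogue-history embedding before decoding.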

research · 02/01/2020
Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog
Audio-Visual Scene-Aware Dialog (AVSD) is a task to generate responses w...

research · 01/17/2020
Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System
Understanding dynamic scenes and dialogue contexts in order to converse ...

research · 08/22/2019
Entropy-Enhanced Multimodal Attention Model for Scene-Aware Dialogue Generation
With increasing information from social media, there are more and more v...

research · 06/05/2023
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
We present Video-LLaMA, a multi-modal framework that empowers Large Lang...

research · 07/08/2020
Spatio-Temporal Scene Graphs for Video Dialog
The Audio-Visual Scene-aware Dialog (AVSD) task requires an agent to ind...

research · 05/26/2020
History-Aware Question Answering in a Blocks World Dialogue System
It is essential for dialogue-based spatial reasoning systems to maintain...

research · 01/17/2020
Modality-Balanced Models for Visual Dialogue
The Visual Dialog task requires a model to exploit both image and conver...
