Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

10/13/2021
by Ankit P. Shah, et al.

In previous work, we proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted AVSD challenge tracks at the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, we proposed a third AVSD challenge, at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task, which includes temporal reasoning, and our extension of the AVSD dataset for DSTC10, for which we collected human-generated temporal reasoning data. We also introduce a baseline system built using an AV-transformer, which we released along with the new dataset. Finally, this paper introduces a new system that extends our baseline with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performance on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one attention-based, and one based on a time-domain region proposal network.

research
06/21/2018

End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

Dialog systems need to understand dynamic visual scenes in order to have...
research
11/14/2019

The Eighth Dialog System Technology Challenge

This paper introduces the Eighth Dialog System Technology Challenge. In ...
research
02/01/2020

Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

Audio-Visual Scene-Aware Dialog (AVSD) is a task to generate responses w...
research
01/11/2019

Dialog System Technology Challenge 7

This paper introduces the Seventh Dialog System Technology Challenges (D...
research
12/20/2018

Context, Attention and Audio Feature Explorations for Audio Visual Scene-Aware Dialog

With the recent advancements in AI, Intelligent Virtual Assistants (IVA)...
research
08/10/2021

TrUMAn: Trope Understanding in Movies and Animations

Understanding and comprehending video content is crucial for many real-w...
research
04/11/2019

A Simple Baseline for Audio-Visual Scene-Aware Dialog

The recently proposed audio-visual scene-aware dialog task paves the way...
