End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

06/21/2018
by   Chiori Hori, et al.
0

Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation that consists of a sequence of 10 question-and-answer(QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on roughly 9,000 videos. Using this new dataset for Audio Visual Scene-aware dialog (AVSD), we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code and pretrained models will be publicly available for a new Video Scene-Aware Dialog challenge.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/01/2018

Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7

Scene-aware dialog systems will be able to have conversations with users...
research
10/13/2021

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (...
research
09/24/2021

An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog

Online conversations include more than just text. Increasingly, image-ba...
research
09/04/2023

LoRA-like Calibration for Multimodal Deception Detection using ATSFace Data

Recently, deception detection on human videos is an eye-catching techniq...
research
12/20/2019

Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog

We are witnessing a confluence of vision, speech and dialog system techn...
research
12/20/2019

Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog

With the recent advancements in Artificial Intelligence (AI), Intelligen...
research
07/08/2020

Spatio-Temporal Scene Graphs for Video Dialog

The Audio-Visual Scene-aware Dialog (AVSD) task requires an agent to ind...

Please sign up or login with your details

Forgot password? Click here to reset