Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog

12/20/2019
by Shachi H Kumar, et al.

With recent advances in Artificial Intelligence (AI), Intelligent Virtual Assistants (IVAs) such as Alexa and Google Home have become a ubiquitous part of many homes. Currently, such IVAs are mostly audio-based, but we are witnessing a confluence of vision, speech and dialog system technologies that is enabling IVAs to learn audio-visual groundings of utterances. This will enable agents to have conversations with users about the objects, activities and events surrounding them. In this work, we present three main architectural explorations for the Audio Visual Scene-Aware Dialog (AVSD) task: 1) investigating 'topics' of the dialog as an important contextual feature for the conversation, 2) exploring several multimodal attention mechanisms during response generation, and 3) incorporating an end-to-end audio classification ConvNet, AclNet, into our architecture. We present a detailed analysis of the experimental results and show that our model variations outperform the baseline system presented for the AVSD task.

research
12/20/2018

Context, Attention and Audio Feature Explorations for Audio Visual Scene-Aware Dialog

With the recent advancements in AI, Intelligent Virtual Assistants (IVA)...
research
12/20/2019

Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog

We are witnessing a confluence of vision, speech and dialog system techn...
research
06/21/2018

End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

Dialog systems need to understand dynamic visual scenes in order to have...
research
04/11/2019

A Simple Baseline for Audio-Visual Scene-Aware Dialog

The recently proposed audio-visual scene-aware dialog task paves the way...
research
01/17/2020

Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System

Understanding dynamic scenes and dialogue contexts in order to converse ...
research
02/21/2022

Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

There have been many attempts to build multimodal dialog systems that ca...
research
09/16/2018

Decision-support for the Masses by Enabling Conversations with Open Data

Open data refers to data that is freely available for reuse. Although th...
