Multi-View Attention Networks for Visual Dialog

04/29/2020
by Sungjin Park, et al.

Visual dialog is a challenging vision-language task in which an agent answers a series of questions grounded in a given image. Solving the task requires a high-level understanding of various multimodal inputs (e.g., question, dialog history, image, and answer). Specifically, an agent must 1) understand the question-relevant dialog history and 2) focus on question-relevant visual content among the diverse visual content in the image. In this paper, we propose the Multi-View Attention Network (MVAN), which considers complementary views of multimodal inputs based on attention mechanisms. MVAN captures question-relevant information from the dialog history with two different textual views (i.e., Topic Aggregation and Context Matching) and integrates multimodal representations with a two-step fusion process. Experimental results on the VisDial v1.0 and v0.9 benchmarks show the effectiveness of the proposed model, which outperforms the previous state-of-the-art methods on all evaluation metrics.
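
To make the idea of attending to question-relevant dialog history concrete, the following is a minimal sketch of question-guided attention over per-round history vectors. It is not the MVAN implementation: the abstract does not specify how Topic Aggregation or Context Matching are computed, so the scaled dot-product scoring, the function names `softmax` and `attend`, and the toy two-dimensional vectors are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question, history):
    """Attend over dialog-history vectors using a question vector.

    question: list[float] of length d (hypothetical question embedding)
    history:  list of list[float], one d-dimensional vector per dialog round
    Returns (weights, context), where context is the weighted sum of history.
    """
    d = len(question)
    # Assumed scoring function: scaled dot product between question and round.
    scores = [
        sum(q * h for q, h in zip(question, round_vec)) / math.sqrt(d)
        for round_vec in history
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the history vectors.
    context = [
        sum(w * round_vec[i] for w, round_vec in zip(weights, history))
        for i in range(d)
    ]
    return weights, context


# Toy example: the first round aligns with the question, the third opposes it.
question = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, context = attend(question, history)
```

In this toy setup the round most similar to the question receives the largest weight, which is the behavior an attention-based history encoder relies on; a real model would use learned embeddings and projections rather than fixed vectors.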


research
12/06/2018

Recursive Visual Attention in Visual Dialog

Visual dialog is a challenging vision-language task, which requires the ...
research
12/18/2019

DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog

Visual Dialog is a vision-language task that requires an AI agent to eng...
research
02/01/2019

Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog

This paper presents Recurrent Dual Attention Network (ReDAN) for visual ...
research
12/07/2021

UNITER-Based Situated Coreference Resolution with Rich Multimodal Input

We present our work on the multimodal coreference resolution task of the...
research
04/10/2022

Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog

Visual Dialog requires an agent to engage in a conversation with humans ...
research
08/02/2020

SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space

In this work, we formulate a visual dialog as an information flow in whi...
research
04/14/2020

DialGraph: Sparse Graph Learning Networks for Visual Dialog

Visual dialog is a task of answering a sequence of questions grounded in...
