Visual Coreference Resolution in Visual Dialog using Neural Module Networks

09/06/2018
by   Satwik Kottur, et al.
0

Visual dialog entails answering a series of questions grounded in an image, using dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem called visual coreference resolution that involves determining which words, typically noun phrases and pronouns, co-refer to the same entity/object instance in an image. This is crucial, especially for pronouns (e.g., `it'), as the dialog agent must first link it to a previous coreference (e.g., `boat'), and only then can rely on the visual grounding of the coreference `boat' to reason about the pronoun `it'. Prior work (in visual dialog) models visual coreference resolution either (a) implicitly via a memory network over history, or (b) at a coarse level for the entire question; and not explicitly at a phrase level of granularity. In this work, we propose a neural module network architecture for visual dialog by introducing two novel modules - Refer and Exclude - that perform explicit, grounded, coreference resolution at a finer word level. We demonstrate the effectiveness of our model on MNIST Dialog, a visually simple yet coreference-wise complex dataset, by achieving near perfect accuracy, and on VisDial, a large and challenging visual dialog dataset on real images, where our model outperforms other approaches, and is more interpretable, grounded, and consistent qualitatively.

READ FULL TEXT
research
02/25/2019

Dual Attention Networks for Visual Reference Resolution in Visual Dialog

Visual dialog (VisDial) is a task which requires an AI agent to answer a...
research
03/07/2019

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

Visual Dialog is a multimodal task of answering a sequence of questions ...
research
05/29/2022

VD-PCR: Improving Visual Dialog with Pronoun Coreference Resolution

The visual dialog task requires an AI agent to interact with humans in m...
research
08/22/2022

Neuro-Symbolic Visual Dialog

We propose Neuro-Symbolic Visual Dialog (NSVD) -the first method to comb...
research
11/02/2020

Reasoning Over History: Context Aware Visual Dialog

While neural models have been shown to exhibit strong performance on sin...
research
04/14/2020

DialGraph: Sparse Graph Learning Networks for Visual Dialog

Visual dialog is a task of answering a sequence of questions grounded in...
research
09/13/2021

Learning to Ground Visual Objects for Visual Dialog

Visual dialog is challenging since it needs to answer a series of cohere...

Please sign up or login with your details

Forgot password? Click here to reset