Deep Modular Co-Attention Networks for Visual Question Answering

06/25/2019
by Zhou Yu, et al.

Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective "co-attention" model to associate keywords in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have used shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) consisting of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer jointly models the self-attention of questions and images, as well as the question-guided attention of images, using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state of the art. Our best single model delivers 70.63% overall accuracy on the test-dev set. Code is available at https://github.com/MILVLG/mcan-vqa.
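The two basic attention units the abstract refers to can be sketched in a few lines of NumPy: a self-attention (SA) unit, where a set of features attends to itself, and a guided-attention (GA) unit, where image features attend to question features. This is a minimal illustration of the idea only, not the MCAN implementation; it omits the multi-head projections, feed-forward sublayers, residual connections, and layer normalization of the actual model, and the function names and toy dimensions are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention over row-vector features.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

def self_attention_unit(x):
    # SA unit: each element of x attends to every element of x.
    return attention(x, x, x)

def guided_attention_unit(x, y):
    # GA unit: elements of x (e.g. image regions) attend to y (e.g. question words).
    return attention(x, y, y)

# Toy sizes (my assumption): 5 question words, 4 image regions, 8-dim features.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))   # question word features
v = rng.normal(size=(4, 8))   # image region features

# One MCA layer, encoder-decoder style: question self-attention first,
# then image self-attention followed by question-guided attention.
q_out = self_attention_unit(q)
v_out = guided_attention_unit(self_attention_unit(v), q_out)
print(q_out.shape, v_out.shape)  # (5, 8) (4, 8)
```

Cascading such layers in depth, with the output of one MCA layer feeding the next, is the "modular composition" the paper describes.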


Related research

- 07/27/2020: REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering
  Visual question answering (VQA) is a challenging multi-modal task that r...
- 10/01/2022: A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering
  Research in medical visual question answering (MVQA) can contribute to t...
- 08/23/2022: How good are deep models in understanding the generated images?
  My goal in this paper is twofold: to study how well deep models can unde...
- 08/04/2017: Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
  Visual question answering (VQA) is challenging because it requires a sim...
- 04/17/2019: Question Guided Modular Routing Networks for Visual Question Answering
  Visual Question Answering (VQA) faces two major challenges: how to bette...
- 08/07/2017: Structured Attentions for Visual Question Answering
  Visual attention, which assigns weights to image regions according to th...
- 05/30/2022: An Efficient Modern Baseline for FloodNet VQA
  Designing efficient and reliable VQA systems remains a challenging probl...
