LOIS: Looking Out of Instance Semantics for Visual Question Answering

07/26/2023
by   Siyu Zhang, et al.

Visual question answering (VQA) has been intensively studied as a multimodal task that requires bridging vision and language to infer answers correctly. Recent attempts have developed various attention-based modules for solving VQA tasks. However, model inference remains largely bottlenecked by visual processing for semantic understanding. Most existing detection methods rely on bounding boxes, which makes it difficult for VQA models to understand the causal nexus of object semantics in images and to infer contextual information correctly. To this end, we propose a finer-grained framework that dispenses with bounding boxes, termed Looking Out of Instance Semantics (LOIS), to tackle this issue. LOIS produces more fine-grained feature descriptions from which visual facts can be derived. Furthermore, to overcome the label ambiguity caused by instance masks, two types of relation attention modules, 1) intra-modality and 2) inter-modality, are devised to infer the correct answers from the different multi-view features. Specifically, we implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information. In addition, our proposed attention model can further analyze salient image regions by focusing on important question-related words. Experimental results on four benchmark VQA datasets demonstrate that our proposed method improves visual reasoning capability.
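The abstract does not specify the exact layer design, but the intra-/inter-modality relation attention it describes can be illustrated with a generic attention block. The sketch below is a hypothetical PyTorch example; the dimensions, module names, and residual/normalization choices are assumptions rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RelationAttention(nn.Module):
    """Generic relation attention block (illustrative only, not LOIS's exact design)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, context):
        # queries: features of one modality, e.g. question-word embeddings [B, Nq, dim]
        # context: features attended over, e.g. instance-mask visual features [B, Nv, dim]
        attended, _ = self.attn(queries, context, context)
        return self.norm(queries + attended)

# Intra-modality relations: a modality attends to itself (context == queries).
# Inter-modality relations: one modality attends to the other (e.g. words over mask features).
block = RelationAttention(dim=512, num_heads=8)
words = torch.randn(2, 14, 512)   # toy question features
masks = torch.randn(2, 36, 512)   # toy instance-mask features
intra = block(words, words)
inter = block(words, masks)
```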

