SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

01/25/2022
by Peixi Xiong, et al.

Visual Question Answering (VQA) has attracted considerable attention from both industry and academia. As a multi-modal task, it is challenging because it requires not only visual and textual understanding, but also the ability to align cross-modality representations. Previous approaches extensively employ entity-level alignments, such as correlations between visual regions and their semantic labels, or interactions between question words and object features. These attempts aim to improve cross-modality representations while ignoring their internal relations. Instead, we propose structured alignments, which operate on graph representations of the visual and textual content and aim to capture the deep connections between the two modalities. Nevertheless, representing and integrating graphs for structured alignment is nontrivial. In this work, we address this issue by first converting the entities of each modality into sequential nodes and an adjacency graph, and then incorporating them for structured alignment. As our experimental results demonstrate, such structured alignment improves reasoning performance. In addition, our model exhibits better interpretability for each generated answer. The proposed model, without any pretraining, outperforms the state-of-the-art methods on the GQA dataset and beats the non-pretrained state-of-the-art methods on the VQA-v2 dataset.
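To make the abstract's core mechanism concrete, the sketch below illustrates one way to realize it: each modality's entities become a node sequence plus an adjacency graph, intra-modal structure is injected with adjacency-masked self-attention, and cross-attention then aligns question nodes to visual nodes. This is a minimal, hypothetical simplification, not the paper's actual architecture; the class name `StructuredAlignment`, the helper `graph_attend`, and all dimensions are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class StructuredAlignment(nn.Module):
    """Toy structured alignment: project each modality's nodes into a shared
    space, inject intra-modal graph structure with adjacency-masked
    self-attention, then align question nodes to visual nodes via
    cross-attention. A hypothetical sketch, not the paper's exact model."""

    def __init__(self, d_vis, d_txt, d_model=256):
        super().__init__()
        self.proj_v = nn.Linear(d_vis, d_model)  # visual nodes -> shared space
        self.proj_t = nn.Linear(d_txt, d_model)  # textual nodes -> shared space
        self.cross = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    @staticmethod
    def graph_attend(x, adj):
        # Self-attention restricted to graph edges: scores for non-edges are
        # masked out. Assumes adj includes self-loops so no row is all zero.
        scores = x @ x.transpose(-1, -2) / x.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        return torch.softmax(scores, dim=-1) @ x

    def forward(self, v_nodes, v_adj, t_nodes, t_adj):
        v = self.graph_attend(self.proj_v(v_nodes), v_adj)  # structure-aware visual nodes
        t = self.graph_attend(self.proj_t(t_nodes), t_adj)  # structure-aware textual nodes
        aligned, _ = self.cross(t, v, v)  # question nodes attend over visual nodes
        return aligned


# Usage sketch: 36 region features (2048-d) and 12 question tokens (300-d).
model = StructuredAlignment(d_vis=2048, d_txt=300)
v, t = torch.randn(1, 36, 2048), torch.randn(1, 12, 300)
v_adj = torch.eye(36).unsqueeze(0)   # self-loops only, for illustration
t_adj = torch.ones(1, 12, 12)        # fully connected stand-in for a parse graph
print(model(v, v_adj, t, t_adj).shape)  # torch.Size([1, 12, 256])
```

The adjacency masks are where the "structure" enters: swapping the placeholder graphs for a scene graph over regions and a dependency parse over question words would recover the kind of structured alignment the abstract describes.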

Related research

01/25/2022
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering
Learning to answer visual questions is a challenging task since the mult...

05/12/2020
Cross-Modality Relevance for Reasoning on Language and Vision
This work deals with the challenge of learning and reasoning over langua...

08/10/2019
Multi-modality Latent Interaction Network for Visual Question Answering
Exploiting relationships between visual regions and question words have ...

03/29/2019
Relation-aware Graph Attention Network for Visual Question Answering
In order to answer semantically-complicated questions about an image, a ...

09/19/2016
Graph-Structured Representations for Visual Question Answering
This paper proposes to improve visual question answering (VQA) with stru...

04/17/2019
Question Guided Modular Routing Networks for Visual Question Answering
Visual Question Answering (VQA) faces two major challenges: how to bette...

12/06/2019
Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
The large adoption of the self-attention (i.e. transformer model) and BE...
