Lightweight Visual Question Answering using Scene Graphs

Visual question answering (VQA) is a challenging problem in machine perception, which requires the deep joint understanding of both visual and textual data. Recent research has advanced the automatic generation of high-quality scene graphs from images, while powerful yet elegant models like graph neural networks (GNNs) have shown a great power in reasoning over graph-structured data. In this work, we propose to bridge the gap between scene graph generation and VQA by leveraging GNNs. In particular, we design a new model called Conditional Enhanced Graph ATtention network (CE-GAT) to encode pairs of visual and semantic scene graphs with both node and edge features, which is seamlessly integrated with a textual question encoder to generate answers through questiongraph conditioning. Moreover, to alleviate the training difficulties of CE-GAT towards VQA, we enforce more useful inductive biases in the scene graphs through novel question-guided graph enriching and pruning. Finally, we evaluate the framework on one of the largest available VQA datasets (namely, GQA) with groundtruth scene graphs, achieving the accuracy of 77.87%, compared with the state of the art (namely, the neural state machine (NSM)), which gives 63.17%. Notably, by leveraging existing scene graphs, our framework is much lighter compared with end-to-end VQA methods (e.g., about 95.3% less parameters than a typical NSM).

READ FULL TEXT

page 3

page 4

research
01/14/2021

Understanding the Role of Scene Graphs in Visual Question Answering

Visual Question Answering (VQA) is of tremendous interest to the researc...
research
03/22/2021

How to Design Sample and Computationally Efficient VQA Models

In multi-modal reasoning tasks, such as visual question answering (VQA),...
research
02/21/2022

OG-SGG: Ontology-Guided Scene Graph Generation. A Case Study in Transfer Learning for Telepresence Robotics

Scene graph generation from images is a task of great interest to applic...
research
05/23/2022

VQA-GNN: Reasoning with Multimodal Semantic Graph for Visual Question Answering

Visual understanding requires seamless integration between recognition a...
research
03/07/2023

Graph Neural Networks in Vision-Language Image Understanding: A Survey

2D image understanding is a complex problem within Computer Vision, but ...
research
03/16/2019

Visual Query Answering by Entity-Attribute Graph Matching and Reasoning

Visual Query Answering (VQA) is of great significance in offering people...
research
03/24/2022

Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer

Transformer-based approaches have shown great success in visual question...

Please sign up or login with your details

Forgot password? Click here to reset