Lightweight Visual Question Answering using Scene Graphs

Visual question answering (VQA) is a challenging problem in machine perception that requires deep joint understanding of both visual and textual data. Recent research has advanced the automatic generation of high-quality scene graphs from images, while powerful yet elegant models such as graph neural networks (GNNs) have shown great power in reasoning over graph-structured data. In this work, we propose to bridge the gap between scene graph generation and VQA by leveraging GNNs. In particular, we design a new model called Conditional Enhanced Graph ATtention network (CE-GAT) to encode pairs of visual and semantic scene graphs with both node and edge features; it is seamlessly integrated with a textual question encoder to generate answers through question-graph conditioning. Moreover, to alleviate the difficulty of training CE-GAT for VQA, we enforce more useful inductive biases in the scene graphs through novel question-guided graph enriching and pruning. Finally, we evaluate the framework on one of the largest available VQA datasets, GQA, using ground-truth scene graphs, achieving an accuracy of 77.87%, compared with 63.17% for the state of the art, the neural state machine (NSM). Notably, by leveraging existing scene graphs, our framework is much lighter than end-to-end VQA methods (e.g., about 95.3% fewer parameters than a typical NSM).
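The abstract does not spell out how question-guided pruning works, so the following is only an illustrative sketch of the general idea, not the paper's procedure. It assumes a scene graph stored as labeled nodes plus (subject, relation, object) triples, and it keeps the nodes whose labels literally appear in the question together with their neighbors. The names `SceneGraph`, `prune_by_question`, and `hops` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    # Hypothetical container: node id -> object label, plus relation triples.
    nodes: Dict[int, str]
    edges: List[Tuple[int, str, int]]  # (subject id, relation, object id)


def prune_by_question(graph: SceneGraph, question: str, hops: int = 1) -> SceneGraph:
    """Keep nodes whose label is mentioned in the question, plus any node
    reachable from them within `hops` edges (direction ignored).

    Assumption: plain lowercase word matching stands in for whatever
    question-graph matching a real system would use.
    """
    tokens = set(question.lower().replace("?", " ").split())
    keep = {nid for nid, label in graph.nodes.items() if label.lower() in tokens}

    for _ in range(hops):
        # Collect neighbors first so each loop iteration expands exactly one hop.
        frontier = set()
        for subj, _rel, obj in graph.edges:
            if subj in keep:
                frontier.add(obj)
            if obj in keep:
                frontier.add(subj)
        keep |= frontier

    nodes = {nid: label for nid, label in graph.nodes.items() if nid in keep}
    edges = [(s, r, o) for s, r, o in graph.edges if s in keep and o in keep]
    return SceneGraph(nodes, edges)


graph = SceneGraph(
    nodes={0: "man", 1: "hat", 2: "tree", 3: "dog"},
    edges=[(0, "wearing", 1), (3, "near", 2)],
)
pruned = prune_by_question(graph, "What is the man wearing?")
print(pruned.nodes)  # {0: 'man', 1: 'hat'}
print(pruned.edges)  # [(0, 'wearing', 1)]
```

In this toy case the unrelated "dog near tree" subgraph is dropped before any graph network would see it, which illustrates the kind of inductive bias that pruning is meant to supply.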




