SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

12/16/2022
by   Siwen Luo, et al.
0

Most TextVQA approaches focus on the integration of objects, scene texts and question words by a simple transformer encoder. But this fails to capture the semantic relations between different modalities. The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words. It is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image. We created a guided-attention module to capture the intra-modal interplay between the language and the vision as a guidance for inter-modal interactions. To make explicit teaching of the relations between the two modalities, we proposed and integrated two attention modules, namely a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conducted extensive experiments on two benchmark datasets, Text-VQA and ST-VQA. It is shown that our SceneGATE method outperformed existing ones because of the scene graph and its attention modules.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/29/2019

Relation-aware Graph Attention Network for Visual Question Answering

In order to answer semantically-complicated questions about an image, a ...
research
10/24/2020

RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering

Text-based visual question answering (VQA) requires to read and understa...
research
09/25/2019

Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching

Learning semantic correspondence between image and text is significant a...
research
03/24/2022

Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering

Texts in scene images convey critical information for scene understandin...
research
03/31/2020

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Answering questions that require reading texts in an image is challengin...
research
03/18/2022

Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation

Scene Graph Generation, which generally follows a regular encoder-decode...
research
07/17/2020

Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation

Scene graph aims to faithfully reveal humans' perception of image conten...

Please sign up or login with your details

Forgot password? Click here to reset