Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture

11/11/2021
by Michael Yang, et al.

Previous studies such as VizWiz find that Visual Question Answering (VQA) systems that can read and reason about text in images are useful in application areas such as assisting visually-impaired people. TextVQA is a VQA dataset geared towards this problem, where the questions require answering systems to read and reason about visual objects and text objects in images. One key challenge in TextVQA is the design of a system that effectively reasons not only about visual and text objects individually, but also about the spatial relationships between these objects. This motivates the use of 'edge features', that is, information about the relationship between each pair of objects. Some current TextVQA models address this problem but either only use categories of relations (rather than edge feature vectors) or do not use edge features within their Transformer architectures. In order to overcome these shortcomings, we propose a Graph Relation Transformer (GRT), which uses edge information in addition to node information for graph attention computation in the Transformer. We find that, without using any other optimizations, the proposed GRT method outperforms the accuracy of the M4C baseline model by 0.65% on the val set and 0.57% on the test set. Qualitatively, we observe that the GRT has superior spatial reasoning ability to M4C.
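To make the core idea concrete, the following is a minimal single-head PyTorch sketch of edge-aware attention, not the authors' exact formulation: the class name EdgeAwareAttention, the tensor shapes, and the choice to inject each projected edge feature as a bias on the attention logits and as an additive term in the aggregated values are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareAttention(nn.Module):
    """Single-head self-attention over node features plus pairwise edge features.

    A minimal sketch of the edge-augmented attention idea: in addition to
    node (object) embeddings, an edge feature vector for every object pair
    modulates the attention logits and the aggregated values. Dimensions and
    the exact way edges enter the computation are assumptions for clarity.
    """

    def __init__(self, d_node: int, d_edge: int):
        super().__init__()
        self.q = nn.Linear(d_node, d_node)
        self.k = nn.Linear(d_node, d_node)
        self.v = nn.Linear(d_node, d_node)
        # Project each edge feature to a scalar bias on the attention logit
        # and to a vector contributed to the aggregated value.
        self.edge_bias = nn.Linear(d_edge, 1)
        self.edge_val = nn.Linear(d_edge, d_node)
        self.scale = d_node ** -0.5

    def forward(self, nodes: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, n, d_node); edges: (batch, n, n, d_edge)
        q, k, v = self.q(nodes), self.k(nodes), self.v(nodes)
        logits = torch.einsum("bid,bjd->bij", q, k) * self.scale
        logits = logits + self.edge_bias(edges).squeeze(-1)   # edge-aware scores
        attn = F.softmax(logits, dim=-1)                      # (batch, n, n)
        # Aggregate node values, letting each edge add its own contribution.
        out = torch.einsum("bij,bjd->bid", attn, v)
        out = out + torch.einsum("bij,bijd->bid", attn, self.edge_val(edges))
        return out

# Example usage with hypothetical dimensions:
# layer = EdgeAwareAttention(d_node=768, d_edge=64)
# out = layer(torch.randn(2, 10, 768), torch.randn(2, 10, 10, 64))
```

In the TextVQA setting, the edge feature for a pair of objects might encode, for instance, their relative bounding-box geometry, which is what would give such a model its handle on spatial relationships between visual and text objects.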
