Multimodal Graph Transformer for Multimodal Question Answering

04/30/2023
by Xuehai He, et al.

Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous amounts of data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs), which integrate prior information, can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism that incorporates multimodal graph information, acquired from text and visual data, into the vanilla self-attention as an effective prior. In particular, we construct a text graph, a dense region graph, and a semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Regularizing self-attention with graph information in this way significantly improves reasoning ability and helps align features from different modalities. We validate the effectiveness of the Multimodal Graph Transformer over its Transformer baselines on the GQA, VQAv2, and MultiModalQA datasets.
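The abstract does not give the exact formulation of the quasi-attention mechanism, but the core idea it describes, using a graph adjacency matrix as a prior that gates self-attention, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of a hard mask (rather than a learned soft bias), and the single-head setup are all assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_masked_attention(Q, K, V, adj, mask_value=-1e9):
    """Sketch of graph-regularized self-attention (hypothetical API).

    Q, K, V : (n, d) query/key/value matrices over n tokens/regions.
    adj     : (n, n) adjacency matrix from a text, region, or semantic
              graph; nonzero entries mark pairs allowed to attend.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # standard scaled dot-product scores
    # Use the graph as a hard prior: pairs with no edge receive a large
    # negative bias, so their attention weight vanishes after softmax.
    scores = np.where(adj > 0, scores, mask_value)
    weights = softmax(scores, axis=-1)            # (n, n) attention weights
    return weights @ V                            # graph-constrained output features
```

In practice, adjacency matrices from the different graphs (text, dense region, semantic) would be composed with the vision and language features, and the bias could be soft and learnable rather than a hard mask, but the gating principle is the same.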
