MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering

10/27/2020
by Aisha Urooj Khan, et al.

We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA), ensuring both individual and combined processing of multiple input modalities. Our approach benefits from processing multimodal data (video and text) by adopting BERT encodings for each modality individually and using a novel transformer-based fusion method to combine them. Our method decomposes the different sources of modalities into separate BERT instances with similar architectures but variable weights, and achieves SOTA results on the TVQA dataset. Additionally, we provide TVQA-Visual, an isolated diagnostic subset of TVQA that strictly requires knowledge of the visual (V) modality, based on a human annotator's judgment. This set of questions helps us study the model's behavior and the challenges TVQA poses that prevent superhuman performance. Extensive experiments show the effectiveness and superiority of our method.
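The architecture described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: small `nn.TransformerEncoder` stacks stand in for the per-modality BERT instances, the hidden size and layer counts are arbitrary placeholder values, and the `[CLS]`-style summary tokens and the single-score answer head are assumptions for the sake of a self-contained example. The key idea it shows is the paper's structure: separate encoders with identical architectures but independent weights per modality, whose outputs are then combined by a transformer-based fusion module.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Stand-in for one per-modality BERT instance (sizes are placeholders).

    Each modality gets its own instance, so the architectures match but the
    weights are independent, as in the paper's decomposition.
    """
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Learnable [CLS]-style token whose final state summarizes the sequence.
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        b = x.size(0)
        x = torch.cat([self.cls.expand(b, -1, -1), x], dim=1)
        return self.encoder(x)[:, 0]           # (batch, d_model) summary

class MMFT(nn.Module):
    """Transformer-based fusion over the per-modality summaries (a sketch)."""
    def __init__(self, d_model=64):
        super().__init__()
        self.text_enc = ModalityEncoder(d_model)   # separate weights per modality
        self.video_enc = ModalityEncoder(d_model)
        fusion_layer = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, 1)
        # Learnable fusion token attends over both modality summaries.
        self.fuse_tok = nn.Parameter(torch.zeros(1, 1, d_model))
        # Hypothetical head: one score per (question, candidate-answer) pair.
        self.head = nn.Linear(d_model, 1)

    def forward(self, text_feats, video_feats):
        b = text_feats.size(0)
        t = self.text_enc(text_feats).unsqueeze(1)     # (b, 1, d)
        v = self.video_enc(video_feats).unsqueeze(1)   # (b, 1, d)
        seq = torch.cat([self.fuse_tok.expand(b, -1, -1), t, v], dim=1)
        fused = self.fusion(seq)[:, 0]                 # fused representation
        return self.head(fused).squeeze(-1)            # (b,) answer scores
```

For a multiple-choice task like TVQA, such a score would be computed once per candidate answer and the five scores passed through a softmax; that detail, like the feature dimensions here, is an assumption of this sketch.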

Related research

07/06/2023  UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering
In recent years, artificial intelligence has played an important role in...

08/10/2019  Multi-modality Latent Interaction Network for Visual Question Answering
Exploiting relationships between visual regions and question words has ...

04/30/2023  Multimodal Graph Transformer for Multimodal Question Answering
Despite the success of Transformer models in vision and language tasks, ...

12/14/2021  Dual-Key Multimodal Backdoors for Visual Question Answering
The success of deep learning has enabled advances in multimodal tasks th...

06/25/2023  Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
The ability to model intra-modal and inter-modal interactions is fundame...

01/31/2019  BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection
Multimodal representation learning is gaining more and more interest wit...

09/09/2021  TxT: Crossmodal End-to-End Learning with Transformers
Reasoning over multiple modalities, e.g. in Visual Question Answering (V...
