MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering

10/27/2020
by Aisha Urooj Khan, et al.

We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings), an approach to Visual Question Answering (VQA) that processes multiple input modalities both individually and jointly. Our approach encodes each modality of the multimodal input (video and text) individually with BERT and then fuses the encodings with a novel transformer-based fusion method. The different input sources are decomposed into separate BERT instances that share a similar architecture but have different weights. Our method achieves state-of-the-art (SOTA) results on the TVQA dataset. Additionally, we provide TVQA-Visual, an isolated diagnostic subset of TVQA whose questions strictly require knowledge of the visual (V) modality, as judged by a human annotator. This set of questions helps us study the model's behavior and the challenges in TVQA that stand in the way of superhuman performance. Extensive experiments show the effectiveness and superiority of our method.
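To make the "encode separately, then fuse" idea concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the per-modality vectors are random stand-ins for pooled BERT outputs, and the single-head attention step, the fusion token, and all names (`fuse`, `fuse_token`, `video_vec`, `text_vec`) are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [dot(row, vec) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random projection; a trained model would learn these weights."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def fuse(stream_vectors, fuse_token, w_q, w_k, w_v):
    """Single-head attention in which a fusion token attends over itself
    and one vector per modality stream, yielding one fused vector."""
    tokens = [fuse_token] + stream_vectors
    query = matvec(w_q, fuse_token)
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim_out = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_out)]
    return fused, weights


rng = random.Random(0)
dim = 8

# Hypothetical stand-ins for the pooled outputs of two separate encoders
# (one per modality); a real system would obtain these from BERT instances.
video_vec = [rng.gauss(0, 1) for _ in range(dim)]
text_vec = [rng.gauss(0, 1) for _ in range(dim)]

# Hypothetical fusion token and projections, randomly initialised here.
fuse_token = [rng.gauss(0, 1) for _ in range(dim)]
w_q = random_matrix(dim, dim, rng)
w_k = random_matrix(dim, dim, rng)
w_v = random_matrix(dim, dim, rng)

fused, weights = fuse([video_vec, text_vec], fuse_token, w_q, w_k, w_v)
print("attention over [fuse, video, text]:", [round(w, 3) for w in weights])
```

The attention weights show how much each stream contributes to the fused vector; in a trained model that vector would typically feed an answer classifier.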


09/27/2021

Multimodal Integration of Human-Like Attention in Visual Question Answering

Human-like attention as a supervisory signal to guide neural attention h...
08/10/2019

Multi-modality Latent Interaction Network for Visual Question Answering

Exploiting relationships between visual regions and question words have ...
12/14/2021

Dual-Key Multimodal Backdoors for Visual Question Answering

The success of deep learning has enabled advances in multimodal tasks th...
06/30/2020

BERTERS: Multimodal Representation Learning for Expert Recommendation System with Transformer

The objective of an expert recommendation system is to trace a set of ca...
01/31/2019

BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection

Multimodal representation learning is gaining more and more interest wit...
09/26/2020

Techniques to Improve Q&A Accuracy with Transformer-based models on Large Complex Documents

This paper discusses the effectiveness of various text processing techni...
09/09/2021

TxT: Crossmodal End-to-End Learning with Transformers

Reasoning over multiple modalities, e.g. in Visual Question Answering (V...
