TxT: Crossmodal End-to-End Learning with Transformers

09/09/2021
by Jan-Martin O. Steitz, et al.

Reasoning over multiple modalities, e.g., in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today's multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today's multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning with regard to their integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.


