Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input

06/25/2023
by Qingpei Guo, et al.

The ability to model intra-modal and inter-modal interactions is fundamental in multimodal machine learning. Current state-of-the-art models usually adopt deep learning architectures with fixed structures. They can achieve exceptional performance on specific tasks, but their fixed structures make them vulnerable to modality mismatch arising from the diversity of input modalities. In this paper, we present Switch-BERT for joint vision-and-language representation learning to address this problem. Switch-BERT extends the BERT architecture by introducing learnable layer-wise and cross-layer interactions: it learns to select the optimal attention from a set of attention modes representing these interactions. A distinctive property of the model is that it learns to attend to outputs from various depths, thereby mitigating the modality mismatch problem. We present extensive experiments on visual question answering, image-text retrieval, and referring expression comprehension. The results confirm that, whereas alternative architectures such as ViLBERT and UNITER may excel on particular tasks, Switch-BERT consistently achieves performance better than or comparable to the current state of the art across these tasks. Ablation studies indicate that the proposed model owes its superior performance to its ability to learn task-specific multimodal interactions.
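The idea of choosing among attention modes over intra-modal and inter-modal interactions can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the mode masks, the `gate_logits` scoring, and the hard argmax switch are all illustrative assumptions standing in for whatever learnable selection mechanism Switch-BERT actually uses.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask):
    # Scaled dot-product attention restricted by a boolean mask:
    # positions where mask is False receive (effectively) zero weight.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

# Toy joint sequence: 3 text tokens followed by 2 image regions.
n_text, n_img, d = 3, 2, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n_text + n_img, d))

modal = np.array([0] * n_text + [1] * n_img)   # modality id per position
same = modal[:, None] == modal[None, :]

# Three example attention modes over the joint sequence.
masks = {
    "intra": same,                              # attend within own modality
    "inter": ~same,                             # attend across modalities only
    "joint": np.ones_like(same, dtype=bool),    # unrestricted attention
}

# Hypothetical switch: score each mode and pick the best one
# (a stand-in for a learned, differentiable gate).
gate_logits = {m: float(rng.normal()) for m in masks}
mode = max(gate_logits, key=gate_logits.get)
out = attention(x, x, x, masks[mode])
print(mode, out.shape)   # chosen mode and a (5, 4) output
```

In a trained model the gate would be parameterized and optimized per layer and per task, so different layers can settle on different interaction modes.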


Related research:

- M-BERT: Injecting Multimodal Information in the BERT Structure (08/15/2019)
- Multimodal Unified Attention Networks for Vision-and-Language Interactions (08/12/2019)
- MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering (10/27/2020)
- Dynamic Fusion with Intra- and Inter-Modality Attention Flow for Visual Question Answering (12/13/2018)
- Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (02/07/2022)
- Multimodal Deep Learning (01/12/2023)
- Multimodal Residual Learning for Visual QA (06/05/2016)
