Zorro: the masked multimodal transformer

01/23/2023
by Adria Recasens, et al.

Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, requiring very little fusion engineering. The resulting representations are, however, fully entangled throughout the network, which is not always desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise training collapses; in inference, it should be possible to evaluate audio-visual models on benchmarks that have only audio or only video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin, and HiP) and show that, with contrastive pre-training, Zorro achieves state-of-the-art results on the most relevant multimodal benchmarks (AudioSet and VGGSound). Furthermore, the resulting models can perform unimodal inference on both video and audio benchmarks such as Kinetics-400 and ESC-50.
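To make the routing concrete, here is a minimal sketch of Zorro-style masked attention in PyTorch, assuming a token sequence laid out as [audio | video | fusion]. The unimodal blocks of the mask keep audio and video tokens attending only within their own modality, while fusion tokens attend to everything. The names (zorro_attention_mask, masked_self_attention) and the single-head, identity-projection attention are illustrative simplifications, not the authors' implementation.

    import torch

    def zorro_attention_mask(n_audio: int, n_video: int, n_fusion: int) -> torch.Tensor:
        # Boolean mask (True = may attend) over a sequence laid out as
        # [audio | video | fusion]. Audio and video tokens only see their
        # own modality, so their representations stay modality-pure;
        # fusion tokens see everything and carry the cross-modal mixing.
        n = n_audio + n_video + n_fusion
        mask = torch.zeros(n, n, dtype=torch.bool)
        a = slice(0, n_audio)
        v = slice(n_audio, n_audio + n_video)
        f = slice(n_audio + n_video, n)
        mask[a, a] = True   # audio -> audio
        mask[v, v] = True   # video -> video
        mask[f, :] = True   # fusion -> audio, video and fusion
        return mask

    def masked_self_attention(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Single-head attention with the routing mask; disallowed pairs get
        # -inf before the softmax so they receive zero attention weight.
        d = x.shape[-1]
        scores = x @ x.transpose(-2, -1) / d ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ x

    # Usage (hypothetical sizes): the audio slice of the output never mixes
    # in video information.
    n_a, n_v, n_f = 4, 6, 2
    x = torch.randn(n_a + n_v + n_f, 16)
    out = masked_self_attention(x, zorro_attention_mask(n_a, n_v, n_f))
    audio_only = out[:n_a]  # modality-pure audio representation

Because the mask is the only coupling point, the modality-pure slices can feed a contrastive audio-visual loss without collapse, and the same backbone can be evaluated on audio-only or video-only benchmarks.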


Related research

06/30/2021 · Attention Bottlenecks for Multimodal Fusion
Humans perceive the world by concurrently processing and fusing high-dim...

04/22/2021 · VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
We present a framework for learning multimodal representations from unla...

03/21/2023 · ModEFormer: Modality-Preserving Embedding for Audio-Video Synchronization using Transformers
Lack of audio-video synchronization is a common problem during televisio...

10/27/2022 · Multimodal Transformer Distillation for Audio-Visual Synchronization
Audio-visual synchronization aims to determine whether the mouth movemen...

06/08/2023 · Factorized Contrastive Learning: Going Beyond Multi-view Redundancy
In a wide range of multimodal tasks, contrastive learning has become a p...

12/09/2022 · Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers
Previous studies have explored generating accurately lip-synced talking ...

12/02/2019 · Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events
We tackle the task of environmental event classification by drawing insp...
