Audiovisual Masked Autoencoders

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
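
The abstract names the core recipe, masked autoencoding over joint audiovisual inputs, without spelling out an architecture. Below is a minimal, hypothetical PyTorch sketch of what such a model could look like, not the authors' implementation: audio-spectrogram patches and video patches are embedded into a shared token space with learned modality embeddings, a large random fraction of tokens is masked, only visible tokens pass through the encoder, and a light decoder reconstructs the masked patches with an MSE loss on masked positions. All module names, dimensions, layer counts, and the 75% mask ratio are illustrative assumptions.

```python
# Hypothetical sketch of audiovisual masked autoencoding (not the paper's
# exact model). Audio and video patch tokens share one encoder; a high mask
# ratio hides most tokens and the decoder reconstructs the hidden patches.
import torch
import torch.nn as nn

class AudiovisualMAE(nn.Module):
    def __init__(self, dim=256, num_audio_tokens=64, num_video_tokens=128,
                 patch_dim=768, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)             # shared patch embedding
        n = num_audio_tokens + num_video_tokens
        self.pos = nn.Parameter(torch.zeros(1, n, dim))    # joint positional table
        self.modality = nn.Parameter(torch.zeros(2, dim))  # audio / video embedding
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        dec = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, patch_dim)              # reconstruct raw patches

    def forward(self, audio_patches, video_patches):
        # audio_patches: (B, Ta, patch_dim); video_patches: (B, Tv, patch_dim)
        a = self.embed(audio_patches) + self.modality[0]
        v = self.embed(video_patches) + self.modality[1]
        x = torch.cat([a, v], dim=1) + self.pos            # (B, N, dim)
        B, N, D = x.shape
        keep = int(N * (1 - self.mask_ratio))
        ids = torch.rand(B, N, device=x.device).argsort(1) # random shuffle per sample
        ids_keep, ids_mask = ids[:, :keep], ids[:, keep:]
        visible = x.gather(1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)                     # encode visible tokens only
        # append learned mask tokens, then unshuffle back to original order
        mask_tok = self.mask_token.expand(B, N - keep, D)
        full = torch.cat([latent, mask_tok], dim=1)
        restore = ids.argsort(dim=1)
        full = full.gather(1, restore.unsqueeze(-1).expand(-1, -1, D)) + self.pos
        pred = self.head(self.decoder(full))               # (B, N, patch_dim)
        target = torch.cat([audio_patches, video_patches], dim=1)
        # MSE averaged over masked positions only, as in image MAE
        mask = torch.zeros(B, N, device=x.device)
        mask.scatter_(1, ids_mask, 1.0)
        loss = ((pred - target) ** 2).mean(-1)
        return (loss * mask).sum() / mask.sum()

if __name__ == "__main__":
    # dummy batch: 64 audio patches and 128 video patches, 768-dim each
    loss = AudiovisualMAE()(torch.randn(2, 64, 768), torch.randn(2, 128, 768))
    print(loss.item())
```

One design point worth noting under these assumptions: encoding only the visible tokens (rather than replacing masked tokens in place) is what makes high mask ratios cheap, since the encoder processes only ~25% of the sequence.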
