Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

12/14/2022
by Alexei Baevski et al.

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder, and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec, which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x less pre-training time; on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time; and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy yields an ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
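The three efficiency ideas in the abstract (skip encoding masked tokens, use a lightweight decoder, and amortize the teacher forward pass over several masked versions of the same sample) can be illustrated with a toy sketch. This is not the authors' implementation: the linear "encoders", the mean-pooling "decoder", and all dimensions below are illustrative stand-ins, assuming an EMA teacher as in data2vec.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 16, 8                     # sequence length, feature dim (toy values)
x = rng.normal(size=(T, D))      # one input sample, already tokenized

# Toy "encoders": one linear layer each, standing in for transformers.
W_student = rng.normal(size=(D, D)) * 0.1
W_teacher = W_student.copy()     # teacher starts as a copy of the student

def encode(tokens, W):
    return np.tanh(tokens @ W)

# 1) The teacher runs ONCE on the full, unmasked input; its contextualized
#    outputs serve as regression targets.
targets = encode(x, W_teacher)

# 2) That teacher effort is amortized over M masked versions of the sample.
M, mask_ratio = 4, 0.5
loss = 0.0
for _ in range(M):
    mask = np.zeros(T, dtype=bool)
    mask[rng.choice(T, size=int(T * mask_ratio), replace=False)] = True

    visible = x[~mask]           # the student never encodes masked tokens
    h = encode(visible, W_student)

    # Toy "decoder": mean-pool visible features as the prediction for every
    # masked slot (a stand-in for the fast convolutional decoder).
    pred = np.tile(h.mean(axis=0), (int(mask.sum()), 1))
    loss += np.mean((pred - targets[mask]) ** 2)
loss /= M

# 3) Teacher weights track the student via an exponential moving average.
tau = 0.999
W_teacher = tau * W_teacher + (1 - tau) * W_student
```

Because the student only sees the visible tokens and the decoder is shallow, each of the M masked views costs far less than a full forward pass, while the expensive teacher pass is paid only once per sample.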

Related research

- 02/07/2022: data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
  "While the general idea of self-supervised learning is identical across m..."
- 04/14/2022: Masked Siamese Networks for Label-Efficient Learning
  "We propose Masked Siamese Networks (MSN), a self-supervised learning fra..."
- 10/12/2019: vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
  "We propose vq-wav2vec to learn discrete representations of audio segment..."
- 12/11/2022: Accelerating Self-Supervised Learning via Efficient Training Strategies
  "Recently the focus of the computer vision community has shifted from exp..."
- 06/15/2021: Self-Supervised Learning with Kernel Dependence Maximization
  "We approach self-supervised learning of image representations from a sta..."
- 11/19/2022: Domain-Adaptive Self-Supervised Pre-Training for Face Body Detection in Drawings
  "Drawings are powerful means of pictorial abstraction and communication. ..."
- 07/03/2020: Noise2Filter: fast, self-supervised learning and real-time reconstruction for 3D Computed Tomography
  "At X-ray beamlines of synchrotron light sources, the achievable time-res..."
