BEATs: Audio Pre-Training with Acoustic Tokenizers

12/18/2022
by Sanyuan Chen, et al.

Self-supervised learning (SSL) has seen massive growth in the language, vision, speech, and audio domains over the past few years. While discrete label prediction is widely adopted for other modalities, state-of-the-art audio SSL models still employ a reconstruction loss for pre-training. Compared with reconstruction loss, semantic-rich discrete label prediction encourages the SSL model to abstract high-level audio semantics and discard redundant details, as in human perception. However, a semantic-rich acoustic tokenizer for general audio pre-training is usually not straightforward to obtain, because audio is continuous and, unlike speech, lacks phoneme sequences. To tackle this challenge, we propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers, in which an acoustic tokenizer and an audio SSL model are optimized in alternating iterations. In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model with a mask-and-label prediction objective. Then we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model. The iteration is repeated with the hope that the acoustic tokenizer and the audio SSL model mutually improve each other. The experimental results demonstrate that our acoustic tokenizers can generate discrete labels with rich audio semantics, and that our audio SSL models achieve state-of-the-art results across various audio classification benchmarks, even outperforming previous models that use significantly more training data and model parameters. Specifically, we set a new state-of-the-art mAP of 50.6% on AudioSet-2M for audio-only models without using any external data, and 98.1% accuracy on ESC-50. The code and pre-trained models are available at https://aka.ms/beats.
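Read as an algorithm, the abstract describes a two-player loop: a tokenizer turns audio patches into discrete labels, an SSL model is trained to predict those labels at masked positions, and a better tokenizer is then distilled from the trained model. The sketch below illustrates the first iteration in PyTorch under stated assumptions; every class name, dimension, and hyperparameter is illustrative, not the authors' actual implementation (which is at https://aka.ms/beats).

```python
# Minimal, hypothetical sketch of BEATs iteration 1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomProjectionTokenizer(nn.Module):
    """Cold-start tokenizer: a frozen random projection plus a frozen random
    codebook; each audio patch is labeled with its nearest codebook entry."""
    def __init__(self, patch_dim=256, code_dim=64, codebook_size=1024):
        super().__init__()
        self.proj = nn.Linear(patch_dim, code_dim, bias=False)
        self.codebook = nn.Parameter(torch.randn(codebook_size, code_dim))
        for p in self.parameters():          # random and frozen in iteration 1
            p.requires_grad = False

    @torch.no_grad()
    def forward(self, patches):              # (B, T, patch_dim) -> (B, T) int labels
        z = F.normalize(self.proj(patches), dim=-1)
        codes = F.normalize(self.codebook, dim=-1)
        return (z @ codes.T).argmax(dim=-1)  # nearest code by cosine similarity

class AudioSSLModel(nn.Module):
    """Transformer encoder trained to predict the tokenizer's discrete labels
    at masked positions (mask-and-label prediction)."""
    def __init__(self, patch_dim=256, dim=768, codebook_size=1024):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_emb = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, patches, mask):        # mask: (B, T) bool, True = masked
        x = self.embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        h = self.encoder(x)
        return h, self.head(h)               # features and per-patch label logits

def pretrain_step(model, tokenizer, patches, mask_ratio=0.75):
    """One mask-and-label update: cross-entropy between the model's predictions
    and the tokenizer's labels, computed at masked patches only."""
    mask = torch.rand(patches.shape[:2]) < mask_ratio
    labels = tokenizer(patches)
    _, logits = model(patches, mask)
    return F.cross_entropy(logits[mask], labels[mask])

# Toy usage: random tensors stand in for spectrogram patches of real audio.
patches = torch.randn(2, 100, 256)
loss = pretrain_step(AudioSSLModel(), RandomProjectionTokenizer(), patches)
loss.backward()
# In later iterations, the random tokenizer would be replaced by a learnable
# one distilled from the (pre-trained or fine-tuned) SSL model, and the
# pre-training step above repeated with the new, more semantic labels.
```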


