EnCodecMAE: Leveraging neural codecs for universal audio representation learning

09/14/2023
by Leonardo Pepino, et al.

The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music or environmental sounds. To approach this problem, methods inspired by self-supervised models from NLP, like BERT, are often adapted to audio. These models rely on the discrete nature of text, so adopting this type of approach for audio processing requires either a change in the learning objective or a mapping of the audio signal to a set of discrete classes. In this work, we explore the use of EnCodec, a neural audio codec, to generate discrete targets for learning a universal audio model based on a masked autoencoder (MAE). We evaluate this approach, which we call EnCodecMAE, on a wide range of audio tasks spanning speech, music and environmental sounds, achieving performance comparable to or better than that of leading audio representation models.
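The idea of predicting discrete targets at masked positions can be illustrated with a toy sketch. It is not the authors' implementation. A nearest-neighbour lookup in a random codebook stands in for the neural codec, an all-zero logit table stands in for an untrained model, and the loss is taken over masked frames only, as is common in BERT-style training. The names `quantize`, `sample_mask` and `masked_cross_entropy`, the sizes, and the masking ratio are all assumptions made for illustration; the abstract specifies none of them.

```python
import math
import random


def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector.

    Toy stand-in for a neural codec: it only shows how continuous
    frames become discrete class labels.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))
        for frame in frames
    ]


def sample_mask(num_frames, mask_ratio, rng):
    """Pick the frame positions that would be hidden from the model."""
    num_masked = max(1, round(mask_ratio * num_frames))
    masked = set(rng.sample(range(num_frames), num_masked))
    return [i in masked for i in range(num_frames)]


def masked_cross_entropy(logits, targets, mask):
    """Average cross-entropy over the masked positions only."""
    total, count = 0.0, 0
    for row, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count


rng = random.Random(0)
# Hypothetical sizes: 8 discrete classes, 20 frames of 4 features each.
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
frames = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]

targets = quantize(frames, codebook)  # discrete targets
mask = sample_mask(len(frames), mask_ratio=0.5, rng=rng)

# An untrained model is mimicked by uniform logits over the classes.
uniform_logits = [[0.0] * len(codebook) for _ in frames]
loss = masked_cross_entropy(uniform_logits, targets, mask)
print(f"masked frames: {sum(mask)}, loss: {loss:.4f}")
```

With uniform logits the loss equals the log of the number of classes, which is the chance-level baseline a trained model would be expected to beat.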


