Natural Language Supervision for General-Purpose Audio Representations

09/11/2023
by Benjamin Elizalde, et al.

Audio-language models jointly learn multimodal text and audio representations that enable zero-shot inference. These models rely on their encoders to create powerful representations of the input and to generalize across multiple tasks spanning sound, music, and speech. Although such models have achieved remarkable performance, a gap remains relative to task-specific models. In this paper, we propose a Contrastive Language-Audio Pretraining model that is pretrained on a diverse collection of 4.6M audio-text pairs and employs two innovative encoders for zero-shot inference. To learn audio representations, we trained an audio encoder on 22 audio tasks, rather than on sound event classification alone. To learn language representations, we trained an autoregressive decoder-only model instead of the standard encoder-only models. The audio and language representations are then brought into a joint multimodal space using contrastive learning. Our encoders improve downstream performance by a clear margin. We extensively evaluated the generalization of our representations on 26 downstream tasks, the largest such evaluation in the literature. Our model achieves state-of-the-art results on several tasks, leading the way towards general-purpose audio representations.
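To make the contrastive step concrete, here is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss over matched audio-text embedding pairs, followed by zero-shot classification via cosine similarity against class-prompt embeddings. The function names, temperature value, and embedding dimensionality are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of contrastive language-audio pretraining (CLIP-style)
# and zero-shot classification. Shapes and hyperparameters are assumed.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over pairwise audio-text similarities."""
    # L2-normalize so dot products are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (N, N) similarity logits for a batch of N matched pairs.
    logits = audio_emb @ text_emb.t() / temperature
    # Matched pairs sit on the diagonal; contrast in both directions.
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def zero_shot_classify(audio_emb, class_text_embs):
    """Pick the class whose text (prompt) embedding best matches the audio."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    return (audio_emb @ class_text_embs.t()).argmax(dim=-1)

if __name__ == "__main__":
    # Toy example: batch of 8 pairs in an assumed 512-dim joint space.
    audio = torch.randn(8, 512)
    text = torch.randn(8, 512)
    print(contrastive_loss(audio, text))
```

At inference, zero-shot classification embeds one text prompt per class label and selects the class with the highest cosine similarity to the audio embedding, requiring no task-specific training.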


