Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

11/12/2022
by   Yusong Wu, et al.
0

Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/11/2023

Natural Language Supervision for General-Purpose Audio Representations

Audio-Language models jointly learn multimodal text and audio representa...
research
06/09/2022

CLAP: Learning Audio Concepts From Natural Language Supervision

Mainstream Audio Analytics models are trained to learn under the paradig...
research
02/02/2021

PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation

Audio event classification is an active research area and has a wide ran...
research
04/07/2022

Unified Contrastive Learning in Image-Text-Label Space

Visual recognition is recently learned via either supervised learning on...
research
09/17/2023

Zero- and Few-shot Sound Event Localization and Detection

Sound event localization and detection (SELD) systems estimate direction...
research
04/12/2023

RECLIP: Resource-efficient CLIP by Training with Small Images

We present RECLIP (Resource-efficient CLIP), a simple method that minimi...
research
08/31/2023

PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords

This study presents a novel zero-shot user-defined keyword spotting mode...

Please sign up or login with your details

Forgot password? Click here to reset