WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

03/30/2023
by Xinhao Mei, et al.

Audio-language (AL) multimodal learning has advanced significantly in recent years. However, existing audio-language datasets are costly and time-consuming to collect and are therefore limited in size. To address this data scarcity, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the raw descriptions harvested online are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, in which ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. We hope that WavCaps will facilitate research in audio-language multimodal learning and demonstrate the potential of using ChatGPT to enhance academic research. Our dataset and code are available at https://github.com/XinhaoMei/WavCaps.
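The abstract names a three-stage pipeline but does not say what each stage does. The sketch below is one plausible reading, not the authors' implementation: it assumes a rule-based pre-filter, a language-model rewrite step, and a rule-based post-filter. Every name in it (`Clip`, `prefilter`, `transform`, `postfilter`, `build_captions`, `toy_rewrite`) and both word-count thresholds are hypothetical, and the language model is represented by an injected callable rather than a real ChatGPT call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Clip:
    audio_id: str
    raw_description: str
    caption: Optional[str] = None


def prefilter(clips: List[Clip], min_words: int = 2) -> List[Clip]:
    """Stage 1 (assumed): drop clips whose raw description is too short to use."""
    return [c for c in clips if len(c.raw_description.split()) >= min_words]


def transform(clips: List[Clip],
              rewrite: Callable[[str], Optional[str]]) -> List[Clip]:
    """Stage 2: let a language model filter and rewrite each raw description.

    `rewrite` stands in for the model call; it returns None to reject a clip.
    """
    kept = []
    for c in clips:
        caption = rewrite(c.raw_description)
        if caption:
            kept.append(Clip(c.audio_id, c.raw_description, caption.strip()))
    return kept


def postfilter(clips: List[Clip], max_words: int = 30) -> List[Clip]:
    """Stage 3 (assumed): drop captions that are missing or overly long."""
    return [c for c in clips
            if c.caption and len(c.caption.split()) <= max_words]


def build_captions(clips: List[Clip],
                   rewrite: Callable[[str], Optional[str]]) -> List[Clip]:
    """Run the three stages in order."""
    return postfilter(transform(prefilter(clips), rewrite))


def toy_rewrite(description: str) -> Optional[str]:
    """Toy stand-in for the model: reject text with no letters, else tidy it."""
    text = description.strip()
    if not any(ch.isalpha() for ch in text):
        return None
    return text[0].upper() + text[1:].rstrip(".") + "."


if __name__ == "__main__":
    raw = [Clip("a", "dog barking loudly"),
           Clip("b", "12345 678"),
           Clip("c", "rain")]
    for clip in build_captions(raw, toy_rewrite):
        print(clip.audio_id, clip.caption)
```

Passing the model in as a callable keeps the filtering logic testable offline; a real run would replace `toy_rewrite` with a function that sends a prompt to the chosen model and parses its reply.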


