Weakly-supervised Automated Audio Captioning via text only training

09/21/2023
by   Theodoros Kouzelis, et al.
0

In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio and captions. Motivated by the recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP. During training, we learn to reconstruct the text from the CLAP text embedding, and during inference, we decode using the audio embeddings. To mitigate the modality gap between the audio and text embeddings we employ strategies to bridge the gap during training and inference stages. We evaluate our proposed method on Clotho and AudioCaps datasets demonstrating its ability to achieve a relative performance of up to  83% compared to fully supervised approaches trained with paired target data.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/14/2023

Training Audio Captioning Models without Audio

Automated Audio Captioning (AAC) is the task of generating natural langu...
research
03/30/2023

Prefix tuning for automated audio captioning

Audio captioning aims to generate text descriptions from environmental s...
research
04/01/2022

Learning Audio-Video Modalities from Image Captions

A major challenge in text-video and text-audio retrieval is the lack of ...
research
03/30/2023

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

The advancement of audio-language (AL) multimodal learning tasks has bee...
research
11/02/2022

Adversarial Guitar Amplifier Modelling With Unpaired Data

We propose an audio effects processing framework that learns to emulate ...
research
10/24/2020

Weakly-supervised VisualBERT: Pre-training without Parallel Images and Captions

Pre-trained contextual vision-and-language (V L) models have brought i...
research
09/06/2023

Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation

There has been significant research on developing pretrained transformer...

Please sign up or login with your details

Forgot password? Click here to reset