RECAP: Retrieval-Augmented Audio Captioning

09/18/2023
by Sreyan Ghosh, et al.

We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio clip and on other captions, similar to the audio, that are retrieved from a datastore. The proposed method can also transfer to any domain without additional fine-tuning. To generate a caption for an audio sample, we use the audio-text model CLAP to retrieve captions similar to it from a replaceable datastore; these captions are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition caption generation on the audio. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Moreover, because it can exploit a large datastore of text captions alone in a training-free fashion, RECAP is able to caption novel audio events never seen during training as well as compositional audio containing multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
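The retrieve-then-prompt step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors stand in for CLAP embeddings, the names `retrieve_captions` and `build_prompt` are hypothetical, and the prompt wording is invented for the example. The GPT-2 decoder and its cross-attention to the audio encoder are not shown.

```python
import math

# Toy datastore of (caption, embedding) pairs. In the described system the
# embeddings would come from CLAP; these hand-written vectors are an
# assumption used only to make the sketch runnable.
DATASTORE = [
    ("a dog barks in the distance", [0.9, 0.1, 0.0]),
    ("rain falls on a tin roof", [0.0, 0.2, 0.9]),
    ("a dog growls and then barks", [0.8, 0.3, 0.1]),
    ("a car engine idles", [0.1, 0.9, 0.2]),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_captions(audio_emb, datastore, k=2):
    """Return the k captions whose embeddings are most similar to the audio."""
    ranked = sorted(
        datastore,
        key=lambda item: cosine(audio_emb, item[1]),
        reverse=True,
    )
    return [caption for caption, _ in ranked[:k]]


def build_prompt(captions):
    """Assemble a text prompt from retrieved captions (hypothetical template)."""
    lines = ["Audios similar to this audio sound like:"]
    lines.extend(f"- {caption}" for caption in captions)
    lines.append("This audio sounds like:")
    return "\n".join(lines)


if __name__ == "__main__":
    query_embedding = [1.0, 0.2, 0.0]  # stand-in for an encoded audio clip
    retrieved = retrieve_captions(query_embedding, DATASTORE, k=2)
    print(build_prompt(retrieved))
```

Because the datastore is only a list of captions and their embeddings, it can be swapped for one from another domain without retraining anything, which is the property the abstract refers to as a replaceable datastore.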

research
04/01/2022

Learning Audio-Video Modalities from Image Captions

A major challenge in text-video and text-audio retrieval is the lack of ...
research
09/15/2023

Audio Difference Learning for Audio Captioning

This study introduces a novel training paradigm, audio difference learni...
research
06/04/2022

Automated Audio Captioning with Epochal Difficult Captions for Curriculum Learning

In this paper, we propose an algorithm, Epochal Difficult Captions, to s...
research
05/15/2023

A Whisper transformer for audio captioning trained with synthetic captions and transfer learning

The field of audio captioning has seen significant advancements in recen...
research
09/18/2023

Synth-AC: Enhancing Audio Captioning with Synthetic Supervision

Data-driven approaches hold promise for audio captioning. However, the d...
research
04/18/2022

Caption Feature Space Regularization for Audio Captioning

Audio captioning aims at describing the content of audio clips with huma...
research
08/23/2023

Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

We proposed Audio Difference Captioning (ADC) as a new extension task of...
