Training Audio Captioning Models without Audio

09/14/2023
by   Soham Deshmukh, et al.

Automated Audio Captioning (AAC) is the task of generating natural language descriptions of an audio stream. A typical AAC system requires manually curated training data consisting of audio segments and corresponding text caption annotations. Creating these audio-caption pairs is costly, which makes data scarce for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the output of the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose using noise injection or a learnable adapter during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.
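The encoder swap and the noise injection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are random stand-ins rather than real CLAP outputs, and the function names, the embedding size of 8, the noise standard deviation of 0.1, and the re-normalization step after adding noise are all assumptions made for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def inject_noise(text_emb, std, rng):
    """Add zero-mean Gaussian noise to a text embedding.

    Re-normalizing afterwards is an assumption of this sketch.
    """
    noisy = [v + rng.gauss(0.0, std) for v in text_emb]
    return l2_normalize(noisy)


def decoder_condition(embedding, training, std=0.1, rng=None):
    """Return the vector the caption decoder would be conditioned on.

    Training: a noise-perturbed text embedding.
    Inference: the audio embedding, passed through unchanged (normalized).
    """
    if training:
        return inject_noise(embedding, std, rng or random.Random())
    return l2_normalize(embedding)


rng = random.Random(0)

# Hypothetical stand-ins for the outputs of the pretrained text and audio encoders.
text_emb = l2_normalize([rng.gauss(0, 1) for _ in range(8)])
audio_emb = l2_normalize([rng.gauss(0, 1) for _ in range(8)])

train_cond = decoder_condition(text_emb, training=True, std=0.1, rng=rng)
infer_cond = decoder_condition(audio_emb, training=False)
```

The intuition behind the perturbation is that a decoder trained on slightly noisy text embeddings should tolerate the offset between text and audio embeddings at inference time. The learnable adapter mentioned in the abstract is an alternative to this step and is not shown here.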

