Caption Feature Space Regularization for Audio Captioning

04/18/2022
by   Yiming Zhang, et al.
0

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio, different people may perceive the same audio differently, resulting in caption disparities (i.e., one audio may correlate to several captions with diverse semantics). For that, general audio captioning models achieve the one-to-many training by randomly selecting a correlated caption as the ground truth for each audio. However, it leads to a significant variation in the optimization directions and weakens the model stability. To eliminate this negative effect, in this paper, we propose a two-stage framework for audio captioning: (i) in the first stage, via the contrastive learning, we construct a proxy feature space to reduce the distances between captions correlated to the same audio, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to encourage the model to be optimized in the direction that benefits all the correlated captions. We conducted extensive experiments on two datasets using four commonly used encoder and decoder architectures. Experimental results demonstrate the effectiveness of the proposed method. The code is available at https://github.com/PRIS-CV/Caption-Feature-Space-Regularization.

READ FULL TEXT
research
12/05/2022

Towards Generating Diverse Audio Captions via Adversarial Training

Automated audio captioning is a cross-modal translation task for describ...
research
07/09/2020

Multi-task Regularization Based on Infrequent Classes for Audio Captioning

Audio captioning is a multi-modal task, focusing on using natural langua...
research
09/18/2023

RECAP: Retrieval-Augmented Audio Captioning

We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and eff...
research
07/21/2021

CL4AC: A Contrastive Loss for Audio Captioning

Automated Audio captioning (AAC) is a cross-modal translation task that ...
research
09/15/2023

Audio Difference Learning for Audio Captioning

This study introduces a novel training paradigm, audio difference learni...
research
09/01/2023

CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

Automated Audio Captioning (AAC) involves generating natural language de...
research
04/07/2023

Graph Attention for Automated Audio Captioning

State-of-the-art audio captioning methods typically use the encoder-deco...

Please sign up or login with your details

Forgot password? Click here to reset