CL4AC: A Contrastive Loss for Audio Captioning

07/21/2021
by Xubo Liu, et al.

Automated audio captioning (AAC) is a cross-modal translation task that aims to describe the content of an audio clip in natural language. As shown by the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, in which the audio signal is encoded into a latent representation and aligned with its corresponding text descriptions, and a decoder then generates the captions. However, training an AAC system often suffers from data scarcity, which may lead to inaccurate representations and poor audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and text even when training with limited data. Experiments on the Clotho dataset show the effectiveness of the proposed approach.
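
The idea of contrasting paired and unpaired audio-text samples can be illustrated with a small sketch. The abstract does not specify the exact form of the loss, so the code below is not the authors' formulation; it swaps in a generic symmetric InfoNCE-style objective over a batch of embeddings as an assumption. The function names, the toy embeddings, and the `temperature` value are all hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity_matrix(audio_embs, text_embs):
    # Entry [i][j] is the cosine similarity between audio i and text j.
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) for tj in t] for ai in a]

def cross_entropy_diagonal(logits):
    # Mean of -log softmax(row i)[i]: the matching pair sits on the diagonal.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(audio_embs, text_embs, temperature=0.1):
    # Hypothetical InfoNCE-style loss, averaged over both directions
    # (audio-to-text and text-to-audio). Not taken from the paper.
    sims = cosine_similarity_matrix(audio_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits_t))

# Toy embeddings (made up): two audio clips and two captions.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[0.9, 0.1], [0.1, 0.9]]   # caption i describes clip i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]   # captions assigned to the wrong clips

loss_matched = contrastive_loss(audio, text_matched)
loss_swapped = contrastive_loss(audio, text_swapped)
print(loss_matched < loss_swapped)
```

In this sketch, correctly paired embeddings yield a lower loss than mismatched ones, which is the kind of training signal that lets a model learn audio-text correspondences from the paired data itself, without extra labels.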

research
09/14/2023

Training Audio Captioning Models without Audio

Automated Audio Captioning (AAC) is the task of generating natural langu...
research
07/09/2020

Multi-task Regularization Based on Infrequent Classes for Audio Captioning

Audio captioning is a multi-modal task, focusing on using natural langua...
research
05/12/2022

Automated Audio Captioning: an Overview of Recent Progress and New Challenges

Automated audio captioning is a cross-modal translation task that aims t...
research
08/05/2021

An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

Automated audio captioning aims to use natural language to describe the ...
research
10/10/2022

Automated Audio Captioning via Fusion of Low- and High- Dimensional Features

Automated audio captioning (AAC) aims to describe the content of an audi...
research
04/17/2018

Bootstrapping Generators from Noisy Data

A core step in statistical data-to-text generation concerns learning cor...
research
04/18/2022

Caption Feature Space Regularization for Audio Captioning

Audio captioning aims at describing the content of audio clips with huma...
