Diverse Audio Captioning via Adversarial Training

10/13/2021
by Xinhao Mei, et al.

Audio captioning aims at automatically generating natural language descriptions for audio clips. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum likelihood estimation (MLE), which tends to make captions generic, simple and deterministic. As different people may describe an audio clip from different aspects using distinct words and grammatical structures, we argue that an audio captioning system should be able to generate diverse captions, both for a fixed audio clip and across similar audio clips. To address this problem, we propose an adversarial training framework for audio captioning based on a conditional generative adversarial network (C-GAN), which aims to improve the naturalness and diversity of generated captions. Unlike the continuous-valued data processed by a classical GAN, a sentence is composed of discrete tokens, and the discrete sampling process is non-differentiable. To address this issue, policy gradient, a reinforcement learning technique, is used to back-propagate the reward to the generator. The results show that our proposed model can generate more diverse captions than state-of-the-art methods.
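The policy-gradient workaround for non-differentiable sampling can be illustrated with a toy REINFORCE loop. This is a minimal sketch, not the authors' model: the "generator" is a single categorical distribution over a tiny vocabulary, and the reward function stands in for a discriminator score (all names, sizes, and the reward definition are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 5     # toy vocabulary size (real captioners use thousands of tokens)
SEQ_LEN = 4   # toy caption length
TARGET = 2    # token the toy "discriminator" rewards

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(tokens):
    # Stand-in for a discriminator score: fraction of tokens equal to TARGET.
    return np.mean(tokens == TARGET)

logits = np.zeros(VOCAB)   # "generator": one categorical policy shared across steps
lr = 0.5
baseline = 0.0             # running reward baseline reduces gradient variance

for step in range(300):
    probs = softmax(logits)
    # Discrete sampling: non-differentiable, so no gradient flows through it.
    tokens = rng.choice(VOCAB, size=SEQ_LEN, p=probs)
    r = reward(tokens)
    advantage = r - baseline
    baseline = 0.9 * baseline + 0.1 * r
    # REINFORCE: grad log p(token) for a softmax is (onehot - probs),
    # scaled by the (baselined) reward instead of a back-propagated loss.
    grad = np.zeros(VOCAB)
    for t in tokens:
        onehot = np.zeros(VOCAB)
        onehot[t] = 1.0
        grad += advantage * (onehot - probs)
    logits += lr * grad   # gradient *ascent* on expected reward

print(softmax(logits)[TARGET])
```

The key point the abstract relies on is visible here: the reward is only ever multiplied against log-probability gradients of the tokens that were actually sampled, so the generator improves without any gradient passing through the sampling step itself.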


Related research

- Towards Generating Diverse Audio Captions via Adversarial Training (12/05/2022): Automated audio captioning is a cross-modal translation task for describ...
- Towards Diverse and Natural Image Descriptions via a Conditional GAN (03/17/2017): Despite the substantial progress in recent years, the image captioning t...
- Towards Generating Stylized Image Captions via Adversarial Training (08/08/2019): While most image captioning aims to generate objective descriptions of i...
- Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training (03/30/2017): While strong progress has been made in image captioning over the last ye...
- Generating Diverse and Accurate Visual Captions by Comparative Adversarial Learning (04/03/2018): We study how to generate captions that are not only accurate in describi...
- Synth-AC: Enhancing Audio Captioning with Synthetic Supervision (09/18/2023): Data-driven approaches hold promise for audio captioning. However, the d...
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research (03/30/2023): The advancement of audio-language (AL) multimodal learning tasks has bee...
