CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

09/01/2023
by Etienne Labbé, et al.

Automated Audio Captioning (AAC) involves generating natural language descriptions of audio content using encoder-decoder architectures. An audio encoder produces audio embeddings that are fed to a decoder, usually a Transformer decoder, for caption generation. In this work, we describe our model, whose novelty compared to existing models lies in the use of a ConvNeXt architecture as the audio encoder, adapted from the vision domain to audio classification. This model, called CNext-trans, achieved state-of-the-art scores on the AudioCaps (AC) dataset and performed competitively on Clotho (CL), while using four to forty times fewer parameters than existing models. Because AC originates from AudioSet, we examine potential biases in this dataset by investigating the impact of an unbiased encoder on performance. Using the well-known CNN14 from PANNs as an unbiased encoder, for instance, we observed a 1.7 reduction in SPIDEr score (where higher scores indicate better performance). To improve cross-dataset performance, we conducted experiments combining multiple AAC datasets (AC, CL, MACS, WavCaps) for training. Although this strategy enhanced overall model performance across datasets, it still fell short of models trained specifically on a single target dataset, indicating the absence of a one-size-fits-all model. To mitigate performance gaps between datasets, we introduced a Task Embedding (TE) token, allowing the model to identify the source dataset of each input sample. We provide insights into the impact of these TEs on both the form (words) and content (sound event types) of the generated captions. The resulting model, named CoNeTTE, an unbiased CNext-trans model enriched with dataset-specific Task Embeddings, achieved SPIDEr scores of 44.1 [...]. Code is available at: https://github.com/Labbeti/conette-audio-captioning.
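To make the Task Embedding idea more concrete, here is a minimal sketch of how a dataset-specific token could be attached to the decoder input so that the model knows which dataset a sample comes from. It is an illustration under stated assumptions, not the authors' implementation. The token names (`<te_clotho>`, etc.), the helpers `build_vocab` and `decoder_input`, and the choice to place the task token before the start-of-sentence token are all hypothetical.

```python
# Hypothetical sketch of dataset-specific task tokens for caption decoding.
# Token names, helper names, and token placement are assumptions made for
# illustration; they are not taken from the paper or its repository.

DATASETS = ("audiocaps", "clotho", "macs", "wavcaps")


def build_vocab(words):
    """Map special tokens, one task token per dataset, and words to ids."""
    specials = ["<pad>", "<bos>", "<eos>"] + [f"<te_{name}>" for name in DATASETS]
    vocab = {}
    for token in specials + sorted(set(words)):
        vocab.setdefault(token, len(vocab))
    return vocab


def decoder_input(caption, dataset, vocab):
    """Return token ids for a caption, prefixed with the dataset's task token."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    # Assumption: the task token is placed before <bos>.
    tokens = [f"<te_{dataset}>", "<bos>"] + caption.lower().split()
    return [vocab[token] for token in tokens]


vocab = build_vocab("a dog barks".split())
clotho_ids = decoder_input("A dog barks", "clotho", vocab)
audiocaps_ids = decoder_input("A dog barks", "audiocaps", vocab)
```

In this sketch, the same caption yields two id sequences that differ only in their first element. At inference time, picking a task token would select the captioning style of the corresponding dataset.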


