Impact of visual assistance for automated audio captioning

11/18/2022
by Wim Boes, et al.

We study the impact of visual assistance for automated audio captioning. Utilizing multi-encoder transformer architectures, which have previously been employed to introduce vision-related information in the context of sound event detection, we analyze the usefulness of incorporating a variety of pretrained features. We perform experiments on a YouTube-based audiovisual data set and investigate the effect of applying the considered transfer learning technique in terms of a variety of captioning metrics. We find that only one of the considered kinds of pretrained features provides consistent improvements, while the others do not provide any noteworthy gains at all. Interestingly, the outcomes of prior research efforts indicate that the exact opposite is true in the case of sound event detection, leading us to conclude that the optimal choice of visual embeddings is strongly dependent on the task at hand. More specifically, visual features focusing on semantics appear appropriate in the context of automated audio captioning, while for sound event detection, time information seems to be more important.

research
06/09/2021

Audiovisual transfer learning for audio tagging and sound event detection

We study the merit of transfer learning for two sound recognition proble...
research
06/02/2023

Enhance Temporal Relations in Audio Captioning with Sound Event Detection

Automated audio captioning aims at generating natural language descripti...
research
09/26/2022

Multi-encoder attention-based architectures for sound recognition with partial visual assistance

Large-scale sound recognition data sets typically consist of acoustic re...
research
10/28/2022

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Audio captioning is the task of generating captions that describe the co...
research
09/01/2023

CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

Automated Audio Captioning (AAC) involves generating natural language de...
research
07/01/2020

A Transformer-based Audio Captioning Model with Keyword Estimation

One of the problems with automated audio captioning (AAC) is the indeter...
research
09/06/2023

Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation

There has been significant research on developing pretrained transformer...
