Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

10/28/2022
by Xubo Liu, et al.

Audio captioning is the task of generating captions that describe the content of audio clips. In the real world, many objects produce similar sounds, and such acoustically ambiguous sound events are difficult to identify from audio information alone. Accurately recognizing ambiguous sounds is therefore a major challenge for audio captioning systems. In this work, inspired by the audio-visual multi-modal perception of human beings, we propose visually-aware audio captioning, which uses visual information to help recognize ambiguous sounding objects. Specifically, we use an off-the-shelf visual encoder to process the video inputs and incorporate the extracted visual features into an audio captioning system. Furthermore, to better exploit complementary contexts from redundant audio-visual streams, we propose an audio-visual attention mechanism that adaptively integrates audio and visual information according to their confidence levels. Experimental results on AudioCaps, the largest publicly available audio captioning dataset, show that the proposed method achieves significant improvement over a strong baseline audio captioning system and is on par with the state-of-the-art result.
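
The abstract does not say how the confidence levels are computed or how the two streams are combined, so the following is only a minimal sketch of one way confidence-weighted audio-visual fusion could look, not the authors' method. It assumes a single decoder query vector, scaled dot-product attention over each modality, and a hypothetical confidence score defined as one minus the normalized entropy of the attention weights. The names `attend`, `confidence`, and `adaptive_fuse` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    """Scaled dot-product attention of one query over a list of feature vectors.

    Returns the context vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return context, weights


def confidence(weights):
    """Hypothetical confidence score (an assumption, not from the paper):
    1 minus the normalized entropy of the attention weights, so a peaked
    distribution scores near 1 and a uniform one scores 0."""
    n = len(weights)
    if n == 1:
        return 1.0
    entropy = -sum(w * math.log(w) for w in weights if w > 0.0)
    return 1.0 - entropy / math.log(n)


def adaptive_fuse(query, audio_feats, visual_feats):
    """Attend to each modality separately, then mix the two context vectors
    with gates derived from the per-modality confidence scores."""
    a_ctx, a_w = attend(query, audio_feats)
    v_ctx, v_w = attend(query, visual_feats)
    # Softmax over the two confidences gives gates that sum to 1.
    g_audio, g_visual = softmax([confidence(a_w), confidence(v_w)])
    fused = [g_audio * a + g_visual * v for a, v in zip(a_ctx, v_ctx)]
    return fused, (g_audio, g_visual)


if __name__ == "__main__":
    query = [1.0, 0.0]
    ambiguous_audio = [[0.5, 0.5], [0.5, 0.5]]   # attention cannot discriminate
    distinct_visual = [[5.0, 0.0], [0.0, 5.0]]   # attention is sharply peaked
    fused, gates = adaptive_fuse(query, ambiguous_audio, distinct_visual)
    print("fused context:", fused)
    print("gates (audio, visual):", gates)
```

In this toy setup the audio frames are indistinguishable to the query, so the visual stream receives the larger gate. A trained system would more plausibly learn the gating from data than use a fixed entropy heuristic.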


research
10/04/2021

Audio Captioning Using Sound Event Detection

This technical report proposes an audio captioning system for DCASE 2021...
research
07/09/2020

Multi-task Regularization Based on Infrequent Classes for Audio Captioning

Audio captioning is a multi-modal task, focusing on using natural langua...
research
06/27/2020

Listen carefully and tell: an audio captioning system based on residual learning and gammatone audio representation

Automated audio captioning is a machine listening task whose goal is to de...
research
03/17/2020

Multi-modal Dense Video Captioning

Dense video captioning is a task of localizing interesting events from a...
research
11/18/2022

Impact of visual assistance for automated audio captioning

We study the impact of visual assistance for automated audio captioning....
research
12/15/2021

Dense Video Captioning Using Unsupervised Semantic Information

We introduce a method to learn unsupervised semantic visual information ...
research
09/19/2023

FoleyGen: Visually-Guided Audio Generation

Recent advancements in audio generation have been spurred by the evoluti...