Audio Captioning Transformer

07/21/2021
by Xinhao Mei, et al.

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, in which the decoder predicts words from the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames of an audio signal, while RNNs can be limited in modelling long-range dependencies among those frames. In this paper, we propose the Audio Captioning Transformer (ACT), a full Transformer network based on an encoder-decoder architecture that is entirely convolution-free. The proposed method is better able to model the global information within an audio signal and to capture temporal relationships between audio events. We evaluate our model on AudioCaps, the largest publicly available audio captioning dataset. Our model shows competitive performance compared to other state-of-the-art approaches.
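To illustrate why a Transformer encoder can model global information across time frames, the following is a minimal sketch of scaled dot-product self-attention, the core operation in any Transformer. It is not the authors' implementation: the function names, the toy frame values, and the use of identity query/key/value projections are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention over a sequence of frames.

    Assumption: queries, keys and values are the frames themselves
    (identity projections); a real model learns these projections.

    frames: list of T feature vectors, each of length d.
    Returns (outputs, weights), where weights[i][j] is how strongly
    frame i attends to frame j.
    """
    d = len(frames[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in frames:
        # Compare this frame against every frame in the clip.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        w = softmax(scores)
        weights.append(w)
        # Output is a weighted mix of all frames, near or far in time.
        outputs.append(
            [sum(w[j] * frames[j][c] for j in range(len(frames))) for c in range(d)]
        )
    return outputs, weights


# Toy "audio" of 4 time frames with 2 features each (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
outputs, weights = self_attention(frames)
```

Every attention weight is strictly positive, so each output frame draws on the whole clip in a single step, regardless of how far apart two frames are in time. In this toy input, frame 0 attends to the identical frame 2 exactly as strongly as to itself, even though they are not adjacent.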

research
01/10/2022

Local Information Assisted Attention-free Decoder for Audio Captioning

Automated audio captioning (AAC) aims to describe audio data with captio...
research
04/07/2023

Graph Attention for Automated Audio Captioning

State-of-the-art audio captioning methods typically use the encoder-deco...
research
12/02/2021

TCTN: A 3D-Temporal Convolutional Transformer Network for Spatiotemporal Predictive Learning

Spatiotemporal predictive learning is to generate future frames given a ...
research
01/28/2022

Automatic Audio Captioning using Attention weighted Event based Embeddings

Automatic Audio Captioning (AAC) refers to the task of translating audio...
research
05/09/2022

Fatigue Prediction in Outdoor Running Conditions using Audio Data

Although running is a common leisure activity and a core training regime...
research
06/30/2017

Automated Audio Captioning with Recurrent Neural Networks

We present the first approach to automated audio captioning. We employ a...
research
10/31/2022

Audio Time-Scale Modification with Temporal Compressing Networks

We proposed a novel approach in the field of time-scale modification on ...
