WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

10/21/2020
by An Tran, et al.

Automated audio captioning (AAC) is a novel task in which a method takes an audio sample as input and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding: two for extracting the local and temporal information, and one to merge the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method utilizing the freely available splits of the Clotho dataset. Our results increase the highest previously reported SPIDEr from 16.2 to 17.3.
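To put the reported numbers in context, here is a minimal sketch of how a SPIDEr score is formed and how large the reported gain is. It assumes the usual definition of SPIDEr as the arithmetic mean of the CIDEr-D and SPICE scores. The function names and the component scores passed to `spider` are hypothetical and chosen only for illustration; only the values 16.2 and 17.3 come from the abstract.

```python
def spider(cider_d: float, spice: float) -> float:
    """Return SPIDEr, assumed here to be the mean of CIDEr-D and SPICE."""
    return (cider_d + spice) / 2.0


def relative_gain(new: float, old: float) -> float:
    """Return the improvement of `new` over `old` as a fraction of `old`."""
    return (new - old) / old


# Hypothetical component scores, NOT taken from the paper.
example_spider = spider(cider_d=0.40, spice=0.10)

# SPIDEr values reported in the abstract (in percent).
previous_best = 16.2
wavetransformer = 17.3

absolute = wavetransformer - previous_best
relative = relative_gain(wavetransformer, previous_best)

print(f"example SPIDEr from hypothetical components: {example_spider:.2f}")
print(f"absolute gain: {absolute:.1f} points")
print(f"relative gain: {relative:.1%}")
```

Under these assumptions, the reported improvement is 1.1 points in absolute terms, or roughly 6.8% relative to the previous best.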

research
07/06/2020

Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

Audio captioning is the task of automatically creating a textual descrip...
research
06/27/2020

Listen carefully and tell: an audio captioning system based on residual learning and gammatone audio representation

Automated audio captioning is machine listening task whose goal is to de...
research
07/08/2022

Automated Audio Captioning and Language-Based Audio Retrieval

This project involved participation in the DCASE 2022 Competition (Task ...
research
06/05/2020

Audio Captioning using Gated Recurrent Units

Audio captioning is a recently proposed task for automatically generatin...
research
10/04/2022

Learning the Spectrogram Temporal Resolution for Audio Classification

The audio spectrogram is a time-frequency representation that has been w...
research
12/28/2021

Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation

Video-to-Text (VTT) is the task of automatically generating descriptions...
research
01/08/2022

A novel audio representation using space filling curves

Since convolutional neural networks (CNNs) have revolutionized the image...
