Leveraging Pre-trained BERT for Audio Captioning

03/06/2022
by Xubo Liu, et al.

Audio captioning aims to describe the content of an audio clip in natural language. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and a language decoder is then used to generate the captions. Training an audio captioning system often encounters the problem of data scarcity. Transferring knowledge from pre-trained audio models such as Pre-trained Audio Neural Networks (PANNs) has recently emerged as a useful method to mitigate this issue. However, less attention has been paid to exploiting pre-trained language models for the decoder than to pre-trained audio models for the encoder. BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks; nevertheless, its potential as the language decoder for audio captioning has not been investigated. In this study, we demonstrate the efficacy of the pre-trained BERT model for audio captioning. Specifically, we apply PANNs as the encoder and initialize the decoder from publicly available pre-trained BERT models. We conduct an empirical study on the use of these BERT models for the decoder in an audio captioning model. Our models achieve results competitive with existing audio captioning methods on the AudioCaps dataset.
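The encoder-decoder design described above can be sketched in code: frame-level audio embeddings from a PANNs-style encoder are projected to BERT's hidden size and attended to by a BERT-initialized decoder through cross-attention. This is a minimal sketch using Hugging Face Transformers, not the authors' implementation; the 2048-dimensional audio feature size and names such as AudioCaptioner and audio_proj are illustrative assumptions.

```python
# Minimal sketch of an audio-captioning model with a PANNs-style encoder and a
# BERT-initialized decoder. Assumptions (not taken from the paper): the audio
# encoder outputs 2048-dimensional frame embeddings, and bert-base-uncased is
# used to initialize the decoder.

import torch
import torch.nn as nn
from transformers import BertConfig, BertLMHeadModel, BertTokenizer


class AudioCaptioner(nn.Module):
    def __init__(self, audio_feat_dim: int = 2048,
                 bert_name: str = "bert-base-uncased"):
        super().__init__()
        # Configure BERT as an autoregressive decoder with cross-attention
        # over the audio encoder's hidden states.
        config = BertConfig.from_pretrained(
            bert_name, is_decoder=True, add_cross_attention=True
        )
        self.decoder = BertLMHeadModel.from_pretrained(bert_name, config=config)
        # Project audio features to BERT's hidden size for cross-attention.
        self.audio_proj = nn.Linear(audio_feat_dim, config.hidden_size)

    def forward(self, audio_feats, input_ids, attention_mask, labels=None):
        # audio_feats: (batch, time, audio_feat_dim) frame-level embeddings
        # produced by an audio encoder such as PANNs (placeholder here).
        enc_states = self.audio_proj(audio_feats)
        return self.decoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            encoder_hidden_states=enc_states,
            labels=labels,  # when given, returns the captioning LM loss
        )


if __name__ == "__main__":
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = AudioCaptioner()
    batch = tokenizer(["a dog barks while birds chirp"], return_tensors="pt")
    audio_feats = torch.randn(1, 32, 2048)  # stand-in for PANNs frame embeddings
    out = model(audio_feats, batch["input_ids"], batch["attention_mask"],
                labels=batch["input_ids"])
    print(float(out.loss))
```

In this sketch the cross-attention weights are newly initialized while the rest of the decoder keeps the pre-trained BERT parameters, which mirrors the general idea of reusing a pre-trained language model as the captioning decoder.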

Related research

12/14/2020  Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval
The goal of audio captioning is to translate input audio into its descri...

10/12/2021  Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information
Automated audio captioning (AAC) has developed rapidly in recent years, ...

04/18/2022  Automated Audio Captioning using Audio Event Clues
Audio captioning is an important research area that aims to generate mea...

10/14/2021  Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
Automated audio captioning (AAC) is the task of automatically generating...

08/12/2022  An investigation on selecting audio pre-trained models for audio captioning
Audio captioning is a task that generates description of audio based on ...

07/09/2020  Multi-task Regularization Based on Infrequent Classes for Audio Captioning
Audio captioning is a multi-modal task, focusing on using natural langua...

06/17/2019  Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
Modern text-to-speech (TTS) systems are able to generate audio that soun...
