Automated Audio Captioning: an Overview of Recent Progress and New Challenges

by Xinhao Mei, et al.
University of Surrey

Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent years. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and employing different training strategies, which have greatly facilitated the development of this field. In this paper, we present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets. Moreover, we discuss open challenges and envisage possible future research directions.




1 Introduction

Sound is ubiquitous in our daily lives. It carries a wealth of information about the environment, from sound scenes to individual events happening around us. For most people, the ability to perceive and understand everyday sounds is taken for granted. However, extracting useful information from sounds is a challenging task for machines. With the development of machine learning, the field of machine listening has attracted increasing attention, with significant progress made in recent years in areas such as audio tagging [audioset, Xu2017unsupervised, kong2019weakly, Wang2020modelling], sound event detection [kong2019sed, kong2020sed_weakly, mesaros2021sound] and acoustic scene classification [barchiesi2015acoustic, Wang2021sa++]. However, in these areas, the focus has been mostly on identifying acoustic scenes or events in an audio clip, rather than considering relationships between the audio events and acoustic scenes.

Automated audio captioning (AAC) aims at describing the content of an audio clip using natural language. This is a cross-modal translation task at the intersection of audio processing and natural language processing (NLP) [drossos2017ac_1]. Compared with automatic speech recognition (ASR), audio captioning focuses only on the environmental sounds, rather than the voice content that may be present in an audio clip. Compared with other popular audio-related tasks, audio captioning not only needs to determine what audio events are present, but also needs to capture the spatial-temporal relationships between these audio events and express them in natural language. An example caption, taken from the Clotho dataset, is “a person was walking on a sidewalk adjacent to a school where children were playing on the playground”, which describes the scenes and sound events in the given audio clip.

Audio captioning has practical potential for various applications such as helping the hearing-impaired to understand environmental sounds, and analyzing sounds for video-based security surveillance systems. In addition, audio captioning can be used for multimedia retrieval [koepke2022audioretrieval], which can be applied in areas including education, film production, and web searching.

Figure 1: Overview of an encoder-decoder-based AAC system.

Unlike image and video captioning, which have been widely studied for almost a decade, audio captioning is a relatively new task that has been studied since 2017 [drossos2017ac_1]. In the past three years, this field has received increasing attention due to the release of freely available datasets and its inclusion as a task in the DCASE Challenges in 2020 and 2021. A number of papers about audio captioning have been published, with deep learning being a popular method. Specifically, the encoder-decoder framework has been adopted as a standard recipe for solving this cross-modal translation task. In this method, the encoder extracts audio features from the input audio clips, and the decoder generates captions based on the extracted audio features. Figure 1 shows an overview of an encoder-decoder based AAC system. Since analyzing audio largely depends on obtaining robust audio features, different kinds of neural networks have been investigated as encoders, such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs) [lecun2015deep] and Transformers [vaswani2017attention]. For the decoder, RNNs and Transformers are usually employed, inspired by works in NLP. In addition to the encoder-decoder framework, auxiliary information such as keywords or sentence information [koizumi2020keywords, xu2021audiocaption_car], attentive approaches [kim2019audiocaps, drossos2017ac_1] and different training strategies [xu2021_sjtu, Berg2021continual, Liu2021cl4ac] have been proposed to improve the performance of captioning systems. However, there is still a large gap between achieved results and human-level performance [kim2019audiocaps].

To the best of our knowledge, no survey papers on audio captioning have been published so far. In this paper, we aim to provide a comprehensive overview of audio captioning with the hope of stimulating novel research ideas. Our survey considers articles published up to February 2022. Since the encoder-decoder framework has been a standard recipe for AAC systems, we develop a taxonomy of acoustic encoding and text decoding approaches.

This paper is organized as follows. In Section 2 and Section 3, we discuss acoustic encoding and text decoding approaches respectively. Auxiliary information is discussed in Section 4. We discuss training strategies adopted in the literature in Section 5. Furthermore, we review popular evaluation metrics and main datasets in Section 6 and Section 7 respectively. Finally, we discuss some open challenges and future research directions in Section 8 and briefly conclude this paper in Section 9.

2 Acoustic encoding

Analyzing the content of an audio clip largely depends on obtaining an effective feature representation for it, which is the aim of the encoder in an AAC system. Time-domain waveforms are lengthy 1-D signals, so it is challenging for machines to directly identify sound events or sound scenes from raw waveforms [virtanen2018computational]. Current popular approaches for acoustic encoding consist of two steps: first extracting acoustic features, and then passing them into an encoder to obtain compact audio features. In this section, we first discuss popular acoustic features used in the literature, then audio encoding approaches, focusing on those based on deep neural networks.

2.1 Acoustic features

Time-frequency representations, such as spectrograms, are widely used as the acoustic features. To obtain a spectrogram, an audio signal is first split into short frames, as an audio segment with a length of around 20-60 ms is usually regarded as quasi-stationary [virtanen2018computational]. Then a window function is applied to each frame to enforce continuity and avoid spectral leakage [harris1978leakage]. After that, the short-time Fourier transform (STFT) is calculated for each time frame to get the spectrogram. The spectrogram is a 2-D representation whose horizontal axis is time and whose vertical axis is frequency; the value at each point of the spectrogram represents the energy at a specific time and frequency. Inspired by the selectivity of the human auditory system to different frequencies, the frequency axis of a spectrogram may be converted to different scales, resulting in representations such as the mel-spectrogram, log mel-spectrogram, and constant-Q transform [virtanen2018computational]. The log mel-spectrogram is popular due to its superior performance compared with other features [kong2020panns, hershey2017vggish, wu2019audiocaption_hospital]. In addition, mel-frequency cepstral coefficients (MFCCs) were used in some early works [kim2019audiocaps, Ikawa2019ac_spec]. MFCCs are calculated by applying a discrete cosine transform (DCT) to log mel-spectrograms. Compared with time-frequency representations, MFCCs contain less information and are only able to estimate the global spectral shape of an audio clip [virtanen2018computational], thus MFCCs are rarely used in recent works.
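The feature pipeline described above (framing, windowing, STFT, mel scaling, logarithm, and an optional DCT for MFCCs) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the front end of any surveyed system: the 40 ms frame length, 20 ms hop, Hann window, 64 mel bands, the common `2595 * log10(1 + f / 700)` mel formula, and all function names are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f_hz):
    # One common mel-scale formula (assumed here; other variants exist).
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, centre, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - lo) / (centre - lo)
        falling = (hi - bin_freqs) / (hi - centre)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(wave, sr=16000, frame_len=640, hop=320, n_mels=64):
    # 640 samples at 16 kHz is a 40 ms frame, inside the 20-60 ms range.
    n_frames = 1 + (len(wave) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(frame_len)       # window each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # STFT power spectrum
    mel = power @ mel_filterbank(n_mels, frame_len, sr).T
    return np.log(mel + 1e-10)                        # (n_frames, n_mels)

def mfcc(log_mel, n_mfcc=13):
    # Unnormalised DCT-II along the mel axis, keeping the first coefficients.
    n_mels = log_mel.shape[1]
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return log_mel @ basis.T

# Usage on a synthetic one-second 440 Hz tone (stand-in for a real clip).
sr = 16000
wave = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
log_mel = log_mel_spectrogram(wave, sr=sr)
coeffs = mfcc(log_mel)
print(log_mel.shape, coeffs.shape)
```

Keeping only the first few DCT coefficients is what discards fine spectral detail, which matches the remark that MFCCs mainly capture the global spectral shape.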

2.2 Neural networks

2.2.1 RNNs

RNNs are designed to process sequential data with variable lengths [rumelhart1986rnn]. Audio is a time-series signal, therefore RNNs were naturally adopted as encoders in initial works. In a simple recipe, an RNN is used to model temporal relationships between acoustic features, and the hidden states of the last layer of the RNN are regarded as the audio feature sequence. Drossos et al. [drossos2017ac_1] utilized a three-layered bi-directional gated recurrent unit (GRU) [chung2014gru] as the encoder. Further, rather than using multi-layer RNNs, Xu et al. [xu2021audiocaption_car] and Wu et al. [wu2019audiocaption_hospital] used a single-layered uni-directional GRU, while Ikawa et al. [Ikawa2019ac_spec] used a single-layered bi-directional long short-term memory (LSTM) [hochreiter1997lstm]. Acoustic features such as spectrograms usually have thousands of time steps. Nguyen et al. [Nguyen2020temporalsub] argued that the length of the captions is significantly less than the length of the audio features, making it difficult for captioning models to learn the correspondence between words and audio features. They proposed a temporal sub-sampling method to sub-sample the audio features between the RNN layers, and showed that the temporal sub-sampling of audio features could benefit audio captioning methods.

The main advantages of employing RNNs as encoders are their simplicity and their ability to process sequential data. However, using RNNs alone as the encoder has not been found to give strong performance. The reason might be that acoustic features are usually long sequences, and RNNs may not be able to effectively model long-range time dependencies. In addition, getting an audio representation of fixed size from long hidden states also leads to excessive compression of information, making it difficult for the language decoder to generate fine-grained descriptions.

2.2.2 CNNs

CNNs have been applied with great success to the field of computer vision (CV) [lecun2015deep]. In recent years, CNNs have been adapted to audio-related tasks and have shown a powerful ability to extract robust audio patterns [hershey2017vggish, kong2020panns]. CNNs treat the spectrograms as 1-channel images, and model local dependencies within the spectrograms.

Many CNN models pre-trained on large audio datasets have been published, and many works directly employ pre-trained CNN models as the audio encoder. VGG-like CNNs [chen2020ac_CNN, Mei2021ac_trans] and ResNets [Ye2021peking, Han2021netease, Perez-Castanos2020ac_listen] are popular choices as these networks perform well on audio-related tasks such as audio tagging and sound event detection [kong2020panns]. In addition, 1-D CNNs have also been incorporated to better exploit temporal patterns. For example, Eren et al. [eren2020acfake] and Han et al. [Han2021netease] used Wavegram-Logmel-CNN adapted from PANNs [kong2020panns], which takes the raw waveform for 1-D convolution and the spectrogram for 2-D convolution, and combines the outputs of the 1-D and 2-D convolutional layers in deep layers. Tran et al. [tran2020wavetransformer] also proposed to utilise 1-D and 2-D convolutions for extracting temporal and time-frequency information; however, they only used the spectrogram as input and reshaped it for 1-D convolution. To obtain global audio features, some methods use global pooling after the last convolutional block to summarize feature maps into a vector of fixed size [eren2020acfake], while others keep the time axis to get fine-grained temporal features and utilize an attention module to attend to the informative features when performing language decoding [Mei2021ac_trans, Ye2021peking].

CNNs outperform RNNs, and are now the dominant approach for audio encoding. The main advantages of CNNs are that they are invariant to time shift and good at modeling local dependencies within the spectrograms. However, CNNs have limited receptive fields, so modeling long-range time dependencies in a long audio clip requires a deep CNN.

2.2.3 CRNNs

Motivated by the demand for modelling the local and long-range dependencies simultaneously, convolutional recurrent neural networks (CRNNs) [shi2016crnn], a combination of CNNs and RNNs, have also been applied as the audio encoder. In a CRNN, the RNN layers are introduced after the CNN layers to model the temporal relationship between extracted CNN features. Kim et al. [kim2019audiocaps] proposed a top-down multi-scale encoder where the features are extracted from two layers of the VGGish network [hershey2017vggish], that is, a fully connected layer for extracting the high-level semantic features and a convolutional layer for extracting the mid-level features. Those features are then encoded by a two-layer bi-directional LSTM where the semantic features are injected in the second layer. Takeuchi et al. [Takeuchi2020ntt_first] and Xu et al. [xu2020crnn_sjtu] both adopted a similar CRNN encoder without using multi-level features. Xu et al. [xu2021invest_cnn_crnn] compared CNN and CRNN encoders and showed that the CRNN outperformed the CNN when the encoders were trained from scratch, but brought little improvement when pre-training was applied. In general, CRNNs need more computation than CNNs but offer limited improvement.

2.2.4 Other approaches

The Transformer and its variants have probably been the most popular models in the fields of NLP and CV since 2017 [vaswani2017attention]. Self-attention based encoders are also employed in recent works in audio captioning. Koizumi et al. [koizumi2020keywords] introduced a self-attention block after the CNN layers in the encoder to learn the temporal relationship between CNN features. Mei et al. [Mei2021ACT] proposed the Audio Captioning Transformer (ACT), where the encoder is a convolution-free Transformer that directly models the relationships between the split spectrograms. ACT shows comparable performance with CNN-based methods. In addition, convolution and self-attention can be combined as in [Narisetty2021ac_conformer] by leveraging a convolution-augmented Transformer (Conformer) [gulati2020conformer].

In summary, various neural network architectures have been investigated in order to obtain robust audio representations. Early works mainly adopted RNNs as encoders, and the trend has shifted from RNNs to CNNs and CRNNs due to their superior performance. Novel attention-based architectures are also strongly competitive in learning robust audio features.

3 Text decoding

The aim of the language decoder is to generate a caption given audio features from the encoder. Existing works all adopt an auto-regressive model, where each predicted word is conditioned on previous predictions. In addition to the main decoder block, there is often a word embedding layer before the main decoder block, which embeds each input word into a fixed-dimension vector. In this section, we first introduce popular word embeddings and then discuss main text decoding approaches.

3.1 Word embeddings

Word embedding is a means of converting discrete words into vectors so that text can be processed by neural networks. The simplest method is one-hot encoding, which embeds each word into a one-hot vector with the dimension equal to the size of the vocabulary. This method suffers from the curse of dimensionality and the loss of semantic relationship information, thus it is not widely used [jurafsky2009slp2]. Pre-trained word embeddings have become popular in recent years. These pre-trained word embeddings are trained using neural networks on large corpora, and can capture semantic information, that is, semantically similar words are close to each other in the embedding space [mikolov2013word2vec]. Word2Vec [mikolov2013word2vec], GloVe [pennington2014glove] and fastText [mikolov2018fasttext] are widely used in existing audio captioning works [kim2019audiocaps, chen2020ac_CNN, xu2020crnn_sjtu, koizumi2020keywords, eren2020acfake]. With the popularity of large pre-trained language models, Weck et al. [Weck2021ac_offtheshelf] employed BERT [devlin2019bert] as a feature extractor to obtain word embeddings. They also compared the effect of different pre-trained word embeddings and found that BERT embeddings led to the best performance, while others such as Word2Vec and GloVe also provided a slight improvement compared to randomly initialized word embeddings.
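To make the contrast between one-hot vectors and dense embeddings concrete, the following toy sketch may help. The six-word vocabulary and the randomly initialized embedding table are hypothetical stand-ins; in practice the table would hold pre-trained vectors such as Word2Vec or GloVe, which this example does not load.

```python
import numpy as np

# Hypothetical toy vocabulary; real captioning vocabularies have thousands of words.
vocab = ["<sos>", "<eos>", "a", "dog", "barks", "puppy"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Dimension grows with the vocabulary size.
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Stand-in for a pre-trained embedding table (random values, for shape only).
rng = np.random.default_rng(0)
embedding_dim = 4
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(word):
    # An embedding layer is a row lookup in the table.
    return embedding_table[word_to_id[word]]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Any two distinct one-hot vectors are orthogonal, so they carry no notion
# of semantic similarity, whatever the words mean.
print(cosine(one_hot("dog"), one_hot("puppy")))

# The lookup is equivalent to multiplying the one-hot vector by the table.
print(np.allclose(one_hot("dog") @ embedding_table, embed("dog")))
```

With trained embeddings, the cosine similarity between related words such as "dog" and "puppy" would typically be high, which is the semantic structure that one-hot encoding cannot express.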

3.2 Neural networks

Figure 2: Diagram of RNN-based language model.

3.2.1 RNNs

Sentences are also sequential data composed of discrete words, thus RNNs are popularly employed as the language decoder. Figure 2 shows a diagram of an RNN-based language decoder. The audio feature sequence from the audio encoder is first aggregated to get a global feature representation and then passed to the RNN decoder for caption generation. Drossos et al. [drossos2017ac_1] proposed a 2-layer GRU as the decoder in the initial work. Due to the limited availability of data, many subsequent works have adopted single-layer RNNs, either GRU or LSTM [Ikawa2019ac_spec, kim2019audiocaps, Nguyen2020temporalsub, Cakir2020ac_multitask, Ye2021peking]. The main differences among these works lie in how the decoder interacts with audio features from the encoder side. In a simple recipe, a global audio feature representation is obtained by applying mean pooling on the audio feature sequence extracted by the encoder, which is then used as the initial hidden state of the RNN decoder [wu2019audiocaption_hospital, xu2021audiocaption_car] or is injected into the RNN decoder at each time step [Nguyen2020temporalsub, eren2020acfake]. This simple mean pooling method for getting a global audio representation is widely used in audio tagging tasks to detect what audio events are present in the whole audio clip [kong2020panns]. However, this method does not consider the relationships between audio features and is unable to capture the fine-grained information about audio events. This fine-grained information could be important for caption generation. The attention mechanism has been employed to overcome this problem [drossos2017ac_1]. When generating a word at each time step, the RNN decoder can attend to the whole audio feature sequence and place more weight on the informative audio features. Thus, the global audio representation at each time step is a different combination of the whole audio feature sequence. In addition, to better exploit previously generated words, Ye et al. [Ye2021peking] introduced another attention module to attend to previously generated words at each time step.

RNNs with an attention mechanism show reasonable performance in audio captioning and are widely used. The main disadvantage of RNNs is that they may struggle to capture long-range dependencies between the generated words. Fortunately, audio captions are usually short, thus the RNN decoders do not need to model very long-range dependencies.
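The difference between mean pooling and attention over the encoder output can be illustrated with a small NumPy sketch. It assumes a scaled dot-product scoring function with hypothetical projection matrices `W_q` and `W_k`; the surveyed systems may use other scoring functions (for example additive attention), so this shows the general idea rather than any specific model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mean_pool(audio_feats):
    # One fixed summary vector, identical at every decoding step.
    return audio_feats.mean(axis=0)

def attend(decoder_state, audio_feats, W_q, W_k):
    """Return a step-specific context vector and the attention weights."""
    query = W_q @ decoder_state                  # (d_att,)
    keys = audio_feats @ W_k.T                   # (T, d_att)
    scores = keys @ query / np.sqrt(len(query))  # one score per time frame
    weights = softmax(scores)                    # non-negative, sums to 1
    context = weights @ audio_feats              # weighted sum of frames
    return context, weights

rng = np.random.default_rng(0)
T, d_audio, d_dec, d_att = 50, 8, 6, 5
audio_feats = rng.normal(size=(T, d_audio))      # stand-in encoder output
W_q = rng.normal(size=(d_att, d_dec))            # hypothetical projections
W_k = rng.normal(size=(d_att, d_audio))

state_a = rng.normal(size=d_dec)                 # decoder state at one step
state_b = rng.normal(size=d_dec)                 # decoder state at a later step
context, weights = attend(state_a, audio_feats, W_q, W_k)
context_b, _ = attend(state_b, audio_feats, W_q, W_k)

# Different decoder states yield different summaries of the same audio.
print(np.allclose(context, context_b))
```

Mean pooling is the special case in which every frame receives the same weight `1 / T`, which is why it cannot emphasise the frames relevant to the word being generated.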

Figure 3: Diagram of Transformer-based language model.

3.2.2 Transformers

Since Vaswani et al. [vaswani2017attention] proposed the Transformer in 2017, the self-attention mechanism has quickly become the basic building block of large language models. Transformer-based models such as BERT [devlin2019bert], GPT [radford2019gpt2] and BART [lewis2020bart] show superior performance to RNNs and dominate text-related tasks in the field of NLP. Transformer-based models have also been employed in recent audio captioning works and achieve state-of-the-art performance. The Transformer decoder is a stack of blocks, in which each block consists of a masked self-attention module, a cross-attention module and a feed-forward module. Figure 3 shows a diagram of a Transformer-based language decoder. The audio feature sequence can directly interact with the Transformer decoder through the cross-attention module without the need to generate a global audio feature representation. When generating a word at each time step, the masked self-attention module attends to the previously generated words to exploit context information. The output of the masked self-attention module then acts as queries and interacts with the audio feature sequence from the encoder, which acts as keys and values in the cross-attention module. The standard Transformer decoder is widely used [chen2020ac_CNN, Mei2021ac_trans, Han2021netease, Mei2021ACT]. Due to the limited data available in audio captioning, many of these works use shallow Transformer decoders, usually only two layers, unlike NLP tasks where very deep Transformers are often used. Some modifications to the standard Transformer architecture have also been investigated. Xiao et al. [xiao2022local] introduced an attention-free Transformer decoder that could capture local information within audio features.
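The two attention steps inside a decoder block can be sketched as follows. This is a deliberately reduced, single-head illustration: it omits the learned query, key and value projections, residual connections, layer normalization and the feed-forward module, and it assumes word embeddings and audio features share the same dimension. All names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Disallowed positions get a very negative score, so their weight is 0.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def decoder_block_attention(word_emb, audio_feats):
    n_words = word_emb.shape[0]
    # Causal mask: word i may only look at words 0..i.
    causal = np.tril(np.ones((n_words, n_words), dtype=bool))
    self_out, self_w = attention(word_emb, word_emb, word_emb, mask=causal)
    # Cross-attention: text provides queries, audio provides keys and values.
    cross_out, cross_w = attention(self_out, audio_feats, audio_feats)
    return cross_out, self_w, cross_w

rng = np.random.default_rng(0)
n_words, n_frames, d_model = 4, 30, 16
word_emb = rng.normal(size=(n_words, d_model))      # embedded caption prefix
audio_feats = rng.normal(size=(n_frames, d_model))  # stand-in encoder output

cross_out, self_w, cross_w = decoder_block_attention(word_emb, audio_feats)
print(cross_out.shape, self_w.shape, cross_w.shape)
```

Each row of `cross_w` is a distribution over audio frames for one word position, so no single global audio vector has to be computed beforehand.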

In summary, Transformer-based decoders show state-of-the-art performance in audio captioning, and they are also computationally efficient compared with RNN-based decoders.

4 Auxiliary information

In addition to the standard encoder-decoder architecture, researchers have investigated the use of auxiliary information such as keywords or sentence information to guide the caption generation. In this section, we discuss the auxiliary information used in the literature.

Directly mapping audio signals to sentences is challenging, thus keywords are widely employed to solve the word-selection indeterminacy problem and guide the caption generation [koizumi2020keywords]. To get the keywords, Kim et al. [kim2019audiocaps] retrieve the nearest training audio clip from AudioSet, the largest audio event dataset, and convert its audio tagging labels into keywords. They then align these keywords with the audio features via an attention flow and feed the output into the decoder. Some datasets may not have corresponding label information for each audio clip, and in this case, researchers first extract keywords or tags from human-annotated captions according to rules such as the frequency and part-of-speech of the words [chen2020ac_CNN, koizumi2020keywords, Cakir2020ac_multitask, Han2021netease]. Different methods have been investigated to make use of the keywords. Cakir et al. [Cakir2020ac_multitask] introduce a keyword decoder that detects the keywords of an audio clip and is jointly trained with the audio captioning model. Chen et al. [chen2020ac_CNN] extract keywords from captions, and pre-train the audio encoder with an audio tagging task to enhance the ability of the encoder to learn robust audio patterns. Koizumi et al. [koizumi2020keywords] employ a keyword estimation branch after the encoder, and combine the keywords with audio features before passing them to the language decoder. Ye et al. [Ye2021peking] utilize multi-scale CNN features for keyword prediction. However, some researchers found that keywords might not improve the system performance in some situations. Takeuchi et al. [Takeuchi2020ntt_first] found that keywords may not work well when the model is trained from scratch, while Ye et al. [Ye2021peking] reported that their model did not converge when using only keyword information. The accuracy of the keywords could be a bottleneck, as wrong keywords might adversely affect the captioning performance.
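As a rough illustration of frequency-based keyword extraction from reference captions, consider the sketch below. The stop-word list, the tokenization, the example captions and the rule of counting a word once per caption are all assumptions made for this example; the cited works use their own rules, often including part-of-speech filtering, which is not reproduced here.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller list or POS tags.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of",
             "and", "as", "to", "while", "with"}

def extract_keywords(captions, top_k=3):
    """Pick the words that appear in the most reference captions."""
    counts = Counter()
    for caption in captions:
        tokens = caption.lower().replace(".", "").replace(",", "").split()
        # Count each word at most once per caption.
        counts.update({t for t in tokens if t not in STOPWORDS})
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]

# Made-up reference captions for a single clip.
captions = [
    "A dog barks while birds chirp in the background",
    "Birds are chirping and a dog is barking",
    "A dog barks loudly as birds sing",
]
keywords = extract_keywords(captions)
print(keywords)
```

Note that "barks" and "barking" are counted separately here; stemming or lemmatization would merge them, which is one reason keyword quality depends heavily on the extraction rules.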

Sentence information has also been investigated. Ikawa et al. [Ikawa2019ac_spec] introduce a ‘specificity’ term to measure the output text based on the amount of information it carries. The model is trained to generate captions whose ‘specificity’ is close to that of the ground truth captions. Similarly, Xu et al. [xu2021audiocaption_car] introduce a sentence loss to generate captions with higher semantic similarity to their ground truth, where they employ the pre-trained language model BERT [devlin2019bert] to get the sentence embedding.

Although different auxiliary information was used to improve the caption generation process, these methods have not brought significant improvements and they may not work well for all datasets. In the DCASE challenge 2021, most teams still used the standard encoder-decoder model without using auxiliary information and achieved promising results. How to improve the AAC system with the auxiliary information still needs more investigation.

Reference | Audio Encoder | Language Decoder | Key aspects
--- | --- | --- | ---
Drossos et al. [drossos2017ac_1] | RNN | RNN | Attention
Wu et al. [wu2019audiocaption_hospital] | RNN | RNN | NA
Xu et al. [xu2021audiocaption_car] | RNN | RNN | Sentence similarity loss
Ikawa et al. [Ikawa2019ac_spec] | RNN | RNN | ‘Specificity’ term
Kim et al. [kim2019audiocaps] | CNN(VGGish)+RNN | RNN | Multi-scale features, Semantic attention
Nguyen et al. [Nguyen2020temporalsub] | RNN | RNN | Temporal subsampling
Cakir et al. [Cakir2020ac_multitask] | RNN | RNN | Multi-task learning (keywords)
Perez-Castanos et al. [Perez-Castanos2020ac_gamma] | CNN | RNN | Attention
Chen et al. [chen2020ac_CNN] | CNN | Transformer | Pre-trained encoder
Xu et al. [xu2020crnn_sjtu] | CRNN | RNN | Reinforcement learning
Takeuchi et al. [Takeuchi2020ntt_first] | CNN+RNN | RNN | Keywords, sentence length estimation
Tran et al. [tran2020wavetransformer] | CNN | Transformer | 1-D and 2-D CNN
Eren et al. [eren2020acfake] | CNN(PANNs)+RNN | RNN | Keywords
Koizumi et al. [koizumi2020keywords] | CNN(VGGish)+Transformer | Transformer | Keywords
Koizumi et al. [koizumi2020ac_gpt2] | CNN(VGGish) | GPT-2+Transformer | GPT-2, similar captions retrieval
Xu et al. [xu2021invest_cnn_crnn] | CNN, CRNN | RNN | Attention, transfer learning
Mei et al. [Mei2021ac_trans] | CNN(PANNs) | Transformer | Transfer learning and reinforcement learning
Mei et al. [Mei2021ACT] | Transformer | Transformer | Full transformer network
Han et al. [Han2021netease] | CNN(PANNs) | Transformer | Weakly-supervised pre-training, keywords
Ye et al. [Ye2021peking] | CNN(PANNs) | RNN | Keyword, attention
Gontier et al. [Gontier2021ac_bart] | CNN(VGGish) | BART | YAMNet tags, BART
Narisetty et al. [Narisetty2021ac_conformer] | CNN(PANNs)+Conformer | Transformer+RNN | ASR techniques
Liu et al. [Liu2021cl4ac] | CNN(PANNs) | Transformer | Contrastive learning
Won et al. [Won2021ac] | CNN(PANNs) | Transformer | Transfer learning
Berg et al. [Berg2021continual] | CNN | Transformer | Continual learning
Weck et al. [Weck2021ac_offtheshelf] | CNN(VGGish,YAMNet,OpenL3,COALA) | Transformer | Transfer learning
Mei et al. [mei2021diverse] | CNN(PANNs) | Transformer | GAN, diversity
Xiao et al. [xiao2022local] | CNN | Transformer | Attention-free Transformer
Liu et al. [liu2022leveraging] | CNN(PANNs) | BERT | Transfer learning, BERT
Chen et al. [chen2022contrastive] | CNN | Transformer | Transfer learning, contrastive learning

Table 1: An overview of published methods for audio captioning.

5 Training strategies

Supervised training with a cross-entropy (CE) loss is a standard recipe for training an audio captioning model. The main drawback of this setting is that it may cause ‘exposure bias’ due to the discrepancy between training and testing [rennie2017exposurebias]. Reinforcement learning is introduced to solve this problem and directly optimize evaluation metrics. In addition, transfer learning has been widely used to overcome the data scarcity problem. In this section, we discuss the popular training strategies used in the literature.

5.1 Cross-entropy training

The cross-entropy loss with maximum likelihood estimation (MLE) is widely used for training audio captioning models. During training, this approach adopts a ‘teacher-forcing’ strategy [rennie2017exposurebias]. That is, the objective of training is to minimize the negative log-likelihood (equivalent to maximizing the log-likelihood) of the current ground truth word given the previous ground truth words at each time step, which can be formulated as follows:

$$\mathcal{L}_{\mathrm{CE}}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log p_{\theta}(y_t \mid y_{1:t-1}, x)$$

where $y_t$ is the ground truth word at time step $t$, $T$ is the length of the ground truth caption, $x$ is the input audio clip, and $\theta$ are the parameters of the audio captioning model.
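The teacher-forced objective described above, the average negative log-likelihood of each ground truth word, can be computed as in the following NumPy sketch. The decoder itself is not modeled: `logits` is a hypothetical stand-in for the scores a decoder would output at each step when fed the previous ground truth words, and the function names are illustrative.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def caption_ce_loss(logits, target_ids):
    """Mean negative log-likelihood of the ground truth words.

    logits:     (T, V) scores; row t is assumed to be computed from the
                audio clip and the ground truth words before step t.
    target_ids: (T,) vocabulary indices of the ground truth caption.
    """
    log_probs = log_softmax(logits)
    token_ll = log_probs[np.arange(len(target_ids)), target_ids]
    return -token_ll.mean()

vocab_size = 5
target_ids = np.array([2, 0, 3])

# A model with no preference assigns probability 1/V to every word,
# so the loss equals log(V).
uniform_logits = np.zeros((3, vocab_size))
uniform_loss = caption_ce_loss(uniform_logits, target_ids)

# A model that puts most of its mass on the correct words has a lower loss.
confident_logits = np.zeros((3, vocab_size))
confident_logits[np.arange(3), target_ids] = 5.0
confident_loss = caption_ce_loss(confident_logits, target_ids)

print(uniform_loss, confident_loss)
```

At test time the decoder is conditioned on its own previous outputs instead of the ground truth, which is the mismatch this loss never exposes the model to.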

Models trained via the cross-entropy loss can generate syntactically correct sentences and achieve high scores in terms of the evaluation metrics [Mei2021ac_trans]. However, there are also some disadvantages. First, the ‘teacher forcing’ strategy brings the problem known as ‘exposure bias’ [rennie2017exposurebias], that is, each word to be predicted is conditioned on previous ground-truth words in the training stage, while it is conditioned on previous output words in the test stage. This discrepancy leads to error accumulation during text generation in the test stage. Second, models tend to generate generic and simple captions even though each audio clip has multiple diverse human-annotated captions in the training set [mei2021diverse]. This is because MLE training tends to encourage the use of highly frequent words appearing in the ground truth captions.

5.2 Reinforcement learning

Xu et al. [xu2020crnn_sjtu] employ a reinforcement learning approach to solve the ‘exposure bias’ problem and directly optimize the non-differentiable evaluation metrics. In a reinforcement learning setting, the captioning model is regarded as an agent and a policy is determined by the model’s parameters. The agent executes an action at each time step to sample a word according to the policy. Once a sentence is generated, the agent receives a reward for the generated sentence. The goal of training is to optimize the model to maximize the expected reward, where the reward can be any evaluation metric. This can be formalized as minimizing the negative expected reward:

$$\mathcal{L}_{\mathrm{RL}}(\theta) = -\mathbb{E}_{w^s \sim p_{\theta}}\left[r(w^s)\right]$$

where $w^s$ is a caption sampled from the model, $r(w^s)$ is the reward of the sampled caption and $\theta$ are the model parameters. The caption can be sampled via Monte-Carlo sampling [sutton2018reinforcement]; however, this is computationally expensive. A more computationally efficient method, self-critical sequence training (SCST) [rennie2017exposurebias], is generally employed. SCST employs the reward of a sentence generated by greedy search as the baseline, and thus avoids learning an estimate of expected future rewards. The expected gradient with respect to a single sampled caption can be approximated as:

$$\nabla_{\theta}\mathcal{L}_{\mathrm{RL}}(\theta) \approx -\left(r(w^s) - r(\hat{w})\right)\nabla_{\theta}\log p_{\theta}(w^s)$$

where $r(\hat{w})$ is the reward of a caption generated by the current model using a greedy search.
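The self-critical update can be illustrated with a small sketch that computes the surrogate loss for one sampled caption. It assumes a toy reward (unigram precision against a single reference) in place of a real captioning metric, made-up captions and word probabilities, and no actual model; the names `toy_reward` and `scst_loss` are hypothetical.

```python
import numpy as np

def toy_reward(candidate, reference):
    # Stand-in for an evaluation metric: fraction of candidate words
    # that also occur in the reference caption.
    cand = candidate.split()
    ref = set(reference.split())
    return sum(w in ref for w in cand) / len(cand)

def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Surrogate loss for one sampled caption.

    Its derivative with respect to the summed log-probability of the sampled
    words is -(sample_reward - greedy_reward), i.e. the self-critical weight.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * np.sum(sample_log_probs)

reference = "a dog barks loudly"
sampled_caption = "a dog barks"   # drawn from the model's distribution
greedy_caption = "a cat meows"    # produced by greedy decoding

sample_reward = toy_reward(sampled_caption, reference)
greedy_reward = toy_reward(greedy_caption, reference)
advantage = sample_reward - greedy_reward

# Made-up per-word probabilities the model assigned to the sampled words.
sample_log_probs = np.log(np.array([0.5, 0.4, 0.3]))
loss = scst_loss(sample_log_probs, sample_reward, greedy_reward)
print(advantage, loss)
```

When the sampled caption scores higher than the greedy one, minimizing this loss raises the probability of the sampled words; when it scores lower, their probability is pushed down, and when the two rewards are equal the sample contributes no gradient.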

Reinforcement learning could substantially improve the scores of the evaluation metrics, although it is used to optimize only one metric. However, Mei et al. [Mei2021ac_trans] found that reinforcement learning may impact adversely on the quality of generated captions such as introducing repetitive words and incorrect syntax. This also implies that existing evaluation metrics may not correlate well with human judgements.

5.3 Transfer learning

The availability of audio captioning datasets is limited due to the challenging and time-consuming process of data collection and annotation. To overcome the data scarcity problem, transfer learning is widely adopted. In the encoder of the captioning system, pre-trained audio neural networks such as VGGish [hershey2017vggish] and PANNs [kong2020panns] are widely used to initialize the parameters of encoders [Mei2021ac_trans, Ye2021peking, Han2021netease]. Xu et al. [xu2021invest_cnn_crnn] investigated the impact of pre-training on audio captioning performance. They show that audio encoders pre-trained on an audio tagging task give the best performance. In addition, Weck et al. [Weck2021ac_offtheshelf] compare four off-the-shelf audio networks. In all cases, pre-trained audio encoders substantially improve the performance of the audio captioning system. In the language decoder, although many pre-trained Transformer-based language models have been released in recent years, most of those models cannot be directly used as the language decoder, since the decoder needs to interact with audio features from the encoder via a cross-attention module. Koizumi et al. [koizumi2020ac_gpt2] utilize GPT-2 [radford2019gpt2] to get word embeddings. Gontier et al. [Gontier2021ac_bart] fine-tune BART [lewis2020bart] conditioned on pre-trained audio embeddings and tags to generate captions and achieve state-of-the-art performance. To leverage pre-trained BERT [devlin2019bert], Liu et al. [liu2022leveraging] add cross-attention layers with randomly initialized weights to pre-trained BERT models and use them as the decoder, demonstrating the efficacy of pre-trained BERT models for audio captioning.

In summary, pre-trained audio encoders have proved effective for obtaining robust audio features and overcoming the data scarcity problem, while how to incorporate existing large pre-trained language models into an audio captioning system still needs further investigation.

5.4 Other approaches

To overcome the data scarcity problem, Liu et al. [Liu2021cl4ac] and Chen et al. [chen2022contrastive] both investigated contrastive training, which learns a better alignment between audio and text by constructing contrastive samples during training. Berg et al. [Berg2021continual] presented a continual learning approach for continuously adapting an audio captioning method to new, unseen general audio signals without forgetting learned information. Mei et al. [mei2021diverse] argued that an audio captioning system should be able, like human beings, to generate diverse captions for a given audio clip or across similar audio clips. They proposed an adversarial training framework based on a generative adversarial network (GAN) [yu2017seqgan] to encourage diversity in audio captioning systems. In addition, Narisetty et al. [narisetty2022joint] proposed approaches for end-to-end joint modeling of speech recognition and audio captioning tasks.
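The idea of contrastive training between audio and text can be illustrated with an InfoNCE-style objective. This is a generic sketch and not the specific loss used in [Liu2021cl4ac] or [chen2022contrastive]; the function name `info_nce`, the temperature value and the toy embeddings are assumptions made for illustration.

```python
import math


def info_nce(audio_emb, text_emb, temperature=0.07):
    """Mean audio-to-text InfoNCE loss for a batch of paired embeddings.

    audio_emb[i] and text_emb[i] are assumed to describe the same clip and
    to be L2-normalised; every other text in the batch acts as a negative.
    """
    losses = []
    for i, audio in enumerate(audio_emb):
        # Scaled dot-product similarity of this audio clip to every caption.
        logits = [sum(x * y for x, y in zip(audio, text)) / temperature
                  for text in text_emb]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the matching caption.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(audio, aligned_text), info_nce(audio, swapped_text))
```

The loss is small when each audio embedding is closest to its own caption and large when the pairs are mismatched, which is the behaviour a contrastive objective is meant to reward.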

A brief overview of the published audio captioning methods is shown in Table 1. The table lists the type of deep neural network used to encode audio information and the language model used to generate captions, and its final column shows the key aspects of each method.

6 Evaluation metrics

Evaluating audio captions is a challenging and subjective task, because an audio clip can correspond to several correct captions that may use different words or grammar, and/or describe different parts of the audio clip. Existing works adopt the same evaluation metrics used in image captioning; most of these metrics are borrowed from NLP tasks such as machine translation and summarization, and the rest were designed specifically for image captioning. The automatic evaluation metrics compare the machine-generated captions with human-annotated references, where the number of references for each audio clip may vary across datasets. In this section, we first introduce the conventional rule-based metrics, and then discuss some novel model-based metrics.

6.1 Conventional evaluation metrics

BLEU. BLEU (BiLingual Evaluation Understudy) [papineni2002bleu] was originally designed to measure the quality of machine-generated sentences in machine translation. BLEU calculates a modified n-gram precision for n up to four, where an n-gram is a window of n consecutive words: the counts of the n-grams in the candidate sentence are first collected and clipped by their corresponding maximum counts in the references, and the clipped counts are then summed and divided by the total number of candidate n-grams. The modified n-gram precisions are combined by a geometric mean with uniform weights to account for both adequacy and fluency, where 1-gram matches account for adequacy while longer n-gram matches account for fluency. As precision tends to give short sentences higher scores, a brevity penalty is introduced to penalize short sentences.
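The clipping and brevity penalty described above can be sketched in a few lines. This is a minimal sentence-level illustration, assuming whitespace-tokenized input and no smoothing; the function names are chosen here for illustration and do not come from any particular toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates that are shorter than the closest reference."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1.0 - r / c)


def sentence_bleu(candidate, references, max_n=4):
    """Geometric mean of the clipped precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, 1))
```

The example candidate repeats "the" seven times, but only two occurrences are credited because no single reference contains the word more than twice, which is exactly what the clipping step is for.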

ROUGE. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [lin2004rouge] is a set of metrics proposed to measure the quality of a machine-generated summary. ROUGE-L, which is based on the longest common subsequence, is widely used in image and audio captioning. ROUGE-L first computes the length of a longest common subsequence between a candidate and a reference, which is then divided by the length of the candidate and by the length of the reference to obtain precision and recall, respectively. An F-measure combining precision and recall, weighted in favor of recall, is then calculated as the ROUGE-L score.
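The ROUGE-L computation for a single candidate-reference pair can be sketched as follows. The value `beta=1.2` is an assumption borrowed from common captioning evaluation code; any `beta > 1` weights recall more heavily than precision.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """Recall-weighted F-measure built on the LCS length."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


print(lcs_length("a b c d".split(), "a c d e".split()))
```

With `beta > 1`, a candidate that covers the whole reference but adds extra words scores higher than one that is precise but misses half of the reference.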

METEOR. METEOR (Metric for Evaluation of Translation with Explicit ORdering) [banerjee2005meteor] is also a metric for evaluating machine translation systems. METEOR calculates unigram precision and unigram recall based on an explicit word-to-word matching between a candidate and one or more references in terms of their surface forms, stemmed forms, and meanings. An F-mean that places most of the weight on recall is then computed. To account for longer matches, unigrams that are in adjacent positions in both the candidate and the references are grouped into chunks, and a penalty based on the number of chunks is combined with the F-mean to give the final METEOR score.
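A simplified sketch of the F-mean and chunk penalty is given below. It assumes exact surface-form matching only (no stemming or synonym matching), a greedy alignment rather than the chunk-minimizing alignment of the full metric, and the parameter values of the original METEOR formulation (recall weighted nine times more than precision, penalty 0.5 times the cubed chunk ratio). The function names are illustrative.

```python
def align_exact(candidate, reference):
    """Greedy exact-match alignment as (candidate_index, reference_index) pairs."""
    used = set()
    pairs = []
    for i, word in enumerate(candidate):
        # Prefer the reference position right after the previous match,
        # so that an existing chunk is extended when possible.
        preferred = pairs[-1][1] + 1 if pairs and pairs[-1][0] == i - 1 else None
        options = [j for j, ref_word in enumerate(reference)
                   if ref_word == word and j not in used]
        if not options:
            continue
        j = preferred if preferred in options else options[0]
        used.add(j)
        pairs.append((i, j))
    return pairs


def count_chunks(pairs):
    """Number of runs that are contiguous in both candidate and reference."""
    chunks = 0
    prev = None
    for i, j in pairs:
        if prev is None or i != prev[0] + 1 or j != prev[1] + 1:
            chunks += 1
        prev = (i, j)
    return chunks


def meteor_exact(candidate, reference):
    """Recall-weighted F-mean discounted by a fragmentation penalty."""
    pairs = align_exact(candidate, reference)
    matches = len(pairs)
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(pairs) / matches) ** 3
    return f_mean * (1 - penalty)


print(meteor_exact("c d a b".split(), "a b c d".split()))
```

In the example, all four words match but fall into two chunks, so the score is lower than for a candidate in the reference word order. With several references, the best score over the references would be taken.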

CIDEr. CIDEr (Consensus-based Image Description Evaluation) [vedantam2015cider] is an automatic consensus metric for evaluating image description quality. CIDEr also represents sentences using the n-grams present in them, where each n-gram is weighted by its term frequency-inverse document frequency (TF-IDF) weight, because n-grams that commonly occur in a dataset are likely to be non-informative. CIDEr computes the cosine similarity of the weighted n-grams between candidate and references, which accounts for both precision and recall. Similar to BLEU, CIDEr considers higher-order n-grams (up to four) to capture grammatical properties and richer semantics.
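The TF-IDF weighting and cosine similarity can be sketched as below. This is a simplified illustration of the basic CIDEr idea, assuming pre-tokenized captions, no stemming, and none of the refinements of later variants; the document frequency of an n-gram is taken as the number of clips whose references contain it, and n-grams unseen in the corpus are treated as occurring in one clip (an assumption of this sketch).

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def document_frequency(corpus_refs, n):
    """For each n-gram, the number of clips whose references contain it."""
    df = Counter()
    for refs in corpus_refs:
        seen = set()
        for ref in refs:
            seen.update(ngram_counts(ref, n))
        df.update(seen)
    return df


def tfidf(tokens, n, df, num_clips):
    """TF-IDF vector of a caption, stored as a dict keyed by n-gram."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    return {gram: (count / total) * math.log(num_clips / max(1, df[gram]))
            for gram, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm > 0 else 0.0


def cider(candidate, references, corpus_refs, max_n=4):
    """Average over n of the mean cosine similarity to the references."""
    num_clips = len(corpus_refs)
    score = 0.0
    for n in range(1, max_n + 1):
        df = document_frequency(corpus_refs, n)
        cand_vec = tfidf(candidate, n, df, num_clips)
        sims = [cosine(cand_vec, tfidf(ref, n, df, num_clips)) for ref in references]
        score += sum(sims) / len(sims)
    return score / max_n


corpus = [
    ["a dog barks loudly near the road".split()],
    ["rain falls on a tin roof".split()],
    ["a man speaks while birds sing".split()],
]
print(cider("a dog barks near the road".split(), corpus[0], corpus))
```

In this toy corpus the word "a" appears in the references of every clip, so its IDF weight is zero and it contributes nothing to the similarity, which mirrors the motivation for the TF-IDF weighting.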

SPICE. SPICE (Semantic Propositional Image Caption Evaluation) [anderson2016spice] is an image captioning evaluation metric based on semantic content matching. SPICE parses both candidate and references into scene graphs in which the objects, attributes and relations are encoded. An F-score is then calculated based on the matching of tuples extracted from the candidate and reference scene graphs. SPICE ignores grammar and fluency and focuses solely on semantic matching.
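The final scoring step of SPICE can be illustrated as an F-score over sets of tuples. The sketch below assumes the scene-graph parsing has already been done and that tuples match only when they are identical; the example tuples are hypothetical and hand-written, not the output of a real parser.

```python
def tuple_f_score(candidate_tuples, reference_tuples):
    """F1 score over the semantic tuples of candidate and reference graphs."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tuples: objects, (object, attribute), (subject, relation, object).
candidate = [("dog",), ("dog", "barking"), ("dog", "in", "yard")]
reference = [("dog",), ("dog", "barking"), ("car",), ("car", "passing")]
print(tuple_f_score(candidate, reference))
```

Because only the tuples are compared, word order and fluency of the original sentences have no influence on the result.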

SPIDEr. SPIDEr [liu2017spider] was proposed for evaluating image captions and is used as the official ranking metric of the automated audio captioning task in the DCASE Challenge. SPIDEr is the average of SPICE and CIDEr: the SPICE score ensures that captions are semantically faithful to the content, while the CIDEr score ensures that captions are syntactically fluent.
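Written out, the averaging described above is simply an equal-weight mean of the two scores:

```latex
\mathrm{SPIDEr} = \frac{1}{2}\left(\mathrm{SPICE} + \mathrm{CIDEr}\right)
```

For instance, hypothetical scores of CIDEr = 0.40 and SPICE = 0.12 would give SPIDEr = 0.26.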

6.2 Model-based metrics

BERTScore. BERTScore [Zhang2020BERTScore] is an evaluation metric for text generation tasks. Unlike conventional metrics, which rely mostly on surface-form similarity, BERTScore utilizes pre-trained BERT [devlin2019bert] contextual embeddings that can capture semantic similarity, distant dependencies and ordering. After obtaining the contextual embedding of each token through BERT, BERTScore measures the cosine similarity between the tokens of the candidate and those of the references, and each token is matched to the most similar token in the other sentence. The matched token pairs are used to calculate precision, recall and an F1 measure. Importance weighting with inverse document frequency is also introduced to give more weight to rare words.
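The greedy matching step can be sketched independently of the embedding model. The example below assumes that token embeddings are already available (here as hand-written 2-D toy vectors rather than real BERT outputs) and omits the optional inverse-document-frequency weighting; the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(cand_emb, ref_emb):
    """Precision, recall and F1 from greedy token-to-token matching."""
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(row) for row in sim) / len(cand_emb)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(sim[i][j] for i in range(len(cand_emb)))
                 for j in range(len(ref_emb))) / len(ref_emb)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(greedy_match_scores(candidate_tokens, reference_tokens))
```

Here both candidate tokens find an exact counterpart, so precision is 1, while the third reference token is only partially covered, which lowers recall.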

SentenceBERT. SentenceBERT [reimers2019sentenceBERT] is not an evaluation metric in itself but a modification of the pre-trained BERT [devlin2019bert]. SentenceBERT can be used to obtain fixed-size embeddings for input sentences. The sentence embeddings are then used to calculate a similarity score, such as cosine similarity, Euclidean distance or another similarity measure. This enables SentenceBERT to be used in audio captioning to compare candidate and reference captions at the sentence level.

FENSE. FENSE (Fluency ENhanced Sentence-bert Evaluation) [zhou2021fense] is a model-based evaluation metric specifically proposed for audio captioning. FENSE utilizes Sentence-BERT to derive sentence embeddings for the candidate and reference captions, and calculates their average cosine similarity score. To capture grammar issues such as repeated words or phrases and incomplete sentences, FENSE uses a separate pre-trained error detector to penalize the Sentence-BERT scores when fluency issues are detected.
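The way a fluency penalty can be combined with a sentence-similarity score is sketched below. The Sentence-BERT similarities and the error probability are taken as given inputs; the threshold and the penalty factor are assumed values for illustration, and the function name `fense_like_score` is hypothetical.

```python
def fense_like_score(similarities, error_probability, threshold=0.9, penalty=0.9):
    """Mean sentence similarity, scaled down when a fluency error is detected.

    similarities: cosine similarities between the candidate and each reference.
    error_probability: output of a separate fluency error detector in [0, 1].
    threshold, penalty: assumed values for this sketch.
    """
    mean_similarity = sum(similarities) / len(similarities)
    if error_probability > threshold:
        return (1.0 - penalty) * mean_similarity
    return mean_similarity


print(fense_like_score([0.8, 0.6], error_probability=0.1))
print(fense_like_score([0.8, 0.6], error_probability=0.95))
```

A caption that is semantically close to the references but flagged as disfluent therefore receives only a fraction of its similarity score.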

Dataset    Method
AudioCaps  Kim et al. [kim2019audiocaps]                 0.614  0.446  0.203  0.593  0.144
AudioCaps  Koizumi et al. [koizumi2020ac_gpt2]           0.638  0.458  0.199  0.603  0.139
AudioCaps  Mei et al. [Mei2021ACT]                       0.647  0.488  0.222  0.679  0.160
AudioCaps  Xu et al. [xu2021invest_cnn_crnn]             0.655  0.476  0.229  0.66   0.168
AudioCaps  Liu et al. [liu2022leveraging]                0.671  0.498  0.232  0.667  0.172
AudioCaps  Gontier et al. [Gontier2021ac_bart]           0.699  0.523  0.241  0.753  0.176
AudioCaps  Eren et al. [eren2020acfake]                  0.710  0.490  0.290  0.750  -
Clotho     Cakir et al. [Cakir2020ac_multitask]          0.409  0.156  0.088  0.107  0.040
Clotho     Nguyen et al. [Nguyen2020temporalsub]         0.417  0.154  0.089  0.093  0.040
Clotho     Perez-Castanos [Perez-Castanos2020ac_listen]  0.469  0.265  0.136  0.214  0.086
Clotho     Tran et al. [tran2020wavetransformer]         0.489  0.303  0.143  0.268  0.095
Clotho     Takeuchi et al. [Takeuchi2020ntt_first]       0.512  0.325  0.343  0.290  0.089
Clotho     Koizumi et al. [koizumi2020keywords]          0.521  0.309  0.149  0.258  0.097
Clotho     Chen et al. [chen2020ac_CNN]                  0.534  0.343  0.160  0.346  0.108
Clotho     Xu et al. [xu2020crnn_sjtu]                   0.561  0.341  0.162  0.338  0.108
Clotho     Narisetty et al. [Narisetty2021ac_conformer]  0.541  0.346  0.161  0.362  0.110
Clotho     Liu et al. [Liu2021cl4ac]                     0.553  0.349  0.168  0.368  0.115
Clotho     Xu et al. [xu2021invest_cnn_crnn]             0.556  0.363  0.169  0.377  0.115
Clotho     Chen et al. [chen2022contrastive]             0.572  0.379  0.171  0.407  0.119
Clotho     Xiao et al. [xiao2022local]                   0.578  0.387  0.177  0.434  0.122
Clotho     Won et al. [Won2021ac]                        0.564  0.376  0.177  0.441  0.128
Clotho     Han et al. [Han2021netease]                   0.583  0.391  0.179  0.456  0.128
Clotho     Eren et al. [eren2020acfake]                  0.590  0.350  0.220  0.280  -
Clotho     Ye et al. [Ye2021peking]                      0.648  -      0.190  0.491  0.131
Clotho     Mei et al. [Mei2021ac_trans]                  0.634  0.423  0.187  0.476  0.134
Table 2: Performances of audio captioning methods on two main datasets.

7 Datasets

The release of high-quality audio captioning datasets has greatly promoted the development of this area. Existing datasets differ in many aspects, such as the number of audio clips, the number of captions per audio clip, and the length of each audio clip. These characteristics affect the design and the performance of audio captioning models. We describe the details of existing datasets in this section. To better understand the datasets, we then evaluate them using the consensus scores of the previously introduced metrics.

7.1 Datasets description

AudioCaps. AudioCaps [kim2019audiocaps] is the largest audio captioning dataset so far. All the audio clips are 10 seconds long and are sourced from AudioSet, a large-scale audio event dataset [audioset]. The audio clips are selected according to criteria that ensure the chosen clips are balanced with respect to audio tags and diverse in content. The audio clips are annotated by crowdworkers through Amazon Mechanical Turk (AMT). Annotators are provided with an audio clip together with corresponding word hints and video hints, and are required to write a natural language description using the provided information.

The official release of AudioCaps contains 51k audio clips and is divided into a training set, a validation set and a test set. Each audio clip in the training set has one human-annotated caption, while those in the validation and test sets have five captions each. As the audio clips in AudioSet are extracted from YouTube videos, some of them may no longer be downloadable, so the number of downloadable audio clips may differ from the official release of AudioCaps. The statistics in Table 3 are reported based on the official release of AudioCaps.

Dataset    # of captions per audio  Clip duration
AudioCaps  1, 5                     10 s           5066  8.79
Clotho     5                        15-30 s        4365  11.33
MACS       2, 3, 4, 5               10 s           2776  9.24
Table 3: An overview of English-annotated datasets.

Clotho. Clotho [drossos2020clotho] is the official ranking dataset used in Task 6 (Automated Audio Captioning) of the DCASE challenges in 2020 and 2021. All the audio clips are sourced from the online platform Freesound [font2013freesound], and their durations are distributed almost uniformly between 15 and 30 seconds. Annotators were employed through AMT to crowdsource the captions. During the annotation process, only the audio signal was available to the annotators; no other information, such as the word hints and video hints used in AudioCaps, was provided, as such auxiliary information may introduce a bias.

The latest release, Clotho v2, is divided into a training set, a validation set and an evaluation set. Each audio clip has five human-annotated captions, each at least 8 words long. In the DCASE challenges, all three published sets may be used to train the models, and the final performance is assessed by the organisers on a withheld testing split. When reporting results in conference or journal papers, performance is assessed on the published evaluation set, and some authors may include the validation set in training. As a result, the model performances reported on Clotho may not be directly comparable.

Dataset    BLEU1  BLEU2  BLEU3  BLEU4  ROUGE-L  METEOR  CIDEr  SPICE  SPIDEr  BERTScore  SentenceBERT
AudioCaps  0.65   0.48   0.37   0.29   0.49     0.28    0.90   0.21   0.56    0.52       0.64
Clotho     0.65   0.49   0.38   0.31   0.50     0.30    0.86   0.23   0.54    0.54       0.53
MACS       0.49   0.28   0.16   0.08   0.32     0.18    0.21   0.13   0.17    0.24       0.52
Table 4: Consensus scores of English-annotated datasets.

MACS. MACS (Multi-Annotator Captioned Soundscapes) [Martin2021databias] consists of audio clips from the development set of the TAU Urban Acoustic Scenes 2019 dataset [Mesaros2018TAU]. The audio clips are all 10 seconds long, recorded in three acoustic scenes (airport, public square and park), and annotated by students. The annotation process has two stages. Given a list of ten classes and an audio clip, the annotators are first required to select the audio events present in the clip from the given class list. Afterwards, they are required to write a description of the audio clip.

MACS is not split into subsets. The number of captions per audio clip varies: most audio clips have five human-annotated captions, while some have only two, three or four.

AudioCaption. AudioCaption is a domain-specific Mandarin-annotated audio captioning dataset. Two scene-specific sets have been published: one for a hospital scene [wu2019audiocaption_hospital] and another for a car scene [xu2021audiocaption_car]. Each audio clip in the hospital-scene set has three captions, while each clip in the car-scene set has five. All the audio clips are annotated by native Mandarin speakers.

7.2 Datasets evaluation

Since all the datasets mentioned above are annotated under different protocols, they show different characteristics, such as the number of captions per audio clip, caption lengths and sample variance in multi-reference captions. We believe these characteristics will influence the design and performance of audio captioning models. To better understand the datasets, we evaluate the three English-annotated datasets from different aspects. Table 2 reports the performance of published methods on the two main datasets. Table 3 summarizes the datasets with some basic statistics. In addition, we use a consensus score [zhu2020consensus] to represent the agreement among the parallel reference captions for the same audio clip, and the results are shown in Table 4. The consensus score is computed among the parallel reference captions of an audio clip, and the underlying metric can be any of those mentioned above. Since the number of references varies among datasets, we report the consensus score of AudioCaps using the validation and test sets, Clotho using the training set, and MACS using all the audio clips that have five reference captions.
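As a rough illustration of the consensus score, the sketch below assumes a leave-one-out form: each of the $N$ reference captions $c_i$ of a clip is scored against the remaining references and the results are averaged, i.e. $\frac{1}{N}\sum_{i=1}^{N}\mathrm{metric}(c_i, C \setminus \{c_i\})$. This form, the symbols $N$ and $C$, the function names, and the toy `unigram_f1` metric are assumptions made for illustration and are not taken from the original definition.

```python
def consensus_score(captions, metric):
    """Leave-one-out agreement among the reference captions of one clip."""
    if len(captions) < 2:
        raise ValueError("need at least two reference captions")
    scores = []
    for i, caption in enumerate(captions):
        # Treat caption i as the candidate and all others as references.
        others = captions[:i] + captions[i + 1:]
        scores.append(metric(caption, others))
    return sum(scores) / len(scores)


def unigram_f1(candidate, references):
    """Toy stand-in metric: best unigram-overlap F1 against any reference."""
    cand = set(candidate.split())
    best = 0.0
    for ref in references:
        ref_words = set(ref.split())
        overlap = len(cand & ref_words)
        if overlap:
            p, r = overlap / len(cand), overlap / len(ref_words)
            best = max(best, 2 * p * r / (p + r))
    return best


clip_captions = ["a dog barks", "a dog barks", "rain falls"]
print(consensus_score(clip_captions, unigram_f1))
```

Any metric that accepts one candidate and a list of references could be passed in place of `unigram_f1`; a clip whose annotators agree closely yields a score near the metric's maximum.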

As the consensus scores are computed among the human-annotated captions, they can also be regarded as an upper bound representing human-level performance on each dataset. As can be seen from Table 4, the consensus scores on AudioCaps and Clotho are close to each other, except that the SentenceBERT score on AudioCaps is clearly higher than that on Clotho. Surprisingly, the consensus scores on MACS are lower than those on the other two datasets, and only the SentenceBERT score is close to theirs. This may indicate that the human-annotated captions in MACS are more diverse than those in the other two datasets, and that SentenceBERT can better capture semantic relevance between diverse captions. The consensus scores can be regarded as a measure of dataset quality to some extent.

8 Challenges and future directions

Many deep learning-based methods have been proposed to improve automated audio captioning systems, and this task has seen rapid progress in recent years. However, there is still a large gap between the performance of the resulting systems and human level performance. In this section, we discuss challenges remaining in this area and envisage possible future research directions.

8.1 Data

Several data-related challenges remain for audio captioning. First, data scarcity is still a main challenge. Existing datasets are limited in size. Collecting an audio captioning dataset is time-consuming, and it is hard to control the quality of human-annotated captions. Han et al. [Han2021netease] collected a weakly labelled dataset from sources available online to pre-train the AAC model, and showed that more training data (even weakly labelled) can greatly improve system performance. This suggests that audio clips available online, together with their weakly labelled text descriptions, can be used to learn more robust audio-text representations, as CLIP [radford2021clip] does for image-text representations in computer vision.

Second, existing datasets usually do not cover all possible real-life scenarios, and thus, audio captioning models cannot generalize well to different contexts. Martin et al. [Martin2021databias] investigate dataset bias of existing datasets from a lexical perspective. The bias problem still needs more investigation, e.g. how it will influence the model performance.

8.2 Model and training strategies

Existing AAC methods all follow the encoder-decoder paradigm and generate sentences in an auto-regressive manner. These two techniques have become the standard recipe for audio captioning models. Nonetheless, novel methods should be investigated in future research. For example, BERT-like architectures, which fuse the acoustic and textual modalities at an early stage and work well in image captioning [zhou2020unified, li2020oscar], could replace the encoder-decoder paradigm. Non-auto-regressive language models can reduce inference time by generating all words in parallel [gao2019non_auto], which might be a worthwhile research direction.

As for training strategies, the standard cross-entropy loss brings the problem of ‘exposure bias’ and tends to produce simple and generic captions. Although reinforcement learning has been introduced to solve this problem, it may adversely affect the quality of the generated captions. A promising line of exploration is to design new objective functions or to add human feedback in a reinforcement learning setting to solve these problems. In addition, how to make use of the knowledge learned by large pre-trained language models to help caption generation needs more investigation.

8.3 Evaluation

The performance of audio captioning systems is generally assessed by comparing machine-generated captions with human-annotated reference captions. Each test audio clip is provided with multiple references to capture the possible variations, as an audio clip can be described from different perspectives using distinct words and grammar. First, the multiple references cannot capture all possible variations, so a reasonable caption that is consistent with the content of the given audio clip but uses different words from the references may receive a low score. Second, as argued in [Mei2021ac_trans], existing evaluation metrics do not correlate well with human judgements. As incorporating human evaluation is time-consuming and expensive, future work should determine to what extent existing metrics correlate with human judgements, and develop more robust evaluation metrics.

8.4 Diversity and stylized captions

As argued in [dai2017im_diverse], a good captioning model should generate sentences that possess three properties: fidelity, meaning that the generated captions should reflect the audio content faithfully; naturalness, meaning that the captions cannot be distinguished as machine-generated; and diversity, meaning that the sentences should have rich and varied expressions, just as different people would describe an audio clip in different ways. However, many existing approaches only consider semantic fidelity. Further research should be conducted to improve the other two properties. In addition, stylized captioning systems could be a worthwhile research direction, where the captioning systems are adapted to different audiences such as children.

8.5 Potential directions

There are also other potential directions for audio captioning. For example, the temporal information of sound events is not well used in existing works. Future work could investigate the use of information about the activity and timing of sound events to generate more accurate captions. Information from other modalities could also be employed to train audio captioning models, for example through audio-visual captioning methods [tian2018attempt, iashin2020better]. In addition, audio captioning can potentially be linked with other audio-language multi-modal tasks, such as audio-text retrieval [koepke2022audioretrieval], audio question answering [fayek2020temporal], text-based audio generation [liu2021conditional] and text-based audio source separation [liu2022separate].

9 Conclusion

Audio captioning is a fast developing task involving both audio processing and natural language processing. In this paper, we have reviewed published audio captioning methods from the perspective of audio encoding and text decoding. We discussed auxiliary information employed to guide the caption generation, and training strategies adopted in the literature. In addition, main evaluation metrics and datasets are reviewed. We briefly outlined challenges and potential research directions in this area. We hope this survey can serve as a comprehensive introduction to audio captioning and encourage novel ideas for future research.


For the purpose of open access, the authors have applied a creative commons attribution (CC BY) licence to any author accepted manuscript version arising.


This work is partly supported by grant EP/T019751/1 from the Engineering and Physical Sciences Research Council (EPSRC), a Newton Institutional Links Award from the British Council, titled “Automated Captioning of Image and Audio for Visually and Hearing Impaired” (Grant number 623805725), and a Research Scholarship from the China Scholarship Council (CSC) No.202006470010.


AAC: Automated audio captioning, NLP: Natural language processing, ASR: Automatic speech recognition, RNN: Recurrent neural network, CNN: Convolutional neural network, STFT: Short-time Fourier transform, MFCCs: Mel-frequency cepstral coefficients, DCT: Discrete cosine transform, GRU: Gated recurrent unit, LSTM: Long short-term memory, CV: Computer vision, CRNN: Convolutional recurrent neural network, CE: Cross-entropy, MLE: Maximum likelihood estimation, GAN: Generative adversarial network.

Availability of data and materials

The datasets analysed during this article are available on the internet.

Authors’ contributions

XM was a major contributor in writing the manuscript. XL summarized challenges and future work. MP and WW substantively revised the manuscript. All authors read and approved the final manuscript.

Competing interests

WW is an editorial board member of EURASIP Journal on Audio Speech and Music Processing and also a guest editor of the special issue “Recent Advances in Computational Sound Scene Analysis”. The other authors declare that they have no competing interests.

Authors’ information

The authors are with Center for Vision Speech and Signal Processing (CVSSP), Department of Electrical and Electronic Engineering, Faculty of Engineering and Physical Sciences, University of Surrey, Guildford, GU2 7XH, UK.