Sound is ubiquitous in our daily lives. It carries a wealth of information about the environment, from sound scenes to individual events happening around us. Most people take for granted the ability to perceive and understand everyday sounds. However, extracting useful information from sounds is a challenging task for machines. With the development of machine learning, the field of machine listening has attracted increasing attention, and significant progress has been made in recent years in areas such as audio tagging [audioset, Xu2017unsupervised, kong2019weakly, Wang2020modelling], sound event detection [kong2019sed, kong2020sed_weakly, mesaros2021sound] and acoustic scene classification [barchiesi2015acoustic, Wang2021sa++]. However, in these areas, the focus has mostly been on identifying acoustic scenes or events in an audio clip, rather than on the relationships between the audio events and acoustic scenes.
Automated audio captioning (AAC) aims at describing the content of an audio clip using natural language. This is a cross-modal translation task at the intersection of audio processing and natural language processing (NLP) [drossos2017ac_1]. Compared with automatic speech recognition (ASR), audio captioning focuses only on environmental sounds rather than on the speech content that may be present in an audio clip. Compared with other popular audio-related tasks, audio captioning needs not only to determine what audio events are present, but also to capture the spatial-temporal relationships between these events and express them in natural language. An example caption, taken from the Clotho dataset, is “a person was walking on a sidewalk adjacent to a school where children were playing on the playground”, which describes the scenes and sound events in an audio clip.
Audio captioning has practical potential for various applications such as helping the hearing-impaired to understand environmental sounds, and analyzing sounds for video-based security surveillance systems. In addition, audio captioning can be used for multimedia retrieval [koepke2022audioretrieval], which can be applied in areas including education, film production, and web searching.
Unlike image and video captioning, which have been widely studied for almost a decade, audio captioning is a relatively new task that has been studied since 2017 [drossos2017ac_1]. In the past three years, this field has received increasing attention owing to the release of freely available datasets and its inclusion as a task in the DCASE (http://dcase.community/) Challenges in 2020 and 2021. A number of papers on audio captioning have been published, with deep learning being a popular approach. Specifically, the encoder-decoder framework has been adopted as a standard recipe for solving this cross-modal translation task. In this framework, the encoder extracts audio features from the input audio clips, and the decoder generates captions based on the extracted audio features. Figure 1 shows an overview of an encoder-decoder based AAC system. Because analyzing audio largely depends on obtaining robust audio features, different kinds of neural networks have been investigated as encoders, such as Recurrent Neural Networks (RNNs) [rumelhart1986rnn], Convolutional Neural Networks (CNNs) [lecun2015deep] and Transformers [vaswani2017attention]. For the decoder, RNNs and Transformers are usually employed, inspired by works in NLP. In addition to the encoder-decoder framework, auxiliary information such as keywords or sentence information [koizumi2020keywords, xu2021audiocaption_car], attentive approaches [kim2019audiocaps, drossos2017ac_1] and different training strategies [xu2021_sjtu, Berg2021continual, Liu2021cl4ac] have been proposed to improve the performance of captioning systems. However, there is still a large gap between achieved results and human-level performance [kim2019audiocaps].
To the best of our knowledge, no survey papers on audio captioning have been published so far. In this paper, we aim to provide a comprehensive overview of audio captioning with the hope of stimulating novel research ideas. Our survey considers articles published up to February 2022. Since the encoder-decoder framework has been a standard recipe for AAC systems, we develop a taxonomy of acoustic encoding and text decoding approaches.
This paper is organized as follows. In Section 2 and Section 3, we discuss acoustic encoding and text decoding approaches respectively. Auxiliary information is discussed in Section 4. We discuss training strategies adopted in the literature in Section 5. Furthermore, we review popular evaluation metrics and main datasets in Section 6 and Section 7 respectively. Finally, we discuss some open challenges and future research directions in Section 8 and briefly conclude this paper in Section 9.
2 Acoustic encoding
Analyzing the content of an audio clip largely depends on obtaining an effective feature representation for it, which is the aim of the encoder in an AAC system. Time-domain waveforms are lengthy 1-D signals, so it is challenging for machines to directly identify sound events or sound scenes from raw waveforms [virtanen2018computational]. Current popular approaches for acoustic encoding consist of two steps: first extracting acoustic features, and then passing them into an encoder to obtain compact audio features. In this section, we first discuss popular acoustic features used in the literature, then audio encoding approaches, focusing on those based on deep neural networks.
2.1 Acoustic features
Time-frequency representations, such as spectrograms, are widely used as the acoustic features. To obtain a spectrogram, an audio signal is first split into short frames, since an audio segment with a length of around 20-60 ms is usually regarded as quasi-stationary [virtanen2018computational]. A window function is then applied to each frame to enforce continuity and reduce spectral leakage [harris1978leakage]. After that, the short-time Fourier transform (STFT) is computed frame by frame to obtain the spectrogram. The spectrogram is a 2-D representation whose horizontal axis is time and whose vertical axis is frequency; the value at each point represents the energy at a specific time and frequency. Inspired by the selectivity of the human auditory system to different frequencies, the frequency axis of a spectrogram may be converted to different scales, resulting in representations such as the mel-spectrogram, the log mel-spectrogram, and the constant-Q transform [virtanen2018computational]. The log mel-spectrogram is popular due to its superior performance compared with other features [kong2020panns, hershey2017vggish, wu2019audiocaption_hospital]. In addition, mel-frequency cepstral coefficients (MFCCs) were used in some early works [kim2019audiocaps, Ikawa2019ac_spec]. MFCCs are calculated by applying a discrete cosine transform (DCT) to log mel-spectrograms. Compared with time-frequency representations, MFCCs contain less information and are only able to estimate the global spectral shape of an audio clip [virtanen2018computational]; thus MFCCs are rarely used in recent works.
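The feature-extraction steps described above (framing, windowing, Fourier transform, mel scaling, logarithm) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the pipeline of any cited work: the function names, the 40 ms window, the 20 ms hop, the 64 mel bands and the 16 kHz sampling rate are all assumptions chosen for the example, and the mel formula is one common variant.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, centre, right = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(wave, sr=16000, win_ms=40, hop_ms=20, n_mels=64):
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(wave) - win) // hop
    # Split the signal into overlapping frames of `win` samples.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(win)            # window each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrogram
    mel = power @ mel_filterbank(n_mels, win, sr).T   # warp frequency axis to mel
    return np.log(mel + 1e-10)                        # small offset avoids log(0)

# Usage: one second of a synthetic 1 kHz tone.
tone = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000)
features = log_mel_spectrogram(tone)  # shape (time frames, mel bands)
```

Applying a DCT along the mel axis of `features` would give MFCC-like coefficients; that step is omitted here.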
2.2 Neural networks
RNNs are designed to process sequential data with variable lengths [rumelhart1986rnn]. Audio is a time-series signal, so RNNs were naturally adopted as encoders in the initial works. In a simple recipe, an RNN is used to model temporal relationships between acoustic features, and the hidden states of the last layer of the RNN are regarded as the audio feature sequence. Drossos et al. [drossos2017ac_1] utilized a three-layered bi-directional gated recurrent unit (GRU) [chung2014gru] as the encoder. Rather than multi-layer RNNs, Xu et al. [xu2021audiocaption_car] and Wu et al. [wu2019audiocaption_hospital] used a single-layered uni-directional GRU, while Ikawa et al. [Ikawa2019ac_spec] used a single-layered bi-directional long short-term memory (LSTM) [hochreiter1997lstm]. Acoustic features such as spectrograms usually have thousands of time steps. Nguyen et al. [Nguyen2020temporalsub] argued that the length of the captions is significantly less than the length of the audio features, making it difficult for captioning models to learn the correspondence between words and audio features. They proposed a temporal sub-sampling method to sub-sample the audio features between the RNN layers, and showed that the temporal sub-sampling of audio features could benefit audio captioning methods.
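To make the idea of temporal sub-sampling concrete, the sketch below shortens a feature sequence by keeping every `factor`-th time step. The RNN layers themselves are omitted, and both the function name and the keep-every-second-step rule are assumptions for illustration; the exact scheme in [Nguyen2020temporalsub] may differ.

```python
import numpy as np

def temporal_subsample(features, factor=2):
    """Keep every `factor`-th time step of a (time, dim) feature sequence."""
    return features[::factor]

# Hypothetical 1000-step sequence of 128-dimensional features.
frames = np.random.default_rng(0).normal(size=(1000, 128))
after_layer1 = temporal_subsample(frames, factor=2)        # 500 time steps
after_layer2 = temporal_subsample(after_layer1, factor=2)  # 250 time steps
```

Each sub-sampling stage halves the sequence, bringing its length closer to that of a typical caption.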
The main advantages of employing RNNs as encoders are their simplicity and their ability to process sequential data. However, using RNNs alone as the encoder has not been found to give strong performance. The reason might be that acoustic features are usually long sequences, so RNNs may not be able to model long-range time dependencies effectively. In addition, obtaining an audio representation of fixed size from a long sequence of hidden states leads to excessive compression of information, making it difficult for the language decoder to generate fine-grained descriptions.
CNNs have been applied with great success in the field of computer vision (CV) [lecun2015deep]. In recent years, CNNs have been adapted to audio-related tasks and have shown a powerful ability to extract robust audio patterns [hershey2017vggish, kong2020panns]. CNNs treat spectrograms as 1-channel images and model local dependencies within them.
Many CNN models pre-trained on large audio datasets have been published, and many works directly employ them as the audio encoder. VGG-like CNNs [chen2020ac_CNN, Mei2021ac_trans] and ResNets [Ye2021peking, Han2021netease, Perez-Castanos2020ac_listen] are popular choices as these networks perform well on audio-related tasks such as audio tagging and sound event detection [kong2020panns]. In addition, 1-D CNNs are also incorporated to better exploit temporal patterns. For example, Eren et al. [eren2020acfake] and Han et al. [Han2021netease] used the Wavegram-Logmel-CNN adapted from PANNs [kong2020panns], which applies 1-D convolution to the raw waveform and 2-D convolution to the spectrogram, and combines the outputs of the 1-D and 2-D convolutional layers in deeper layers. Tran et al. [tran2020wavetransformer] also proposed to utilise 1-D and 2-D convolutions for extracting temporal and time-frequency information; however, they used only the spectrogram as input and reshaped it for 1-D convolution. To obtain global audio features, some methods use global pooling after the last convolutional block to summarize feature maps into a vector of fixed size [eren2020acfake], while others keep the time axis to get fine-grained temporal features and utilize an attention module to attend to the informative features when performing language decoding [Mei2021ac_trans, Ye2021peking].
CNNs outperform RNNs, and are now the dominant approach for audio encoding. The main advantages of CNNs are that they are invariant to time shift and good at modeling local dependencies within the spectrograms. However, CNNs have limited receptive fields, so modeling long-range time dependencies in a long audio clip requires a deep CNN.
Motivated by the demand for modelling local and long-range dependencies simultaneously, convolutional recurrent neural networks (CRNNs) [shi2016crnn], a combination of CNNs and RNNs, have also been applied as the audio encoder. In a CRNN, the RNN layers are introduced after the CNN layers to model the temporal relationship between extracted CNN features. Kim et al. [kim2019audiocaps] proposed a top-down multi-scale encoder where the features are extracted from two layers of the VGGish network [hershey2017vggish], that is, a fully connected layer for extracting the high-level semantic features and a convolutional layer for extracting the mid-level features. Those features are then encoded by a two-layer bi-directional LSTM where the semantic features are injected in the second layer. Takeuchi et al. [Takeuchi2020ntt_first] and Xu et al. [xu2020crnn_sjtu] both adopted a similar CRNN encoder without using multi-level features. Xu et al. [xu2021invest_cnn_crnn] compared CNN and CRNN encoders and showed that the CRNN outperformed the CNN when the encoders were trained from scratch, but brought little improvement when pre-training was applied. In general, CRNNs need more computation than CNNs but offer limited improvement.
2.2.4 Other approaches
The Transformer [vaswani2017attention] and its variants have probably been the most popular models in the fields of NLP and CV since 2017. Self-attention based encoders are also employed in recent works in audio captioning. Koizumi et al. [koizumi2020keywords] introduced a self-attention block after the CNN layers in the encoder to learn the temporal relationship between CNN features. Mei et al. [Mei2021ACT] proposed the Audio Captioning Transformer (ACT), where the encoder is a convolution-free Transformer that directly models the relationships between spectrogram patches. ACT shows comparable performance with CNN-based methods. In addition, convolution and self-attention can be combined as in [Narisetty2021ac_conformer] by leveraging a convolution-augmented Transformer (Conformer) [gulati2020conformer].
In summary, various neural network architectures have been investigated in order to obtain robust audio representations. Early works mainly adopted RNNs as encoders, and the trend then shifted from RNNs to CNNs and CRNNs due to their superior performance. Novel attention-based architectures are also strongly competitive in learning robust audio features.
3 Text decoding
The aim of the language decoder is to generate a caption given audio features from the encoder. Existing works all adopt an auto-regressive model, where each predicted word is conditioned on previous predictions. In addition to the main decoder block, there is often a word embedding layer before the main decoder block, which embeds each input word into a fixed-dimension vector. In this section, we first introduce popular word embeddings and then discuss main text decoding approaches.
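Auto-regressive generation can be illustrated with a greedy decoding loop, in which the word chosen at each step is appended to the input for the next step. The sketch below is only illustrative: `greedy_decode`, `step_fn` and `toy_step` are hypothetical names, and `toy_step` is a stand-in for a trained decoder, which would return vocabulary scores conditioned on the audio features and the words generated so far.

```python
import numpy as np

def greedy_decode(step_fn, audio_features, sos_id, eos_id, max_len=20):
    """Generate token ids one at a time, feeding back previous predictions."""
    tokens = [sos_id]
    for _ in range(max_len):
        logits = step_fn(audio_features, tokens)  # scores over the vocabulary
        next_id = int(np.argmax(logits))          # greedy choice
        tokens.append(next_id)
        if next_id == eos_id:                     # stop at end-of-sentence
            break
    return tokens

def toy_step(audio_features, tokens):
    # Hypothetical decoder over a 5-word vocabulary: it simply favours the
    # token whose id equals the number of tokens generated so far.
    logits = np.zeros(5)
    logits[min(len(tokens), 4)] = 1.0
    return logits

caption_ids = greedy_decode(toy_step, audio_features=None, sos_id=0, eos_id=4)
```

Beam search follows the same loop but keeps several candidate prefixes per step instead of one.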
3.1 Word embeddings
Word embedding is a means of converting discrete words into vectors so that text can be processed by neural networks. The simplest method is one-hot encoding, which embeds each word into a one-hot vector with a dimension equal to the size of the vocabulary. This method suffers from the curse of dimensionality and loses information about semantic relationships, so it is not widely used [jurafsky2009slp2]. Pre-trained word embeddings have become popular in recent years. These embeddings are trained using neural networks on large corpora and can capture semantic information, that is, semantically similar words are close to each other in the embedding space [mikolov2013word2vec]. Word2Vec [mikolov2013word2vec], GloVe [pennington2014glove] and fastText [mikolov2018fasttext] are widely used in existing audio captioning works [kim2019audiocaps, chen2020ac_CNN, xu2020crnn_sjtu, koizumi2020keywords, eren2020acfake]. With the popularity of large pre-trained language models, Weck et al. [Weck2021ac_offtheshelf] employed BERT [devlin2019bert] as a feature extractor to obtain word embeddings. They also compared the effect of different pre-trained word embeddings and found that BERT embeddings lead to the best performance, while others such as Word2Vec and GloVe also provide a slight improvement over randomly initialized word embeddings.
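The contrast between one-hot vectors and dense embeddings can be seen in a small sketch. The toy vocabulary and the 4-dimensional, randomly initialized embedding table are assumptions for illustration; in practice the table would be loaded from a pre-trained model such as Word2Vec or GloVe.

```python
import numpy as np

vocab = ["<sos>", "<eos>", "a", "dog", "barks", "rain", "falls"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector with vocabulary-sized dimension and a single 1."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

# Dense embedding table: one low-dimensional row per word.
# Random values here; pre-trained vectors would replace them.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

def embed(word):
    return embedding_table[word_to_id[word]]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Any two distinct one-hot vectors are orthogonal, so they carry no
# notion of similarity between words.
similarity_one_hot = cosine(one_hot("dog"), one_hot("rain"))
```

An embedding lookup is equivalent to multiplying the one-hot vector by the embedding table, which is why the table can be trained like any other layer.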
3.2 Neural networks
Sentences are also sequential data composed of discrete words, so RNNs are popularly employed as the language decoder. Figure 2 shows a diagram of an RNN-based language decoder. The audio feature sequence from the audio encoder is first aggregated to get a global feature representation and then passed to the RNN decoder for caption generation. Drossos et al. [drossos2017ac_1] proposed a 2-layer GRU as the decoder in the initial work. Due to the limited availability of data, many subsequent works have adopted single-layer RNNs, either GRU or LSTM [Ikawa2019ac_spec, kim2019audiocaps, Nguyen2020temporalsub, Cakir2020ac_multitask, Ye2021peking]. The main differences among these works lie in how the decoder interacts with audio features from the encoder side. In a simple recipe, a global audio feature representation is obtained by applying mean pooling to the audio feature sequence extracted by the encoder, which is then used as the initial hidden state of the RNN decoder [wu2019audiocaption_hospital, xu2021audiocaption_car] or is injected into the RNN decoder at each time step [Nguyen2020temporalsub, eren2020acfake]. This simple mean pooling method for getting a global audio representation is widely used in audio tagging tasks to detect what audio events are present in the whole audio clip [kong2020panns]. However, this method does not consider the relationships between audio features and is unable to capture fine-grained information about audio events. This fine-grained information could be important for caption generation. The attention mechanism has been employed to overcome this problem [drossos2017ac_1]. When generating a word at each time step, the RNN decoder can attend to the whole audio feature sequence and place more weight on the informative audio features. Thus, the global audio representation at each time step is a different combination of the whole audio feature sequence. In addition, to better exploit previously generated words, Ye et al. [Ye2021peking] introduced another attention module to attend to previously generated words at each time step.
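The difference between mean pooling and attention can be sketched as follows. Mean pooling gives one fixed summary of the audio feature sequence, whereas attention recomputes a weighted summary from the current decoder state at every step. The scaled dot-product scoring used here is an assumption made for brevity; the cited works may use other scoring functions, and the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def mean_pool(audio_feats):
    """Single global vector: every time step contributes equally."""
    return audio_feats.mean(axis=0)

def attention_context(decoder_state, audio_feats):
    """Weighted sum of audio features, with weights set by the decoder state."""
    scores = audio_feats @ decoder_state / np.sqrt(audio_feats.shape[1])
    weights = softmax(scores)            # one weight per time step, sums to 1
    return weights @ audio_feats, weights

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(50, 8))   # 50 time steps, 8-dim features
state = rng.normal(size=8)               # hypothetical decoder hidden state
context, weights = attention_context(state, audio_feats)
```

With an all-zero decoder state the weights become uniform and the context reduces to the mean-pooled vector, which shows that mean pooling is a special case of this weighting.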
RNNs with an attention mechanism show reasonable performance in audio captioning and are widely used. The main disadvantage of RNNs is that they may struggle to capture long-range dependencies between the generated words. Fortunately, audio captions are usually short, so the RNN decoders do not need to model very long-range dependencies.
Since Vaswani et al. [vaswani2017attention] proposed the Transformer in 2017, the self-attention mechanism has quickly become the basic building block in large language models. Transformer-based models such as BERT [devlin2019bert], GPT [radford2019gpt2] and BART [lewis2020bart] show superior performance to RNNs and dominate text-related tasks in the field of NLP. Transformer-based models have also recently been employed in audio captioning works and achieve state-of-the-art performance. The Transformer decoder is a stack of blocks, in which each block consists of a masked self-attention module, a cross-attention module and a feed-forward module. Figure 3 shows a diagram of a Transformer-based language decoder. The audio feature sequence can directly interact with the Transformer decoder through the cross-attention module without the need to generate a global audio feature representation. When generating a word at each time step, the masked self-attention module attends to the previously generated words to exploit context information. The output of the masked self-attention module then acts as the queries and interacts with the audio feature sequence from the encoder, which acts as the keys and values in the cross-attention module. The standard Transformer decoder is widely used [chen2020ac_CNN, Mei2021ac_trans, Han2021netease, Mei2021ACT]. Due to the limited data available in audio captioning, many of these works use shallow Transformer decoders, usually with only two layers, unlike NLP tasks where very deep Transformers are often used. Some modifications to the standard Transformer architecture have also been investigated. Xiao et al. [xiao2022local] introduced an attention-free Transformer decoder that could capture local information within audio features.
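The interaction between masked self-attention and cross-attention can be sketched with NumPy. This is a deliberately simplified, single-head version under several assumptions: there are no learned projection matrices, no layer normalization and no feed-forward module, and the function names are illustrative. It only shows how the causal mask restricts each word to earlier words and how the audio features enter as keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; masked positions get ~zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def decoder_block(word_feats, audio_feats):
    n = word_feats.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))  # word i sees words 0..i only
    # Masked self-attention over the words generated so far (residual added).
    h = word_feats + attention(word_feats, word_feats, word_feats, causal)
    # Cross-attention: word states are queries, audio features are keys/values.
    h = h + attention(h, audio_feats, audio_feats)
    return h  # the feed-forward module is omitted in this sketch

rng = np.random.default_rng(0)
words = rng.normal(size=(5, 16))    # 5 word embeddings
audio = rng.normal(size=(30, 16))   # 30 audio feature frames
out = decoder_block(words, audio)   # shape (5, 16)
```

Because of the causal mask, changing a later word leaves the outputs at earlier positions unchanged, which is what allows all positions to be trained in parallel.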
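In summary, Transformer-based decoders show state-of-the-art performance in audio captioning. They are also computationally efficient compared with RNN-based decoders because, during training, all positions of a caption can be processed in parallel rather than sequentially.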
4 Auxiliary information
In addition to the standard encoder-decoder architecture, researchers have investigated the use of auxiliary information such as keywords or sentence information to guide the caption generation. In this section, we discuss the auxiliary information used in the literature.
Directly mapping audio signals to sentences is challenging, so keywords are widely employed to address the word-selection indeterminacy problem and guide the caption generation [koizumi2020keywords]. To get the keywords, Kim et al. [kim2019audiocaps] retrieve the nearest training audio clip from AudioSet, the largest audio event dataset, and use its audio tagging labels as keywords. They then align these keywords with the audio features via an attention flow and feed the output into the decoder. Some datasets may not have corresponding label information for each audio clip, and in this case, researchers first extract keywords or tags from human-annotated captions according to rules such as word frequency and part of speech [chen2020ac_CNN, koizumi2020keywords, Cakir2020ac_multitask, Han2021netease]. Different methods have been investigated to make use of the keywords. Cakir et al. [Cakir2020ac_multitask] introduce a keyword decoder to detect the keywords of an audio clip, which is jointly trained with the audio captioning model. Chen et al. [chen2020ac_CNN] extract keywords from captions, and pre-train the audio encoder with an audio tagging task to enhance the ability of the encoder to learn robust audio patterns. Koizumi et al. [koizumi2020keywords] employ a keyword estimation branch after the encoder, and combine the keywords with audio features before passing them to the language decoder. Ye et al. [Ye2021peking] utilize multi-scale CNN features for keyword prediction. However, some researchers found that keywords might not improve the system performance in some situations. Takeuchi et al. [Takeuchi2020ntt_first] found that keywords may not work well when the model is trained from scratch, while Ye et al. [Ye2021peking] reported that their model did not converge when using keyword information only. The accuracy of the keywords could be a bottleneck, as wrong keywords might adversely affect the captioning performance.
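As a rough illustration of rule-based keyword extraction from captions, the sketch below ranks words by frequency after removing a small stop-word list. The stop-word list, the whitespace tokenization and the function name are assumptions for this example; the cited works also use part-of-speech rules, which would require a tagger and are not shown.

```python
from collections import Counter

# Hypothetical stop-word list; real systems use larger lists or POS filters.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "on", "in", "of", "to", "while"}

def extract_keywords(captions, top_k=3):
    """Return the `top_k` most frequent non-stop-words across the captions."""
    counts = Counter(
        word
        for caption in captions
        for word in caption.lower().split()
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(top_k)]

captions = [
    "a dog barks while rain falls",
    "the dog barks in the rain",
    "a dog is barking",
]
keywords = extract_keywords(captions, top_k=3)
```

Note that "barks" and "barking" are counted separately here; stemming or lemmatization would merge them.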
Sentence information has also been investigated. Ikawa et al. [Ikawa2019ac_spec] introduce a ‘specificity’ term to measure the output text based on the amount of information it carries. The model is trained to generate captions whose ‘specificity’ is close to that of the ground-truth captions. Similarly, Xu et al. [xu2021audiocaption_car] introduce a sentence loss to generate captions with higher semantic similarity to their ground truth, where they employ the pre-trained language model BERT [devlin2019bert] to get the sentence embedding.
Although different auxiliary information was used to improve the caption generation process, these methods have not brought significant improvements and they may not work well for all datasets. In the DCASE challenge 2021, most teams still used the standard encoder-decoder model without using auxiliary information and achieved promising results. How to improve the AAC system with the auxiliary information still needs more investigation.
| Reference | Audio Encoder | Language Decoder | Key aspects |
|---|---|---|---|
| Drossos et al. [drossos2017ac_1] | RNN | RNN | Attention |
| Wu et al. [wu2019audiocaption_hospital] | RNN | RNN | NA |
| Xu et al. [xu2021audiocaption_car] | RNN | RNN | Sentence similarity loss |
| Ikawa et al. [Ikawa2019ac_spec] | RNN | RNN | ‘Specificity’ term |
| Kim et al. [kim2019audiocaps] | CNN(VGGish)+RNN | RNN | Multi-scale features, semantic attention |
| Nguyen et al. [Nguyen2020temporalsub] | RNN | RNN | Temporal subsampling |
| Cakir et al. [Cakir2020ac_multitask] | RNN | RNN | Multi-task learning (keywords) |
| Perez-Castanos et al. [Perez-Castanos2020ac_gamma] | CNN | RNN | Attention |
| Chen et al. [chen2020ac_CNN] | CNN | Transformer | Pre-trained encoder |
| Xu et al. [xu2020crnn_sjtu] | CRNN | RNN | Reinforcement learning |
| Takeuchi et al. [Takeuchi2020ntt_first] | CNN+RNN | RNN | Keywords, sentence length estimation |
| Tran et al. [tran2020wavetransformer] | CNN | Transformer | 1-D and 2-D CNN |
| Eren et al. [eren2020acfake] | CNN(PANNs)+RNN | RNN | Keywords |
| Koizumi et al. [koizumi2020keywords] | CNN(VGGish)+Transformer | Transformer | Keywords |
| Koizumi et al. [koizumi2020ac_gpt2] | CNN(VGGish) | GPT-2+Transformer | GPT-2, similar captions retrieval |
| Xu et al. [xu2021invest_cnn_crnn] | CNN, CRNN | RNN | Attention, transfer learning |
| Mei et al. [Mei2021ac_trans] | CNN(PANNs) | Transformer | Transfer learning and reinforcement learning |
| Mei et al. [Mei2021ACT] | Transformer | Transformer | Full Transformer network |
| Han et al. [Han2021netease] | CNN(PANNs) | Transformer | Weakly-supervised pre-training, keywords |
| Ye et al. [Ye2021peking] | CNN(PANNs) | RNN | Keywords, attention |
| Gontier et al. [Gontier2021ac_bart] | CNN(VGGish) | BART | YAMNet tags, BART |
| Narisetty et al. [Narisetty2021ac_conformer] | CNN(PANNs)+Conformer | Transformer+RNN | ASR techniques |
| Liu et al. [Liu2021cl4ac] | CNN(PANNs) | Transformer | Contrastive learning |
| Won et al. [Won2021ac] | CNN(PANNs) | Transformer | Transfer learning |
| Berg et al. [Berg2021continual] | CNN | Transformer | Continual learning |
| Weck et al. [Weck2021ac_offtheshelf] | CNN(VGGish, YAMNet, OpenL3, COALA) | Transformer | Transfer learning |
| Mei et al. [mei2021diverse] | CNN(PANNs) | Transformer | GAN, diversity |
| Xiao et al. [xiao2022local] | CNN | Transformer | Attention-free Transformer |
| Liu et al. [liu2022leveraging] | CNN(PANNs) | BERT | Transfer learning, BERT |
| Chen et al. [chen2022contrastive] | CNN | Transformer | Transfer learning, contrastive learning |
5 Training strategies
Supervised training with a cross-entropy (CE) loss is a standard recipe for training an audio captioning model. The main drawback of this setting is that it may cause ‘exposure bias’ due to the discrepancy between training and testing [rennie2017exposurebias]. Reinforcement learning is introduced to solve this problem and directly optimize evaluation metrics. In addition, transfer learning has been widely used to overcome the data scarcity problem. In this section, we discuss the popular training strategies used in the literature.
5.1 Cross-entropy training
The cross-entropy loss with maximum likelihood estimation (MLE) is widely used for training audio captioning models. During training, this approach adopts a ‘teacher-forcing’ strategy [rennie2017exposurebias]. That is, the objective of training is to minimize the negative log-likelihood (equivalent to maximizing the log-likelihood) of the current ground-truth word given the previous ground-truth words at each time step, which can be formulated as follows:

$$\mathcal{L}_{\mathrm{CE}}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log p\left(y_t \mid y_{1:t-1}, x, \theta\right)$$

where $y_t$ is the ground-truth word at time step $t$, $T$ is the length of the ground-truth caption, $x$ is the input audio clip, and $\theta$ are the parameters of the audio captioning model.
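The cross-entropy objective can be computed in a few lines once the decoder has produced one score vector per time step under teacher forcing. In this sketch the function names are illustrative, the `logits` array stands in for the output of a decoder that was fed the ground-truth prefix at each step, and the per-step losses are averaged over the caption length (dropping the division gives the summed form).

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy_loss(logits, targets):
    """Mean negative log-likelihood of the ground-truth words.

    logits:  (T, V) scores, row t predicted from ground-truth words before t
    targets: (T,) ground-truth word ids
    """
    log_probs = log_softmax(logits)
    picked = log_probs[np.arange(len(targets)), targets]  # log p(y_t | ...)
    return float(-picked.mean())

targets = np.array([1, 2, 3, 4])
uniform_loss = cross_entropy_loss(np.zeros((4, 10)), targets)  # uninformed model
confident = np.full((4, 10), -20.0)
confident[np.arange(4), targets] = 20.0
confident_loss = cross_entropy_loss(confident, targets)        # near zero
```

A model that assigns uniform probability over a vocabulary of size V incurs a loss of log V, which is a useful sanity check at the start of training.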
The models trained via the cross-entropy loss can generate syntactically correct sentences and achieve high scores in terms of the evaluation metrics [Mei2021ac_trans]. However, there are also some disadvantages. First, the ‘teacher forcing’ strategy brings the problem known as ‘exposure bias’ [rennie2017exposurebias], that is, each word to be predicted is conditioned on previous ground-truth words in the training stage, while it is conditioned on previous output words in the test stage. This discrepancy leads to error accumulation during text generation in the test stage. Second, models tend to generate generic and simple captions even though each audio clip has multiple diverse human-annotated captions in the training set [mei2021diverse]. This is because the MLE training tends to encourage the use of highly frequent words appearing in the ground truth captions.
5.2 Reinforcement learning
Xu et al. [xu2020crnn_sjtu] employ a reinforcement learning approach to solve the ‘exposure bias’ problem and directly optimize the non-differentiable evaluation metrics. In a reinforcement learning setting, the captioning model is regarded as an agent and a policy is determined by the model’s parameters. The agent executes an action at each time step to sample a word according to the policy. Once a sentence is generated, the agent receives a reward for the generated sentence. The goal of training is to optimize the model to maximize the expected reward, which could be any evaluation metric. This can be formalized as minimizing the negative expected reward:

$$\mathcal{L}_{\mathrm{RL}}(\theta) = -\mathbb{E}_{w^s \sim p_\theta}\left[r(w^s)\right]$$

where $w^s$ is a sampled caption from the model, $r(w^s)$ is the reward of the sampled caption and $\theta$ are the model parameters. The caption can be sampled via Monte-Carlo sampling [sutton2018reinforcement]; however, this is computationally expensive. A more computationally efficient method, self-critical sequence training (SCST) [rennie2017exposurebias], is generally employed. SCST employs the reward of a sentence generated by greedy search as the baseline and thus avoids learning an estimate of expected future rewards. The expected gradient with respect to a single sampled caption can be approximated as:

$$\nabla_\theta \mathcal{L}_{\mathrm{RL}}(\theta) \approx -\left(r(w^s) - r(\hat{w})\right)\nabla_\theta \log p_\theta(w^s)$$

where $r(\hat{w})$ is the reward of a caption generated by the current model using greedy search.
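In practice the SCST gradient is usually obtained by differentiating a surrogate loss: the log-probability of the sampled caption scaled by the difference between its reward and the greedy baseline. The sketch below computes that surrogate value only; the function name and the example numbers are assumptions, and an automatic-differentiation framework would be needed to obtain the actual gradient.

```python
import numpy as np

def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Surrogate loss: -(r(sampled) - r(greedy)) * log p(sampled caption).

    sample_log_probs: per-word log-probabilities of the sampled caption
    sample_reward:    metric score of the sampled caption
    greedy_reward:    metric score of the greedy caption (the baseline)
    """
    advantage = sample_reward - greedy_reward
    return -advantage * float(np.sum(sample_log_probs))

# Hypothetical three-word sampled caption.
sample_log_probs = np.log([0.5, 0.25, 0.5])
loss_better = scst_loss(sample_log_probs, sample_reward=0.8, greedy_reward=0.6)
loss_equal = scst_loss(sample_log_probs, sample_reward=0.6, greedy_reward=0.6)
```

When the sampled caption beats the greedy one, minimizing the loss raises the probability of the sample; when it scores lower, its probability is pushed down; when the two rewards are equal, the sample contributes no gradient.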
Reinforcement learning can substantially improve the scores of the evaluation metrics, even though it is used to optimize only one metric. However, Mei et al. [Mei2021ac_trans] found that reinforcement learning may adversely affect the quality of the generated captions, for example by introducing repetitive words and incorrect syntax. This also suggests that existing evaluation metrics may not correlate well with human judgements.
5.3 Transfer learning
The availability of audio captioning datasets is limited due to the challenging and time-consuming process of data collection and annotation. To overcome the data scarcity problem, transfer learning is widely adopted. In the encoder of the captioning system, pre-trained audio neural networks such as VGGish [hershey2017vggish] and PANNs [kong2020panns] are widely used to initialize the parameters of encoders [Mei2021ac_trans, Ye2021peking, Han2021netease]. Xu et al. [xu2021invest_cnn_crnn] investigated the impact of pre-training on audio captioning performance. They show that audio encoders pre-trained with an audio tagging task give the best performance. In addition, Weck et al. [Weck2021ac_offtheshelf] compare four off-the-shelf audio networks. In all cases, pre-trained audio encoders substantially improve the performance of the audio captioning system. In the language decoder, although many pre-trained Transformer-based language models have been released in recent years, most of those models cannot be directly used as the language decoder, since the decoder needs to interact with audio features from the encoder via a cross-attention module. Koizumi et al. [koizumi2020ac_gpt2] utilize GPT-2 [radford2019gpt2] to get word embeddings. Gontier et al. [Gontier2021ac_bart] fine-tune BART [lewis2020bart] conditioned on pre-trained audio embeddings and tags to generate captions and achieve state-of-the-art performance. To leverage pre-trained BERT [devlin2019bert], Liu et al. [liu2022leveraging] investigate adding cross-attention layers with randomly initialized weights to pre-trained BERT models used as the decoder, and demonstrate the efficacy of pre-trained BERT models for audio captioning.
In summary, pre-trained audio encoders have proved effective for obtaining robust audio features and overcoming the data scarcity problem, while how to incorporate existing large pre-trained language models into an audio captioning system still needs further investigation.
5.4 Other approaches
To overcome the data scarcity problem, Liu et al. [Liu2021cl4ac] and Chen et al. [chen2022contrastive] both investigated using contrastive training to learn better alignment between audio and text by constructing contrastive samples during training. Berg et al. [Berg2021continual] presented a continual learning approach for continuously adapting an audio captioning method to new unseen general audio signals without forgetting learned information. Mei et al. [mei2021diverse] argued that an audio captioning system should, like human beings, be able to generate diverse captions for a given audio clip or across similar audio clips. They proposed an adversarial training framework based on a generative adversarial network (GAN) [yu2017seqgan] to encourage the diversity of audio captioning systems. In addition, Narisetty et al. [narisetty2022joint] proposed approaches for end-to-end joint modeling of speech recognition and audio captioning tasks.
A brief overview of the published audio captioning methods is shown in Table LABEL:tab:methods_statistics, which lists the type of deep neural network used to encode audio information and the language model used to generate captions; the final column shows the key aspects of each method.
6 Evaluation metrics
Evaluating audio captions is a challenging and subjective task, because an audio clip can correspond to several correct captions that use different words or grammar and/or describe different parts of the audio clip. Existing works adopt the evaluation metrics used in image captioning; most of these metrics are borrowed from NLP tasks such as machine translation and summarization, and the rest are designed specifically for image captioning. The automatic evaluation metrics compare machine-generated captions with human-annotated references, where the number of references per audio clip may vary across datasets. In this section, we first introduce the conventional rule-based metrics and then discuss some novel model-based metrics.
6.1 Conventional evaluation metrics
BLEU. BLEU (BiLingual Evaluation Understudy) [papineni2002bleu] was originally designed to measure the quality of machine-generated sentences for machine translation systems. BLEU calculates the modified $n$-gram precision for $n$ up to four: the counts of the $n$-grams in the candidate sentence are first collected and clipped by their corresponding maximum counts in the references; the clipped counts are then summed and divided by the total number of candidate $n$-grams, where an $n$-gram is a window of $n$ consecutive words. The modified $n$-gram precisions are averaged geometrically with uniform weights to account for both adequacy and fluency, where unigram matches account for adequacy while longer $n$-gram matches account for fluency. As precision tends to give short sentences higher scores, a brevity penalty is introduced to penalize short sentences.
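The clipping and brevity-penalty steps described above can be sketched in a few lines of Python. This is a minimal sentence-level illustration rather than a reference implementation: BLEU is normally accumulated over a whole corpus, and the function names (`ngrams`, `modified_precision`, `brevity_penalty`, `bleu`) as well as the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """All windows of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Maximum count of each n-gram over all references.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are broken towards the shorter reference.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Geometric mean of modified precisions (uniform weights) times the penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)
```

Because no smoothing is applied, a short caption with no matching 4-gram scores zero here; practical toolkits usually add smoothing or report corpus-level scores to avoid this.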
ROUGE. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [lin2004rouge] is a set of metrics proposed to measure the quality of a machine-generated summary. ROUGE-L, which is based on the longest common subsequence, is widely used in image and audio captioning. ROUGE-L first computes the length of the longest common subsequence between a candidate and a reference, which is then divided by the length of the candidate and by the length of the reference to get precision and recall, respectively. An F-measure combining precision and recall, weighted in favor of recall, is then calculated as the ROUGE-L score.
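As a rough illustration of the ROUGE-L computation described above, the following Python sketch computes the longest common subsequence with dynamic programming and combines precision and recall into an F-measure. The function names and the default `beta=1.2` are assumptions for this example (the value of `beta`, which controls how strongly recall is favored, differs between toolkits), and only a single reference is handled.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """F-measure over LCS-based precision and recall; beta > 1 favors recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

With several references, a common choice is to take the best precision and recall over the references before combining them, but that aggregation is left out of this sketch.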
METEOR. METEOR (Metric for Evaluation of Translation with Explicit ORdering) [banerjee2005meteor] is also a metric for evaluating machine translation systems. METEOR calculates unigram precision and unigram recall based on an explicit word-to-word matching between a candidate and one or more references, in terms of surface forms, stemmed forms, and meanings. An F-mean that places most of the weight on recall is then computed. To account for longer matches, matched unigrams that are in adjacent positions in both the candidate and the reference are grouped into chunks; a penalty based on the number of chunks is introduced and combined with the F-mean to give the final METEOR score.
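For concreteness, the combination of F-mean and chunk penalty can be written as follows, using the constants of the original METEOR formulation [banerjee2005meteor]. The symbols are introduced here for illustration: $m$ is the number of matched unigrams, $|c|$ and $|r|$ are the candidate and reference lengths, and $ch$ is the number of chunks. Later METEOR releases replace the fixed constants with tunable parameters, so scores from current toolkits may not follow exactly this form.

```latex
P = \frac{m}{|c|}, \qquad
R = \frac{m}{|r|}, \qquad
F_{\text{mean}} = \frac{10\,P\,R}{R + 9P}, \qquad
\text{Penalty} = 0.5 \left( \frac{ch}{m} \right)^{3}, \qquad
\text{METEOR} = F_{\text{mean}} \, (1 - \text{Penalty})
```

When all matched unigrams form a single contiguous chunk, the penalty is close to zero for long matches, whereas a candidate whose matches are scattered ($ch = m$) loses half of its F-mean.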
CIDEr. CIDEr (Consensus-based Image Description Evaluation) [vedantam2015cider] is an automatic consensus metric for evaluating image description quality. CIDEr also represents sentences by the $n$-grams present in them, where each $n$-gram is weighted by its term frequency-inverse document frequency (TF-IDF) weight, because $n$-grams that commonly occur across a dataset are likely to be uninformative. CIDEr computes the cosine similarity of the TF-IDF weighted $n$-grams between candidate and references, which accounts for both precision and recall. Similar to BLEU, CIDEr considers higher-order $n$-grams (up to four) to capture grammatical properties and richer semantics.
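The TF-IDF weighting and cosine similarity can be sketched as below. This is a simplified illustration under several assumptions: the function names are hypothetical, document frequency is counted per audio clip over a small in-memory list of reference sets, and the refinements used in the commonly reported CIDEr-D variant (count clipping, a length penalty, and score scaling) are omitted.

```python
from collections import Counter
import math


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def document_frequency(reference_sets, n):
    """Number of clips whose references contain each n-gram."""
    df = Counter()
    for refs in reference_sets:
        seen = set()
        for ref in refs:
            seen.update(ngram_counts(ref, n))
        df.update(seen)
    return df


def tfidf_vector(tokens, n, df, num_clips):
    """Sparse TF-IDF vector; n-grams unseen in the corpus get df = 1."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    return {g: (c / total) * math.log(num_clips / max(df[g], 1))
            for g, c in counts.items()}


def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm > 0 else 0.0


def cider(candidate, references, reference_sets, max_n=4):
    """Average over n = 1..max_n of the mean cosine similarity to the references."""
    num_clips = len(reference_sets)
    score = 0.0
    for n in range(1, max_n + 1):
        df = document_frequency(reference_sets, n)
        cand_vec = tfidf_vector(candidate, n, df, num_clips)
        sims = [cosine(cand_vec, tfidf_vector(ref, n, df, num_clips))
                for ref in references]
        score += sum(sims) / len(sims)
    return score / max_n
```

In this sketch an $n$-gram that appears in the references of every clip receives zero weight, which is how the metric discounts uninformative words such as articles.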
SPICE. SPICE (Semantic Propositional Image Caption Evaluation) [anderson2016spice] is an image captioning evaluation metric based on semantic content matching. SPICE parses both the candidate and the references into scene graphs in which objects, attributes and relations are encoded. An F-score is then calculated based on the matching of tuples extracted from the candidate and reference scene graphs. SPICE ignores the grammar and fluency of sentences and focuses only on semantic matching.
SPIDEr. SPIDEr [liu2017spider] was proposed for evaluating image captions and is used as the official ranking metric in the automated audio captioning task of the DCASE Challenge. SPIDEr is the average of SPICE and CIDEr: the SPICE score ensures captions are semantically faithful to the content, while the CIDEr score ensures captions are syntactically fluent.
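The tuple-matching F-score at the core of SPICE and the averaging step of SPIDEr can be illustrated as follows. The sketch assumes that the scene-graph tuples have already been extracted; the hand-written tuples in the usage note and the function names are hypothetical, and the exact set intersection used here is stricter than SPICE itself, which also matches synonyms.

```python
def tuple_f1(candidate_tuples, reference_tuples):
    """F-score over exactly matching scene-graph tuples."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


def spider(spice_score, cider_score):
    """SPIDEr is the arithmetic mean of the SPICE and CIDEr scores."""
    return 0.5 * (spice_score + cider_score)
```

For example, a candidate with tuples `{("dog",), ("dog", "barking"), ("dog", "in", "park")}` scored against reference tuples `{("dog",), ("dog", "barking"), ("park",), ("dog", "near", "park")}` matches two tuples, giving a precision of 2/3 and a recall of 1/2.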
6.2 Model-based metrics
BERTScore. BERTScore [Zhang2020BERTScore] is an evaluation metric for text generation tasks. Unlike conventional metrics, which mostly rely on surface-form similarity, BERTScore utilizes pre-trained BERT [devlin2019bert] contextual embeddings that can capture semantic similarity, distant dependencies and ordering. After obtaining the contextual embedding of each token through BERT, BERTScore measures the similarity between candidate and reference tokens through cosine similarity, where each token is matched to the most similar token in the other sentence. The matched token pairs are used to calculate precision, recall and an F1 measure. Importance weighting with inverse document frequency can also be applied to give rare words more weight.
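The greedy matching step can be sketched without any model by treating each token as a small vector. In the sketch below, plain Python lists stand in for BERT contextual embeddings (an assumption made only so that the example is self-contained), the function names are hypothetical, and the optional inverse document frequency weighting is left out.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_f1(candidate_vecs, reference_vecs):
    """Precision, recall and F1 from greedy token matching by cosine similarity."""
    # Recall: every reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in candidate_vecs)
                 for r in reference_vecs) / len(reference_vecs)
    # Precision: every candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in reference_vecs)
                    for c in candidate_vecs) / len(candidate_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With candidate vectors `[[1, 0], [0, 1]]` and reference vectors `[[1, 0], [0, 1], [1, 1]]`, every candidate token finds an exact match, while the third reference token is only partially covered, so recall falls below precision.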
SentenceBERT. SentenceBERT [reimers2019sentenceBERT] is not strictly an evaluation metric but a modification of the pre-trained BERT [devlin2019bert]. SentenceBERT can be used to obtain fixed-size embeddings for input sentences. The sentence embeddings are then used to calculate a similarity score, such as cosine similarity or Euclidean distance. This enables SentenceBERT to be used in audio captioning to compare candidate and reference captions at the sentence level.
FENSE. FENSE (Fluency ENhanced Sentence-bert Evaluation) [zhou2021fense] is a model-based evaluation metric specifically proposed for audio captioning. FENSE utilizes Sentence-BERT to derive sentence embeddings for candidate and reference captions, and calculates their average cosine similarity score. To capture fluency issues such as repeated words or phrases and incomplete sentences, FENSE uses a separate pre-trained error detector to penalize the Sentence-BERT score when such issues are detected.
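The way a fluency penalty can be combined with the sentence similarity is sketched below. The sketch assumes that the candidate-to-reference cosine similarities and the error detector's probability are already available; the function name and the default `threshold` and `penalty` values are illustrative assumptions rather than values confirmed by the text above.

```python
def fense_score(similarities, error_probability, threshold=0.9, penalty=0.9):
    """Average sentence similarity, scaled down when a fluency error is detected."""
    mean_similarity = sum(similarities) / len(similarities)
    if error_probability > threshold:
        # A detected fluency issue removes a fixed fraction of the score.
        return (1.0 - penalty) * mean_similarity
    return mean_similarity
```

Under these assumed defaults, a caption with similarities `[0.8, 0.6]` keeps its average of 0.7 when the detector reports a low error probability, and drops to about 0.07 when the probability exceeds the threshold.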
| AudioCaps | Kim et al. [kim2019audiocaps] | 0.614 | 0.446 | 0.203 | 0.593 | 0.144 |
| AudioCaps | Koizumi et al. [koizumi2020ac_gpt2] | 0.638 | 0.458 | 0.199 | 0.603 | 0.139 |
| AudioCaps | Mei et al. [Mei2021ACT] | 0.647 | 0.488 | 0.222 | 0.679 | 0.160 |
| AudioCaps | Xu et al. [xu2021invest_cnn_crnn] | 0.655 | 0.476 | 0.229 | 0.66 | 0.168 |
| AudioCaps | Liu et al. [liu2022leveraging] | 0.671 | 0.498 | 0.232 | 0.667 | 0.172 |
| AudioCaps | Gontier et al. [Gontier2021ac_bart] | 0.699 | 0.523 | 0.241 | 0.753 | 0.176 |
| AudioCaps | Eren et al. [eren2020acfake] | 0.710 | 0.490 | 0.290 | 0.750 | - |
| Clotho | Cakir et al. [Cakir2020ac_multitask] | 0.409 | 0.156 | 0.088 | 0.107 | 0.040 |
| Clotho | Nguyen et al. [Nguyen2020temporalsub] | 0.417 | 0.154 | 0.089 | 0.093 | 0.040 |
| Clotho | Tran et al. [tran2020wavetransformer] | 0.489 | 0.303 | 0.143 | 0.268 | 0.095 |
| Clotho | Takeuchi et al. [Takeuchi2020ntt_first] | 0.512 | 0.325 | 0.343 | 0.290 | 0.089 |
| Clotho | Koizumi et al. [koizumi2020keywords] | 0.521 | 0.309 | 0.149 | 0.258 | 0.097 |
| Clotho | Chen et al. [chen2020ac_CNN] | 0.534 | 0.343 | 0.160 | 0.346 | 0.108 |
| Clotho | Xu et al. [xu2020crnn_sjtu] | 0.561 | 0.341 | 0.162 | 0.338 | 0.108 |
| Clotho | Narisetty et al. [Narisetty2021ac_conformer] | 0.541 | 0.346 | 0.161 | 0.362 | 0.110 |
| Clotho | Liu et al. [Liu2021cl4ac] | 0.553 | 0.349 | 0.168 | 0.368 | 0.115 |
| Clotho | Xu et al. [xu2021invest_cnn_crnn] | 0.556 | 0.363 | 0.169 | 0.377 | 0.115 |
| Clotho | Chen et al. [chen2022contrastive] | 0.572 | 0.379 | 0.171 | 0.407 | 0.119 |
| Clotho | Xiao et al. [xiao2022local] | 0.578 | 0.387 | 0.177 | 0.434 | 0.122 |
| Clotho | Won et al. [Won2021ac] | 0.564 | 0.376 | 0.177 | 0.441 | 0.128 |
| Clotho | Han et al. [Han2021netease] | 0.583 | 0.391 | 0.179 | 0.456 | 0.128 |
| Clotho | Eren et al. [eren2020acfake] | 0.590 | 0.350 | 0.220 | 0.280 | - |
| Clotho | Ye et al. [Ye2021peking] | 0.648 | - | 0.190 | 0.491 | 0.131 |
| Clotho | Mei et al. [Mei2021ac_trans] | 0.634 | 0.423 | 0.187 | 0.476 | 0.134 |
7 Datasets
The release of high-quality audio captioning datasets has greatly promoted the development of this area. Existing datasets differ in many aspects, such as the number of audio clips, the number of captions per audio clip, and the length of each audio clip. These characteristics affect the design and the performance of audio captioning models. We describe the details of existing datasets in this section. To better understand the datasets, we then use the consensus scores of the previously introduced metrics to evaluate them.
7.1 Datasets description
AudioCaps. AudioCaps [kim2019audiocaps] is the largest audio captioning dataset so far. All the audio clips are 10 seconds long and are sourced from AudioSet, a large-scale audio event dataset [audioset]. The audio clips are selected according to criteria that ensure the chosen clips are balanced with respect to audio tags and diverse in terms of content. The audio clips are annotated by crowdworkers through Amazon Mechanical Turk (AMT). Annotators are provided with an audio clip together with corresponding word hints and video hints, and are required to write a natural language description based on the provided information.
The official release of AudioCaps contains 51k audio clips and is divided into a training set, a validation set and a test set. Each audio clip in the training set has one human-annotated caption, while those in the validation and test sets have five captions each. As audio clips in AudioSet are extracted from YouTube videos, some audio clips may no longer be downloadable, so the number of downloadable audio clips may differ from that of the official release of AudioCaps. The statistics in Table LABEL:tab:datasets_statis are reported based on the official release of AudioCaps.
| AudioCaps | 1, 5 | 10 s | 5066 | 8.79 |
| MACS | 2, 3, 4, 5 | 10 s | 2776 | 9.24 |
Clotho. Clotho [drossos2020clotho] is the official ranking dataset used in Task 6 (Automated Audio Captioning) of the DCASE challenges in 2020 and 2021. All the audio clips are sourced from the online platform Freesound [font2013freesound], and their durations are almost uniformly distributed between 15 and 30 seconds. Annotators are employed through AMT for crowdsourcing the captions. During the annotation process, only the audio signal was available to the annotators, without additional information such as the word hints and video hints used in AudioCaps, as such auxiliary information may introduce a bias.
The latest Clotho v2 contains audio clips in the training set and audio clips in the validation and evaluation set, respectively. Each audio clip contains five human-annotated captions, ranging from 8 to words long. In the DCASE challenges, all three published sets may be used to train the models, and the final performance is assessed by the organisers on a withheld test split. When reporting results in conference or journal papers, performance is assessed on the published evaluation set, and some authors include the validation set in training. As a result, model performances reported on Clotho may not be directly comparable.
MACS. MACS (Multi-Annotator Captioned Soundscapes) [Martin2021databias] consists of audio clips from the development set of the TAU Urban Acoustic Scenes 2019 dataset [Mesaros2018TAU]. The audio clips are all 10 seconds long, recorded in three acoustic scenes (airport, public square and park), and annotated by students. The annotation process contains two stages. Given a list of ten classes and an audio clip, the annotators are first required to select the audio events present in the clip from the given class list. Afterwards, the annotators are required to write a description of the audio clip.
MACS contains audio clips without being split into subsets. The number of captions per audio clip varies in the dataset. Most audio clips have five corresponding human-annotated captions, while some of them may only have two, three or four.
AudioCaption. AudioCaption is a domain-specific Mandarin-annotated audio captioning dataset. Two scene-specific sets have been published: one for hospital scene [wu2019audiocaption_hospital] and another for car scene [xu2021audiocaption_car]. The hospital-scene set contains audio clips with three captions per clip while the car-scene set contains audio clips with five captions per clip. All the audio clips are annotated by native Mandarin speakers.
7.2 Datasets evaluation
Since the datasets mentioned above are annotated under different protocols, they show different characteristics, such as the number of captions per audio clip, caption lengths and sample variance in multi-reference captions. We believe these characteristics influence the design and performance of audio captioning models. To better understand the datasets, we evaluate the three English-annotated datasets from different aspects. Table LABEL:tab:methods_scores reports the performance of published methods on the two main datasets. Table LABEL:tab:datasets_statis summarizes the datasets with some basic statistics. In addition, we use a consensus score [zhu2020consensus] to represent the agreement among the parallel reference captions for the same audio clip, and the results are shown in Table LABEL:tab:datasets_consensus. The consensus score among parallel reference captions for an audio clip is defined as:
where $c_i$ is the $i$-th caption and the metric can be any of those mentioned above. Since the number of references varies among datasets, we report the consensus score of AudioCaps using the validation and test sets, of Clotho using the training set, and of MACS using all the audio clips that have five reference captions.
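One plausible way to write such a consensus score is given below. It assumes a leave-one-out scheme in which each of the $N$ parallel references $c_i$ of an audio clip is scored against the remaining $N-1$ references and the results are averaged; the notation ($C$, $N$, $c_i$) is introduced here for illustration and is not recovered from the source.

```latex
\text{consensus} = \frac{1}{N} \sum_{i=1}^{N}
  \text{metric}\left( c_i,\; C \setminus \{ c_i \} \right),
\qquad C = \{ c_1, \dots, c_N \}
```

Under this reading, a dataset whose annotators describe a clip in similar terms obtains a high consensus score, while varied descriptions lower it.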
As the consensus scores are computed among the human-annotated captions, they can also be regarded as an upper bound representing human-level performance on each dataset. As can be seen from Table LABEL:tab:datasets_consensus, the consensus scores on AudioCaps and Clotho are close to each other, except that the SentenceBERT score on AudioCaps is clearly higher than that on Clotho. Surprisingly, the consensus scores on MACS are lower than those on the other two datasets, and only the SentenceBERT score is close to theirs. This may indicate that the human-annotated captions in MACS are more diverse than those in the other two datasets, and that SentenceBERT can better capture semantic relevance between diverse captions. The consensus scores can be regarded, to some extent, as a measure of dataset quality.
8 Challenges and future directions
Many deep learning-based methods have been proposed to improve automated audio captioning systems, and this task has seen rapid progress in recent years. However, there is still a large gap between the performance of the resulting systems and human level performance. In this section, we discuss challenges remaining in this area and envisage possible future research directions.
8.1 Data
There are several data-related challenges in audio captioning. First, data scarcity remains a main challenge. Existing datasets are limited in size, collecting an audio captioning dataset is time-consuming, and it is hard to control the quality of human-annotated captions. Han et al. [Han2021netease] collected a weakly labelled dataset from online sources to pre-train the AAC model and showed that more training data (even weakly labelled) can greatly improve system performance. This suggests that online audio clips with weakly labelled text descriptions can be used to learn more robust audio-text representations, as CLIP [radford2021clip] does for image-text pairs in computer vision.
Second, existing datasets usually do not cover all possible real-life scenarios, and thus audio captioning models cannot generalize well to different contexts. Martin et al. [Martin2021databias] investigated the bias of existing datasets from a lexical perspective. The bias problem still needs more investigation, e.g. into how it influences model performance.
8.2 Model and training strategies
Existing AAC methods all follow the encoder-decoder paradigm and generate sentences in an auto-regressive manner. These two techniques have become the standard recipe for audio captioning models. Nonetheless, novel methods should be investigated in future research. For example, BERT-like architectures that fuse the acoustic and textual modalities at an early stage could replace the encoder-decoder paradigm, as they work well in image captioning [zhou2020unified, li2020oscar]. Non-auto-regressive language models could reduce the inference time by generating all words in parallel [gao2019non_auto], which might be a worthwhile research direction.
As for training strategies, the standard cross-entropy loss brings the problem of ‘exposure bias’ and tends to produce simple and generic captions. Although reinforcement learning has been introduced to solve this problem, it may adversely affect the quality of generated captions. A promising line of exploration is to design new objective functions or add human feedback in a reinforcement learning setting to solve these problems. In addition, how to make use of the knowledge learned by large pre-trained language models to help caption generation needs more investigation.
8.3 Evaluation
The performance of audio captioning systems is generally assessed by comparing machine-generated captions with human-annotated reference captions. Each test audio clip is provided with multiple references to capture possible variations, as an audio clip can be described from different perspectives using distinct words and grammar. First, the multiple references cannot capture all possible variations, so a reasonable caption that is consistent with the content of the given audio clip but uses different words from the references may receive a low score. Second, as argued in [Mei2021ac_trans], existing evaluation metrics do not correlate well with human judgements. As incorporating human evaluation is time-consuming and expensive, future work should determine to what extent existing metrics correlate with human judgements, and develop more robust evaluation metrics.
8.4 Diversity and stylized captions
As argued in [dai2017im_diverse], a good captioning model should generate sentences that possess three properties: fidelity (the generated captions should reflect the audio content faithfully), naturalness (the captions should not be distinguishable as machine-generated), and diversity (the sentences should have rich and varied expressions, just as different people would describe an audio clip in different ways). However, many existing approaches only consider semantic fidelity. Further research should be conducted to improve the other two properties. In addition, stylized captioning is a worthwhile research direction, in which captioning systems are adapted to different audiences such as children.
8.5 Potential directions
There are also other potential directions for audio captioning. For example, temporal information about sound events is not well used in existing works; future work could investigate the use of information related to the activity and timing of sound events to generate more accurate captions. Information from other modalities could also be employed to train audio captioning models, for example by using audio-visual captioning methods [tian2018attempt, iashin2020better]. In addition, audio captioning can potentially be linked with other audio-language multi-modal tasks, such as audio-text retrieval [koepke2022audioretrieval], audio question answering [fayek2020temporal], text-based audio generation [liu2021conditional] and text-based audio source separation [liu2022separate].
9 Conclusion
Audio captioning is a fast-developing task involving both audio processing and natural language processing. In this paper, we have reviewed published audio captioning methods from the perspective of audio encoding and text decoding. We discussed the auxiliary information employed to guide caption generation and the training strategies adopted in the literature. In addition, the main evaluation metrics and datasets were reviewed. We briefly outlined challenges and potential research directions in this area. We hope this survey can serve as a comprehensive introduction to audio captioning and encourage novel ideas for future research.
For the purpose of open access, the authors have applied a creative commons attribution (CC BY) licence to any author accepted manuscript version arising.
This work is partly supported by grant EP/T019751/1 from the Engineering and Physical Sciences Research Council (EPSRC), a Newton Institutional Links Award from the British Council, titled “Automated Captioning of Image and Audio for Visually and Hearing Impaired” (Grant number 623805725), and a Research Scholarship from the China Scholarship Council (CSC) No.202006470010.
AAC: Automated audio captioning, NLP: Natural language processing, ASR: Automatic speech recognition, RNN: Recurrent neural network, CNN: Convolutional neural network, STFT: Short-time Fourier transform, MFCCs: Mel-frequency cepstral coefficients, DCT: Discrete cosine transform, GRU: Gated recurrent unit, LSTM: Long short-term memory, CV: Computer vision, CRNN: Convolutional recurrent neural network, CE: Cross-entropy, MLE: Maximum likelihood estimation, GAN: Generative adversarial network.
Availability of data and materials
The datasets analysed during this article are available on the internet.
XM was a major contributor in writing the manuscript. XL summarized challenges and future work. MP and WW substantively revised the manuscript. All authors read and approved the final manuscript.
WW is an editorial board member of EURASIP Journal on Audio Speech and Music Processing and also a guest editor of the special issue “Recent Advances in Computational Sound Scene Analysis”. The other authors declare that they have no competing interests.
The authors are with Center for Vision Speech and Signal Processing (CVSSP), Department of Electrical and Electronic Engineering, Faculty of Engineering and Physical Sciences, University of Surrey, Guildford, GU2 7XH, UK.