Can Audio Captions Be Evaluated with Image Caption Metrics?

10/10/2021
by Zelin Zhou, et al.

Automated audio captioning aims to generate textual descriptions of an audio clip. To evaluate the quality of generated audio captions, previous works directly adopt image captioning metrics such as SPICE and CIDEr without justifying their suitability in this new domain, which may mislead the development of advanced models. This problem remains unstudied due to the lack of human judgment datasets on caption quality. Therefore, we first construct two evaluation benchmarks, AudioCaps-Eval and Clotho-Eval. They are built with pairwise comparison instead of absolute rating to achieve better inter-annotator agreement. Current metrics are found to correlate poorly with human annotations on these datasets. To overcome their limitations, we propose a metric named FENSE, which combines the strength of Sentence-BERT in capturing similarity with a novel Error Detector that penalizes erroneous sentences for robustness. On the newly established benchmarks, FENSE outperforms current metrics by 14-25% accuracy. Code is available at: https://github.com/blmoistawinde/fense
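
The two ideas in the abstract, a similarity score that is penalized when an error detector fires, and evaluation by agreement with pairwise human preferences, can be sketched roughly as follows. This is a minimal illustration and not the released FENSE code: `bow_cosine` is a toy bag-of-words stand-in for Sentence-BERT similarity, `toy_error_prob` is a hypothetical stand-in for the Error Detector, and the `threshold` and `penalty` defaults are illustrative assumptions rather than values stated in the abstract.

```python
import math
from collections import Counter
from typing import Callable, Sequence, Tuple


def bow_cosine(a: str, b: str) -> float:
    """Toy stand-in for an embedding similarity: bag-of-words cosine."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_error_prob(caption: str) -> float:
    """Hypothetical error detector: flags immediately repeated words."""
    words = caption.lower().split()
    return 1.0 if any(x == y for x, y in zip(words, words[1:])) else 0.0


def penalized_score(
    candidate: str,
    references: Sequence[str],
    similarity: Callable[[str, str], float],
    error_prob: Callable[[str], float],
    threshold: float = 0.9,  # assumed value, for illustration only
    penalty: float = 0.9,    # assumed value, for illustration only
) -> float:
    """Mean similarity to the references, scaled down if an error is detected."""
    sim = sum(similarity(candidate, r) for r in references) / len(references)
    if error_prob(candidate) > threshold:
        sim *= 1.0 - penalty
    return sim


# One human judgment: (caption_a, caption_b, references, human_prefers_a)
Pair = Tuple[str, str, Sequence[str], bool]


def pairwise_accuracy(
    pairs: Sequence[Pair],
    score: Callable[[str, Sequence[str]], float],
) -> float:
    """Fraction of pairs where the metric picks the caption humans preferred."""
    agree = sum(
        (score(a, refs) > score(b, refs)) == prefers_a
        for a, b, refs, prefers_a in pairs
    )
    return agree / len(pairs)
```

In practice the similarity callable would wrap a sentence-embedding model and the error callable a trained classifier; keeping both as plain callables lets the scoring and the pairwise evaluation be checked independently of any particular model.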

research
10/29/2022

Improving Audio Captioning Using Semantic Similarity Metrics

Audio captioning quality metrics which are typically borrowed from the m...
research
05/31/2019

What does a Car-ssette tape tell?

Captioning has attracted much attention in image and video understanding...
research
04/10/2023

ImageCaptioner^2: Image Captioner for Image Captioning Bias Amplification Assessment

Most pre-trained learning systems are known to suffer from bias, which t...
research
06/17/2018

Learning to Evaluate Image Captioning

Evaluation metrics for image captioning face two challenges. Firstly, co...
research
09/06/2023

Detecting False Alarms and Misses in Audio Captions

Metrics to evaluate audio captions simply provide a score without much e...
research
11/12/2022

Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics

The analysis, processing, and extraction of meaningful information from ...
research
03/30/2023

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

The advancement of audio-language (AL) multimodal learning tasks has bee...
