Multilingual Audio Captioning using machine translated data

09/14/2023
by Matéo Cousin, et al.

Automated Audio Captioning (AAC) systems attempt to generate a natural language sentence, a caption, that describes the content of an audio recording in terms of sound events. Existing datasets provide audio-caption pairs with captions written in English only. In this work, we explore multilingual AAC using machine-translated captions. We automatically translated two prominent AAC datasets, AudioCaps and Clotho, from English to French, German and Spanish. We trained and evaluated monolingual systems in the four languages on AudioCaps and Clotho. In all cases, the models achieved similar performance, about 75 [...] captions of the AudioCaps eval subset. The French system, trained on the machine-translated version of AudioCaps, achieved significantly better results on the manual eval subset than the English system whose outputs we automatically translated to French. This argues in favor of building systems directly in a target language rather than simply translating the English system's captions into that language. Finally, we built a multilingual model, which achieved results in each language comparable to those of the corresponding monolingual system while using far fewer parameters than a collection of monolingual systems.
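The data preparation described in the abstract can be illustrated with a minimal sketch. The abstract does not say which translation system was used or how the multilingual model's training data was assembled, so everything below is an assumption: `fake_translate` is a hypothetical stand-in for a real machine translation system, and simply pooling the four monolingual sets is one plausible way to build multilingual training data, not necessarily the authors' method.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

# The four languages named in the abstract.
LANGS = ("en", "fr", "de", "es")


class Example(NamedTuple):
    audio_id: str
    lang: str
    caption: str


def build_translated_sets(
    english_pairs: List[Tuple[str, str]],
    translate: Callable[[str, str], str],
) -> Dict[str, List[Example]]:
    """Return one monolingual audio-caption set per language."""
    sets: Dict[str, List[Example]] = {lang: [] for lang in LANGS}
    for audio_id, caption in english_pairs:
        for lang in LANGS:
            # English captions are kept as-is; the others are machine translated.
            text = caption if lang == "en" else translate(caption, lang)
            sets[lang].append(Example(audio_id, lang, text))
    return sets


def pool_multilingual(sets: Dict[str, List[Example]]) -> List[Example]:
    """Concatenate the monolingual sets (assumed pooling strategy)."""
    return [example for lang in LANGS for example in sets[lang]]


def fake_translate(text: str, lang: str) -> str:
    # Hypothetical placeholder: a real system would return an actual translation.
    return f"[{lang}] {text}"


# Hypothetical audio-caption pairs, for illustration only.
pairs = [("clip_001", "A dog barks"), ("clip_002", "Rain falls on a roof")]
mono = build_translated_sets(pairs, fake_translate)
multi = pool_multilingual(mono)
```

Each entry of `mono` would feed one monolingual system, while `multi` would feed a single multilingual model. Every audio clip appears once per language in the pooled set, so the multilingual model sees four captions per clip but keeps a single set of weights.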
