YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding

10/10/2022
by Kayode Olaleye, et al.

Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get labelled data, e.g. for documenting unwritten languages. However, most VGS studies are in English or other high-resource languages. This paper attempts to address this shortcoming. We collect and release a new single-speaker dataset of audio captions for 6k Flickr images in Yorùbá – a real low-resource language spoken in Nigeria. We train an attention-based VGS model where images are automatically tagged with English visual labels and paired with Yorùbá utterances. This enables cross-lingual keyword localisation: a written English query is detected and located in Yorùbá speech. To quantify the effect of the smaller dataset, we compare to English systems trained on similar and more data. We hope that this new dataset will stimulate research in the use of VGS models for real low-resource languages.
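The cross-lingual mechanism described above, where a written English query attends over frames of a Yorùbá utterance so that the attention peak marks where the keyword is spoken, can be sketched minimally as follows. This is an illustrative toy with random NumPy embeddings, not the paper's actual architecture; the function name, dimensions, and scoring scheme are all assumptions.

```python
import numpy as np

def localise_keyword(query_emb, frame_embs):
    """Toy attention-based keyword localisation (illustrative sketch only).

    query_emb:  (d,) embedding of a written English keyword
    frame_embs: (T, d) per-frame embeddings of a spoken Yorùbá utterance
    Returns (detection_score, best_frame): how strongly the keyword is
    detected, and the frame index where the attention peaks.
    """
    # Similarity between the query and every speech frame.
    scores = frame_embs @ query_emb                 # shape (T,)
    # Attention weights over frames (numerically stable softmax).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Detection: attention-weighted pooled similarity.
    detection_score = float(weights @ scores)
    # Localisation: the frame the attention peaks on.
    best_frame = int(weights.argmax())
    return detection_score, best_frame

# Tiny synthetic example: plant the query's direction at frame 2.
rng = np.random.default_rng(0)
d, T = 8, 5
query = np.ones(d) / np.sqrt(d)
frames = rng.normal(scale=0.1, size=(T, d))
frames[2] += 3 * query        # the "keyword" occurs at frame 2
score, frame = localise_keyword(query, frames)
print(frame)                  # attention peaks at the planted frame
```

In the paper's setting the query side would come from the automatic English visual tags and the frame side from a speech encoder trained on the image-caption pairs; here both are stand-in vectors.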

Related research

02/01/2023 · Visually Grounded Keyword Detection and Localisation for Low-Resource Languages
This study investigates the use of Visually Grounded Speech (VGS) models...

04/24/2019 · On the Contributions of Visual and Textual Supervision in Low-resource Semantic Speech Retrieval
Recent work has shown that speech paired with images can be used to lear...

02/08/2019 · Models of Visually Grounded Speech Signal Pay Attention To Nouns: a Bilingual Experiment on English and Japanese
We investigate the behaviour of attention in neural models of visually g...

06/13/2018 · Visually grounded cross-lingual keyword spotting in speech
Recent work considered how images paired with speech can be used as supe...

11/23/2018 · Learning pronunciation from a foreign language in speech synthesis networks
Although there are more than 65,000 languages in the world, the pronunci...

06/20/2023 · Visually grounded few-shot word learning in low-resource settings
We propose a visually grounded speech model that learns new words and th...

09/06/2023 · RoDia: A New Dataset for Romanian Dialect Identification from Speech
Dialect identification is a critical task in speech processing and langu...
