Visually grounded few-shot word learning in low-resource settings

06/20/2023
by Leanne Nortje, et al.

We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this few-shot learning problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. Moreover, all previous studies were performed using English speech-image data. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e. fewer shots, and then illustrate how this approach can be applied for multimodal few-shot learning in a real low-resource language, Yoruba. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark. Many of the model's mistakes are due to confusion between visual concepts co-occurring in similar contexts. The experiments on Yoruba show the benefit of transferring knowledge from a multimodal model trained on a larger set of English speech-image data.
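To make the word-to-image attention idea concrete, the sketch below scores a spoken-word query against a set of candidate images by attending over image regions. This is a minimal illustration under assumed conventions (dot-product attention over patch embeddings, cosine similarity for the final score), not the authors' exact architecture; all function names and dimensions are hypothetical.

```python
import numpy as np

def word_to_image_similarity(word_emb, patch_embs):
    """Score how well an image matches a spoken word query.

    word_emb:   (d,) embedding of the spoken word
    patch_embs: (n, d) embeddings of n image regions/patches
    """
    # Dot-product attention scores between the word and each image patch.
    scores = patch_embs @ word_emb                     # shape (n,)
    # Softmax over patches so the word attends to the relevant regions.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Attention-pooled image representation.
    pooled = weights @ patch_embs                      # shape (d,)
    # Cosine similarity between the word and the attended image content.
    return float(pooled @ word_emb /
                 (np.linalg.norm(pooled) * np.linalg.norm(word_emb) + 1e-8))

# Few-shot retrieval: given a query word, pick the best-matching test image.
rng = np.random.default_rng(0)
query = rng.normal(size=16)                            # toy word embedding
images = [rng.normal(size=(5, 16)) for _ in range(4)]  # 4 images, 5 patches each
best = max(range(len(images)),
           key=lambda i: word_to_image_similarity(query, images[i]))
```

At test time, the model would run this comparison between the query word and every candidate image and return the highest-scoring one; attending over regions (rather than a single global image vector) is what lets the word match a localized visual concept.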


