Direct multimodal few-shot learning of speech and images

12/10/2020
by Leanne Nortje, et al.

We propose direct multimodal few-shot models that learn a shared embedding space of spoken words and images from only a few paired examples. Imagine an agent is shown an image along with a spoken word describing the object in the picture, e.g. "pen", "book" or "eraser". After observing a few paired examples of each class, the model is asked to identify the "book" in a set of unseen pictures. Previous work used a two-step indirect approach relying on learned unimodal representations: speech-speech and image-image comparisons are performed across the support set of given speech-image pairs. We propose two direct models that instead learn a single multimodal space in which inputs from different modalities are directly comparable: a multimodal triplet network (MTriplet) and a multimodal correspondence autoencoder (MCAE). To train these direct models, we mine speech-image pairs: the support set is used to pair up unlabelled in-domain speech and images. In a speech-to-image digit matching task, the direct models outperform the indirect models, with the MTriplet achieving the best multimodal five-shot accuracy. We show that the improvements stem from the combination of unsupervised and transfer learning in the direct models, and from the absence of two-step compounding errors.
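The abstract itself contains no code, so the following is only a rough illustrative sketch of a multimodal triplet objective of the kind described: a speech anchor embedding is pulled towards the embedding of its mined matching image and pushed away from a mismatched image. The encoders, feature dimensions and margin below are placeholder assumptions, not the authors' MTriplet architecture.

    # Hypothetical sketch (not the authors' code): a minimal multimodal
    # triplet loss in PyTorch over a shared speech-image embedding space.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeechEncoder(nn.Module):
        """Toy encoder mapping a speech feature sequence to a fixed embedding."""
        def __init__(self, feat_dim=39, embed_dim=128):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, embed_dim, batch_first=True)

        def forward(self, x):            # x: (batch, time, feat_dim)
            _, h = self.rnn(x)
            return F.normalize(h[-1], dim=-1)

    class ImageEncoder(nn.Module):
        """Toy encoder mapping a flattened image into the same embedding space."""
        def __init__(self, img_dim=28 * 28, embed_dim=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(),
                                     nn.Linear(256, embed_dim))

        def forward(self, x):            # x: (batch, img_dim)
            return F.normalize(self.net(x), dim=-1)

    def multimodal_triplet_loss(speech_anchor, image_pos, image_neg, margin=0.2):
        """Hinge loss on cosine distances computed across modalities."""
        pos = 1 - F.cosine_similarity(speech_anchor, image_pos)
        neg = 1 - F.cosine_similarity(speech_anchor, image_neg)
        return F.relu(pos - neg + margin).mean()

    # Usage with random stand-in data for mined speech-image pairs.
    speech_enc, image_enc = SpeechEncoder(), ImageEncoder()
    speech = torch.randn(8, 50, 39)       # 8 utterances, 50 frames of features
    pos_img = torch.randn(8, 28 * 28)     # images mined as matches
    neg_img = torch.randn(8, 28 * 28)     # mismatched images
    loss = multimodal_triplet_loss(speech_enc(speech),
                                   image_enc(pos_img),
                                   image_enc(neg_img))
    loss.backward()

Because both encoders map into one normalised space, a spoken query can be compared directly against candidate images with cosine similarity, which is the property the direct models rely on.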


Related research

08/14/2020 · Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images
We consider the task of multimodal one-shot speech-image matching. An ag...

11/09/2018 · Multimodal One-Shot Learning of Speech and Images
Imagine a robot is shown new concepts visually together with spoken tags...

06/20/2023 · Visually grounded few-shot word learning in low-resource settings
We propose a visually grounded speech model that learns new words and th...

05/25/2023 · Visually grounded few-shot word acquisition with fewer shots
We propose a visually grounded speech model that acquires new words and ...

11/11/2015 · Deep Multimodal Semantic Embeddings for Speech and Images
In this paper, we present a model which takes as input a corpus of image...

11/28/2020 · Unsupervised Spoken Term Discovery Based on Re-clustering of Hypothesized Speech Segments with Siamese and Triplet Networks
Spoken term discovery from untranscribed speech audio could be achieved ...

04/26/2016 · Using Indirect Encoding of Multiple Brains to Produce Multimodal Behavior
An important challenge in neuroevolution is to evolve complex neural net...
