Multimodal One-Shot Learning of Speech and Images

11/09/2018
by Ryan Eloff et al.

Imagine a robot is shown new concepts visually together with spoken tags, e.g. "milk", "eggs", "butter". After seeing one paired audio-visual example per class, it is shown a new set of unseen instances of these objects, and asked to pick the "milk". Without receiving any hard labels, could it learn to match the new continuous speech input to the correct visual instance? Although unimodal one-shot learning has been studied, where one labelled example in a single modality is given per class, this example motivates multimodal one-shot learning. Our main contribution is to formally define this task, and to propose several baseline and advanced models. We use a dataset of paired spoken and visual digits to specifically investigate recent advances in Siamese convolutional neural networks. Our best Siamese model achieves twice the accuracy of a nearest neighbour model using pixel-distance over images and dynamic time warping over speech in 11-way cross-modal matching.
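To make the baseline concrete, below is a minimal NumPy sketch of nearest-neighbour cross-modal matching as the abstract describes it: speech sequences are compared with dynamic time warping (DTW) and images with pixel distance. The function names, the (speech, image, label) support-set structure, and the exact two-step matching procedure are illustrative assumptions for this sketch, not the authors' released code.

```python
import numpy as np


def dtw_distance(a, b):
    """DTW distance between two speech feature sequences
    (arrays of shape [frames, dims]), via a simple DP table."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]


def pixel_distance(x, y):
    """Euclidean distance between two flattened images."""
    return np.linalg.norm(x.ravel() - y.ravel())


def one_shot_cross_modal_match(query_speech, support_set, test_images):
    """Given one spoken query, a one-shot support set of
    (speech, image, label) triples, and a set of unseen test images,
    return the index of the predicted matching test image."""
    # Step 1: nearest support speech example by DTW (unimodal).
    speech_dists = [dtw_distance(query_speech, s) for s, _, _ in support_set]
    _, matched_image, _ = support_set[int(np.argmin(speech_dists))]
    # Step 2: nearest test image to that example's paired image (unimodal).
    image_dists = [pixel_distance(matched_image, t) for t in test_images]
    return int(np.argmin(image_dists))
```

Note the design choice this sketch assumes: matching happens through the one-shot support set. The spoken query is first matched to its nearest support speech example, and that example's paired image is then matched against the test images, so speech and image features are never compared directly.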

Related research

Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images (08/14/2020)
We consider the task of multimodal one-shot speech-image matching. An ag...

Direct multimodal few-shot learning of speech and images (12/10/2020)
We propose direct multimodal few-shot models that learn a shared embeddi...

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models (01/16/2023)
The ability to quickly learn a new task with minimal instruction - known...

Weakly Supervised One-Shot Detection with Attention Siamese Networks (01/10/2018)
We consider the task of weakly supervised one-shot detection. In this ta...

Realizing Flame State Monitoring with Very Few Visual or Infrared Images via Few-Shot Learning (10/14/2022)
The success of current machine learning on image-based combustion monito...

Visually grounded few-shot word learning in low-resource settings (06/20/2023)
We propose a visually grounded speech model that learns new words and th...

One-shot learning of paired associations by a reservoir computing model with Hebbian plasticity (06/07/2021)
One-shot learning can be achieved by algorithms and animals, but how the...
