Visually grounded learning of keyword prediction from untranscribed speech

03/23/2017
by Herman Kamper, et al.

During language acquisition, infants have the benefit of visual cues to ground spoken language. Robots similarly have access to audio and visual sensors. Recent work has shown that images and spoken captions can be mapped into a meaningful common space, allowing images to be retrieved using speech and vice versa. In this setting of images paired with untranscribed spoken captions, we consider whether computer vision systems can be used to obtain textual labels for the speech. Concretely, we use an image-to-words multi-label visual classifier to tag images with soft textual labels, and then train a neural network to map from the speech to these soft targets. We show that the resulting speech system is able to predict which words occur in an utterance (acting as a spoken bag-of-words classifier) without seeing any parallel speech and text. We find that the model often confuses semantically related words, e.g. "man" and "person"; rather than being a weakness, these confusions make it even more effective as a semantic keyword spotter, retrieving utterances that are relevant to a query word even when the exact word is never spoken.
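The training setup described above can be sketched in a few lines. The following is a minimal illustration, not the authors' exact architecture, assuming PyTorch; the network shape, vocabulary size, feature dimension, and the random tensors standing in for real speech features and vision-tagger outputs are all placeholder assumptions.

```python
import torch
import torch.nn as nn

VOCAB = 1000     # size of the visual tagger's word vocabulary (assumed value)
FEAT_DIM = 40    # e.g. log-Mel filterbank dimension (assumed value)

class SpeechBoWNet(nn.Module):
    """Maps a speech utterance to per-word logits: a spoken bag-of-words model."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=512, vocab=VOCAB):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4), nn.ReLU(),
        )
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x):                     # x: (batch, feat_dim, n_frames)
        h = self.conv(x).max(dim=2).values    # max-pool over time
        return self.out(h)                    # logits over the word vocabulary

model = SpeechBoWNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()              # accepts soft targets in [0, 1]

# Dummy batch standing in for (speech, image) pairs. In practice the soft
# targets come from an image-to-words tagger applied to each paired image;
# no transcriptions are used anywhere.
speech = torch.randn(8, FEAT_DIM, 300)        # 8 utterances, 300 frames each
soft_targets = torch.rand(8, VOCAB)           # tagger's per-word probabilities

opt.zero_grad()
loss = loss_fn(model(speech), soft_targets)
loss.backward()
opt.step()

# At test time, sigmoid(logits) estimates P(word occurs in utterance):
# thresholding gives the predicted bag of words, and ranking utterances by
# the probability of a query word turns the model into a keyword spotter.
print(torch.sigmoid(model(speech))[0].topk(5))
```

Max-pooling over time is one plausible way to realise a bag-of-words predictor: it lets the model respond to a keyword wherever it occurs in the utterance, without needing word-level alignments.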

Related research

02/02/2022 · Keyword localisation in untranscribed speech using visually grounded speech models
Keyword localisation is the task of finding where in a speech utterance ...

06/16/2021 · Attention-Based Keyword Localisation in Speech using Visual Grounding
Visually grounded speech models learn from images paired with spoken cap...

10/05/2017 · Semantic keyword spotting by learning from images and speech
We consider the problem of representing semantic concepts in speech by l...

12/21/2018 · Symbolic inductive bias for visually grounded learning of spoken language
A widespread approach to processing spoken language is to first automati...

04/24/2019 · On the Contributions of Visual and Textual Supervision in Low-resource Semantic Speech Retrieval
Recent work has shown that speech paired with images can be used to lear...

04/04/2018 · Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate ...

02/22/2022 · Hidden bawls, whispers, and yelps: can text be made to sound more than just its words?
Whether a word was bawled, whispered, or yelped, captions will typically...
