Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

04/04/2018
by David Harwath, et al.

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of the natural images they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets, demonstrating that our models implicitly learn semantically coupled object and word detectors.
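
To make the image-audio retrieval task concrete, the following is a minimal sketch of how retrieval between paired embeddings might be scored. It is not the authors' implementation: the abstract does not specify the similarity function or the evaluation metric, so the dot-product similarity, the recall@k metric, the function names, and the toy embedding values below are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors (assumed similarity measure)."""
    return sum(a * b for a, b in zip(u, v))


def similarity_matrix(
    image_embs: List[Sequence[float]], audio_embs: List[Sequence[float]]
) -> List[List[float]]:
    """sim[i][j] is the similarity between image i and audio caption j."""
    return [[dot(img, aud) for aud in audio_embs] for img in image_embs]


def recall_at_k(sim: List[List[float]], k: int) -> float:
    """Fraction of captions whose paired image (same index) ranks in the top k."""
    n = len(sim)
    hits = 0
    for j in range(n):
        # Scores of every image against caption j.
        column = [sim[i][j] for i in range(n)]
        # Image indices sorted from most to least similar.
        ranked = sorted(range(n), key=lambda i: column[i], reverse=True)
        if j in ranked[:k]:
            hits += 1
    return hits / n


# Hypothetical toy embeddings standing in for network outputs;
# image i is assumed to be paired with caption i.
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
audio_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.6]]

sim = similarity_matrix(image_embs, audio_embs)
r_at_1 = recall_at_k(sim, k=1)
```

In a trained model, the two embedding lists would come from the image and audio branches of the network, and a higher recall@k would indicate that matching image-caption pairs score above mismatched ones.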


