
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

by David Harwath et al.

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets demonstrating that our models implicitly learn semantically-coupled object and word detectors.
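The associative localizations described above arise from comparing an image encoder's spatial feature map against a speech encoder's frame-level features. As a rough illustration (not the authors' code; shapes, function names, and the pooling choice are assumptions for the sketch), one can form a "matchmap" of similarities between every image location and every audio frame, then pool it into a single retrieval score:

```python
import numpy as np

def matchmap(image_feats, audio_feats):
    """Dot-product similarity between each image location and audio frame.

    image_feats: (H, W, D) spatial feature map from an image encoder.
    audio_feats: (T, D) frame-level features from a speech encoder.
    Returns an (H, W, T) array of similarities.
    """
    return np.einsum('hwd,td->hwt', image_feats, audio_feats)

def pooled_similarity(mm):
    """One plausible pooling: max over image locations, mean over audio frames."""
    return mm.max(axis=(0, 1)).mean()

# Toy example with random features (dimensions are illustrative).
rng = np.random.default_rng(0)
img = rng.standard_normal((14, 14, 1024))   # 14x14 spatial grid
aud = rng.standard_normal((128, 1024))      # 128 audio frames
mm = matchmap(img, aud)
score = pooled_similarity(mm)
```

In a retrieval setup, such a score for matched image-caption pairs would be trained to exceed the score for mismatched pairs; high-similarity regions of the matchmap are then read off as implicit object/word localizations.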


Learning Word-Like Units from Joint Audio-Visual Analysis

Given a collection of images and spoken audio captions, we present a met...

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

In this paper we present the first model for directly synthesizing fluen...

Cascaded Multilingual Audio-Visual Learning from Videos

In this paper, we explore self-supervised audio-visual models that learn...

Visually grounded learning of keyword prediction from untranscribed speech

During language acquisition, infants have the benefit of visual cues to ...

Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Visually-grounded spoken language datasets can enable models to learn cr...

A Neural Network Model of Lexical Competition during Infant Spoken Word Recognition

Visual world studies show that upon hearing a word in a target-absent vi...

Learning to retrieve out-of-vocabulary words in speech recognition

Many Proper Names (PNs) are Out-Of-Vocabulary (OOV) words for speech rec...