Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

12/20/2017
by Andrew Owens, et al.

The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. 2016, with additional experiments and discussion.
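As a rough illustration of what "a statistical summary of the sound" could look like as a training target, the sketch below reduces a waveform to time-averaged statistics of its per-band log energy. The abstract does not specify which statistics are used, so the function name `sound_summary`, the band count, the frame length, and the choice of mean and standard deviation are all assumptions made for illustration, not the authors' method. A network would then be trained to predict such a vector from the paired video frame.

```python
import numpy as np

def sound_summary(waveform, n_bands=8, frame_len=256):
    """Hypothetical summary: per-band mean and std of log energy over time."""
    # Split into non-overlapping frames, dropping the ragged tail.
    n_frames = len(waveform) // frame_len
    frames = np.reshape(waveform[: n_frames * frame_len], (n_frames, frame_len))
    # Windowed magnitude spectrum of each frame.
    spectrum = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    # Group frequency bins into contiguous bands and sum the energy in each.
    bands = np.array_split(spectrum ** 2, n_bands, axis=1)
    energy = np.stack([b.sum(axis=1) for b in bands], axis=1)  # (n_frames, n_bands)
    log_energy = np.log(energy + 1e-8)
    # Statistics over time: one mean and one std per band.
    return np.concatenate([log_energy.mean(axis=0), log_energy.std(axis=0)])

# Two synthetic one-second clips at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
target_tone = sound_summary(np.sin(2 * np.pi * 440.0 * t))
target_noise = sound_summary(rng.standard_normal(16000))
```

Each clip maps to a fixed-length vector regardless of its duration, which is what lets it serve as a regression or classification target for a single frame.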


research
08/25/2016

Ambient Sound Provides Supervision for Visual Learning

The sound of crashing waves, the roar of fast-moving cars -- sound conve...
research
12/25/2019

Improving Visual Recognition using Ambient Sound for Supervision

Our brains combine vision and hearing to create a more elaborate interpr...
research
11/10/2021

Structure from Silence: Learning Scene Structure from Ambient Sound

From whirling ceiling fans to ticking clocks, the sounds that we hear su...
research
01/29/2018

Local Visual Microphones: Improved Sound Extraction from Silent Video

Sound waves cause small vibrations in nearby objects. A few techniques e...
research
12/04/2017

Visual to Sound: Generating Natural Sound for Videos in the Wild

As two of the five traditional human senses (sight, hearing, taste, smel...
research
05/11/2020

Foreground-Background Ambient Sound Scene Separation

Ambient sound scenes typically comprise multiple short events occurring ...
research
05/15/2013

Bioacoustic Signal Classification Based on Continuous Region Processing, Grid Masking and Artificial Neural Network

In this paper, we develop a novel method based on machine-learning and i...
