Self-supervised Moving Vehicle Tracking with Stereo Sound

10/25/2019
by Chuang Gan, et al.

Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that can leverage unlabeled audio-visual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time. Since it is labor-intensive to manually annotate the correspondences between audio and object bounding boxes, we achieve this goal by using the co-occurrence of visual and audio streams in unlabeled videos as a form of self-supervision, without resorting to the collection of ground-truth annotations. In particular, we propose a framework that consists of a vision "teacher" network and a stereo-sound "student" network. During training, knowledge embodied in a well-established visual vehicle detection model is transferred to the audio domain using unlabeled videos as a bridge. At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera metadata, without any visual input. Experimental results on a newly collected Auditory Vehicle Tracking dataset verify that our proposed approach outperforms several baseline approaches. We also demonstrate that our cross-modal auditory localization approach can assist in the visual localization of moving vehicles under poor lighting conditions.
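
The teacher-student idea in the abstract can be illustrated with a deliberately tiny sketch. This is not the paper's architecture: the "teacher" below is a hypothetical stand-in that simply reads off a vehicle's horizontal position, the "student" is a two-parameter linear model, and the audio feature (a normalized left/right energy difference) and the synthetic clip generator are assumptions made only for illustration. What it shows is the training pattern: the teacher's outputs on unlabeled clips serve as pseudo-labels, and the student afterwards predicts from stereo audio alone.

```python
import random

def teacher_pseudo_label(clip):
    # Hypothetical stand-in for a pretrained visual detector: returns the
    # vehicle's horizontal box centre, normalised to [0, 1].
    return clip["vehicle_x"]

def stereo_feature(left_energy, right_energy):
    # Assumed toy audio feature: normalised inter-channel level difference.
    return (right_energy - left_energy) / (right_energy + left_energy)

def make_unlabeled_clip(rng):
    # Synthetic audio-visual sample: the channel nearer the vehicle is louder.
    x = rng.random()
    return {"vehicle_x": x,
            "left_energy": 1.0 - 0.8 * x,
            "right_energy": 0.2 + 0.8 * x}

def train_student(clips, epochs=200, lr=0.1):
    # Student: x_hat = w * feature + b, fitted by SGD on squared error
    # against the teacher's pseudo-labels (no human annotation involved).
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for clip in clips:
            f = stereo_feature(clip["left_energy"], clip["right_energy"])
            err = (w * f + b) - teacher_pseudo_label(clip)
            w -= lr * err * f
            b -= lr * err
    return w, b

def student_predict(w, b, left_energy, right_energy):
    # Inference uses stereo audio only; no visual input is needed.
    return w * stereo_feature(left_energy, right_energy) + b

rng = random.Random(0)
clips = [make_unlabeled_clip(rng) for _ in range(64)]
w, b = train_student(clips)
# A clip that is much louder on the right should map to the right of the frame.
x_hat = student_predict(w, b, left_energy=0.2, right_energy=1.0)
```

In this toy setting the student can recover the teacher's output exactly because the synthetic audio feature is a linear function of position; real stereo recordings would need a far richer model, as the abstract's use of a student network suggests.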


research
01/30/2022

Self-Supervised Moving Vehicle Detection from Audio-Visual Cues

Robust detection of moving vehicles is a critical task for any autonomou...
research
03/09/2020

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Humans can robustly recognize and localize objects by integrating visual...
research
09/06/2021

Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds

Humans can robustly recognize and localize objects by using visual and/o...
research
04/26/2022

Sound Localization by Self-Supervised Time Delay Estimation

Sounds reach one microphone in a stereo pair sooner than the other, resu...
research
04/09/2018

The Sound of Pixels

We introduce PixelPlayer, a system that, by leveraging large amounts of ...
research
10/27/2016

SoundNet: Learning Sound Representations from Unlabeled Video

We learn rich natural sound representations by capitalizing on large amo...
