Objects that Sound

12/18/2017
by Relja Arandjelović, et al.

This paper has two objectives: first, networks that embed audio and visual inputs into a common space suitable for cross-modal retrieval; and second, a network that localizes the object that sounds in an image, given the audio signal. We achieve both objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video. To this end, we design new network architectures that can be trained on the AVC task for two functionalities: cross-modal retrieval, and localizing the source of a sound in an image. We make the following contributions: (i) we show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) we explore various architectures for the AVC task, including visual streams that ingest a single image, multiple images, or a single image and multi-frame optical flow; (iii) we show that the semantic object that sounds within an image can be localized using only the sound, with no motion or flow information; and (iv) we give a cautionary tale on how to avoid undesirable shortcuts in data preparation.
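
To make the idea of a common embedding space concrete, here is a minimal NumPy sketch of how such embeddings could be used once learnt. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Euclidean distance, the `scale` and `bias` parameters, and the dot-product heat map are all hypothetical choices, and the embeddings themselves are assumed to come from already-trained audio and visual networks.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length (zero vectors stay zero)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def correspondence_score(audio_emb, visual_emb, scale=-1.0, bias=1.0):
    """Map the distance between two embeddings to a score in (0, 1).

    scale and bias are hypothetical stand-ins for parameters that would be
    learnt on the AVC task; a smaller distance gives a higher score.
    """
    dist = np.linalg.norm(l2_normalize(audio_emb) - l2_normalize(visual_emb), axis=-1)
    return 1.0 / (1.0 + np.exp(-(scale * dist + bias)))


def retrieve(query_emb, gallery_emb, k=3):
    """Return indices of the k gallery embeddings closest to the query.

    Query and gallery may come from the same modality (within-mode) or from
    different modalities (between-mode), since both live in one space.
    """
    dists = np.linalg.norm(l2_normalize(gallery_emb) - l2_normalize(query_emb), axis=-1)
    return np.argsort(dists)[:k]


def localize(audio_emb, visual_grid):
    """Compare one audio embedding with an (H, W, D) grid of visual embeddings.

    Returns an (H, W) similarity map; the peak marks the image region whose
    embedding best matches the sound (an assumed, simplified scoring rule).
    """
    return l2_normalize(visual_grid) @ l2_normalize(audio_emb)
```

In this sketch, retrieval is a nearest-neighbour search and localization is a per-region similarity map, so both reuse the same embeddings; how the networks are actually trained to produce them is outside its scope.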

research
09/19/2023

Sound Source Localization is All about Cross-Modal Alignment

Humans can easily perceive the direction of sound sources in a visual sc...
research
09/21/2018

Perfect match: Improved cross-modal embeddings for audio-visual synchronisation

This paper proposes a new strategy for learning powerful cross-modal emb...
research
11/21/2022

TimbreCLIP: Connecting Timbre to Text and Images

We present work in progress on TimbreCLIP, an audio-text cross modal emb...
research
07/23/2019

Multisensory Learning Framework for Robot Drumming

The hype about sensorimotor learning is currently reaching high fever, t...
research
09/06/2021

Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds

Humans can robustly recognize and localize objects by using visual and/o...
research
02/23/2023

Data leakage in cross-modal retrieval training: A case study

The recent progress in text-based audio retrieval was largely propelled ...
research
01/06/2021

Multi-Stage Residual Hiding for Image-into-Audio Steganography

The widespread application of audio communication technologies has speed...
