Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations

04/17/2021
by   Lingyu Zhu, et al.

The objective of this paper is to perform audio-visual sound source separation, i.e., to separate the component audio signals from a mixture based on videos of the sound sources. Moreover, we aim to pinpoint the source location in the input video sequence. Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type (e.g. a human playing an instrument) and pre-trained motion detectors (e.g. keypoints or optical flow). As a result, however, these models are limited to a particular application domain. In this paper, we address these limitations and make the following contributions: i) we propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise in appearance and motion cues, respectively. The entire system is trained in a self-supervised manner; ii) we introduce an Audio-Motion Embedding (AME) framework to explicitly represent the motions that are related to sound; iii) we propose an audio-motion transformer architecture for audio and motion feature fusion; iv) we demonstrate state-of-the-art performance on two challenging datasets (MUSIC-21 and AVE) despite the fact that we do not use any pre-trained keypoint detectors or optical flow estimators. Project page: https://ly-zhu.github.io/self-supervised-motion-representations
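The abstract does not spell out how the self-supervised training signal is built. A common setup in visually guided separation work is "mix-and-separate": audio from two videos is mixed synthetically, and the original tracks serve as free targets. The sketch below illustrates only that general idea and is not the AMnet implementation. The function names, the use of ratio masks, the L1 loss, and the simplification that magnitude spectrograms add linearly are all assumptions made for illustration.

```python
import random

def mix_and_separate_targets(spec_a, spec_b, eps=1e-8):
    """Build a synthetic mixture and per-source ratio-mask targets.

    spec_a, spec_b: magnitude spectrograms as nested lists (freq x time)
    of the same shape. Assumes magnitudes add linearly (a simplification).
    """
    mixture = [[a + b for a, b in zip(ra, rb)]
               for ra, rb in zip(spec_a, spec_b)]
    mask_a = [[a / (m + eps) for a, m in zip(ra, rm)]
              for ra, rm in zip(spec_a, mixture)]
    mask_b = [[b / (m + eps) for b, m in zip(rb, rm)]
              for rb, rm in zip(spec_b, mixture)]
    return mixture, mask_a, mask_b

def apply_mask(mixture, mask):
    """Recover one source by element-wise masking of the mixture."""
    return [[m * k for m, k in zip(rm, rk)]
            for rm, rk in zip(mixture, mask)]

def l1_loss(pred, target):
    """Mean absolute error between two spectrograms of equal shape."""
    total = sum(abs(p - t)
                for rp, rt in zip(pred, target)
                for p, t in zip(rp, rt))
    count = sum(len(row) for row in pred)
    return total / count

# Toy data standing in for the spectrograms of two different videos.
rng = random.Random(0)
spec_a = [[rng.random() for _ in range(4)] for _ in range(3)]
spec_b = [[rng.random() for _ in range(4)] for _ in range(3)]

mixture, mask_a, mask_b = mix_and_separate_targets(spec_a, spec_b)

# With the target mask itself, the source is recovered almost exactly.
# A trained network would instead predict the mask from the mixture and
# the visual features of the corresponding video.
recovered_a = apply_mask(mixture, mask_a)
loss_a = l1_loss(recovered_a, spec_a)
```

In a real system the mask would be the output of a network conditioned on appearance and motion features, and `l1_loss` (or a similar loss) would be computed between the predicted and target masks or spectrograms.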

research
06/04/2020

Visually Guided Sound Source Separation using Cascaded Opponent Filter Network

The objective of this paper is to recover the original component signals...
research
04/20/2020

Music Gesture for Visual Sound Separation

Recent deep learning approaches have achieved impressive performance on ...
research
04/11/2019

The Sound of Motions

Sounds originate from object motions and vibrations of surrounding air. ...
research
07/09/2022

Learning to Separate Voices by Spatial Regions

We consider the problem of audio voice separation for binaural applicati...
research
07/15/2020

Separating Sounds from a Single Image

Recently, visual information has been widely used to aid the sound sourc...
research
08/16/2023

Improving Audio-Visual Segmentation with Bidirectional Generation

The aim of audio-visual segmentation (AVS) is to precisely differentiate...
research
06/19/2023

Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

The framework of visually-guided sound source separation generally consi...
