Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

06/19/2023
by Zengjie Song, et al.

The framework of visually-guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor the visual feature extractor for informative visual guidance and to separately devise a module for feature fusion, while using U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter inefficient and may yield suboptimal performance, since jointly optimizing and harmonizing the various model components is challenging. By contrast, this paper presents a novel approach, dubbed audio-visual predictive coding (AVPC), that tackles this task in a more parameter-efficient and effective manner. The AVPC network features a simple ResNet-based video analysis network for deriving semantic visual features, and a predictive coding-based sound separation network that extracts audio features, fuses multimodal information, and predicts sound separation masks within the same architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop an effective self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds, while significantly reducing the model size. Code is available at: https://github.com/zjsong/Audio-Visual-Predictive-Coding.
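The central mechanism in the abstract is recursive audio-visual fusion by prediction-error minimization. Below is a minimal, hypothetical PyTorch sketch of that loop; the module names, feature dimensions, and linear predictor/corrector are illustrative assumptions, not the authors' implementation (see the linked repository for the actual code).

```python
# Toy sketch of iterative predictive-coding fusion, NOT the official AVPC code.
# All names (PCFusion, n_cycles, mask_head, dims) are hypothetical placeholders.
import torch
import torch.nn as nn


class PCFusion(nn.Module):
    """Refine an audio feature estimate by repeatedly predicting it from the
    visual feature and feeding the prediction error back as a correction."""

    def __init__(self, dim=512, mask_dim=256, n_cycles=4):
        super().__init__()
        self.n_cycles = n_cycles
        self.predict = nn.Linear(dim, dim)       # visual feature -> predicted audio feature
        self.correct = nn.Linear(dim, dim)       # prediction error -> feature correction
        self.mask_head = nn.Linear(dim, mask_dim)  # fused feature -> separation mask logits

    def forward(self, audio_feat, visual_feat):
        z = audio_feat
        for _ in range(self.n_cycles):
            pred = self.predict(visual_feat)      # top-down prediction of the audio feature
            err = z - pred                        # prediction error between features
            z = z - self.correct(err)             # recursive update of the fused estimate
        return torch.sigmoid(self.mask_head(z))   # mask used to separate the target sound


# Usage example with random features standing in for real network outputs.
audio_feat = torch.randn(8, 512)   # e.g. from the sound separation network's encoder
visual_feat = torch.randn(8, 512)  # e.g. from the ResNet-based video analysis network
mask = PCFusion()(audio_feat, visual_feat)
print(mask.shape)  # torch.Size([8, 256])
```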


