A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

05/30/2023
by Shentong Mo, et al.

The ability to accurately recognize, localize, and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities were tackled separately, with methods developed independently for each task. However, because source localization, separation, and recognition are interconnected, independent models are likely to yield suboptimal performance, as they fail to capture the interdependence between these tasks. To address this problem, we propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition. OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives. The first objective aligns audio and visual representations through a localized audio-visual correspondence loss. The second tackles visual source separation using a traditional mix-and-separate framework. Finally, the third objective reinforces visual feature separation and localization by mixing images in pixel space and aligning their representations with those of all corresponding sound sources. Extensive experiments on the MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM on all three tasks (audio-visual source localization, separation, and nearest-neighbor recognition) and empirically demonstrate strong positive transfer between them.
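
As a rough illustration of the data construction behind the second and third objectives, the sketch below builds a synthetic audio mixture with ratio-mask targets (mix-and-separate) and blends two images in pixel space. It is a minimal sketch under stated assumptions, not OneAVM's implementation: the function names, the additive magnitude-spectrogram mixture, the ratio-mask targets, and the blending weight `alpha` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def mix_and_separate_targets(spec_a, spec_b, eps=1e-8):
    """Build a synthetic mixture and per-source ratio masks.

    Assumption: magnitude spectrograms add approximately, so the
    mixture is the element-wise sum of the two sources.
    """
    mixture = spec_a + spec_b
    mask_a = spec_a / (mixture + eps)  # fraction of energy owned by source A
    mask_b = spec_b / (mixture + eps)  # fraction of energy owned by source B
    return mixture, mask_a, mask_b

def mix_images(img_a, img_b, alpha=0.5):
    """Blend two images in pixel space (hypothetical convex combination)."""
    return alpha * img_a + (1.0 - alpha) * img_b

# Toy inputs standing in for real spectrograms (freq x time) and RGB frames.
rng = np.random.default_rng(0)
spec_a = rng.random((256, 64))
spec_b = rng.random((256, 64))
img_a = rng.random((224, 224, 3))
img_b = rng.random((224, 224, 3))

mixture, mask_a, mask_b = mix_and_separate_targets(spec_a, spec_b)
recon_a = mask_a * mixture          # applying the target mask recovers source A
mixed_img = mix_images(img_a, img_b)
```

In a training loop built on this idea, a separation decoder would be asked to predict `mask_a` from `mixture` given the visual features of the first source, while the representation of `mixed_img` would be aligned with the audio of both sources.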
