Improving Audio-Visual Segmentation with Bidirectional Generation

08/16/2023
by Dawei Hao, et al.

The aim of audio-visual segmentation (AVS) is to precisely segment audible objects in videos at the pixel level. Traditional approaches often tackle this challenge by fusing information from multiple modalities, where the contribution of each modality is modeled implicitly or explicitly. Nevertheless, the interconnections between the modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate both the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby enhancing AVS performance. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and minimizes the reconstruction error. Moreover, recognizing that many sounds are linked to object movements, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that are difficult to capture with conventional optical flow methods. To demonstrate the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely recognized AVSBench benchmark, establishing a new state of the art, particularly on the challenging MS3 subset, which involves segmenting multiple sound sources. To facilitate reproducibility, we plan to release both the source code and the pre-trained model.
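The abstract describes a visual-to-audio projection component that reconstructs audio features from predicted segmentation masks and penalizes the reconstruction error. The sketch below illustrates one way such a component could be wired up; the VisualToAudioProjector module, the mask-weighted pooling, the feature dimensions, and the L2 reconstruction loss are illustrative assumptions and not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualToAudioProjector(nn.Module):
    """Hypothetical visual-to-audio projection head.

    Pools visual features under the predicted segmentation mask and
    projects them into the audio-embedding space, so a reconstruction
    loss against the real audio features can be applied.
    """

    def __init__(self, visual_dim: int = 256, audio_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, visual_dim),
            nn.ReLU(inplace=True),
            nn.Linear(visual_dim, audio_dim),
        )

    def forward(self, visual_feats: torch.Tensor, mask_logits: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, C, H, W); mask_logits: (B, 1, H, W)
        mask = torch.sigmoid(mask_logits)
        # Mask-weighted average pooling over spatial positions.
        pooled = (visual_feats * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)
        return self.proj(pooled)  # (B, audio_dim)


def audio_reconstruction_loss(pred_audio: torch.Tensor, target_audio: torch.Tensor) -> torch.Tensor:
    # Simple L2 reconstruction error between reconstructed and real audio features.
    return F.mse_loss(pred_audio, target_audio)


if __name__ == "__main__":
    projector = VisualToAudioProjector()
    visual_feats = torch.randn(2, 256, 56, 56)  # backbone visual features
    mask_logits = torch.randn(2, 1, 56, 56)     # predicted segmentation logits
    audio_feats = torch.randn(2, 128)           # e.g. a VGGish-style audio embedding
    loss = audio_reconstruction_loss(projector(visual_feats, mask_logits), audio_feats)
    print(loss.item())
```

In this sketch the reconstruction term would be added to the usual segmentation loss, encouraging the predicted masks to carry enough information about the sounding object to regenerate its audio embedding.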
