A Closer Look at Audio-Visual Semantic Segmentation

04/06/2023
by Yuanhong Chen, et al.

Audio-visual segmentation (AVS) is a complex task that involves accurately segmenting the corresponding sounding object based on audio-visual queries. Successful audio-visual learning requires two essential components: 1) an unbiased dataset with high-quality pixel-level multi-class labels, and 2) a model capable of effectively linking audio information with its corresponding visual object. However, these two requirements are only partially addressed by current methods, with training sets containing biased audio-visual data, and models that generalise poorly beyond this biased training set. In this work, we propose a new strategy to build cost-effective and relatively unbiased audio-visual semantic segmentation benchmarks. Our strategy, called Visual Post-production (VPO), explores the observation that it is not necessary to have explicit audio-visual pairs extracted from single video sources to build such benchmarks. We also refine the previously proposed AVSBench to transform it into the audio-visual semantic segmentation benchmark AVSBench-Single+. Furthermore, this paper introduces a new pixel-wise audio-visual contrastive learning method to enable a better generalisation of the model beyond the training set. We verify the validity of the VPO strategy by showing that state-of-the-art (SOTA) models trained with datasets built by matching audio and visual data from different sources or with datasets containing audio and visual data from the same video source produce almost the same accuracy. Then, using the proposed VPO benchmarks and AVSBench-Single+, we show that our method produces more accurate audio-visual semantic segmentation than SOTA models. Code and dataset will be available.
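The pixel-wise audio-visual contrastive learning the abstract mentions can be illustrated with an InfoNCE-style objective: pixels belonging to the sounding object are pulled toward the audio embedding, while all other pixels serve as negatives. The sketch below is illustrative only, assuming normalised embeddings and a single audio clip per frame; the function name, shapes, and temperature are assumptions, not the paper's exact formulation.

```python
import numpy as np

def pixel_audio_contrastive_loss(audio_emb, pixel_emb, pos_mask, tau=0.1):
    """InfoNCE-style pixel-wise audio-visual contrastive loss (sketch).

    audio_emb: (D,)       one audio clip embedding
    pixel_emb: (N, D)     N pixel embeddings from the visual backbone
    pos_mask:  (N,) bool  True where the pixel belongs to the sounding object
    tau:                  softmax temperature (assumed hyperparameter)
    """
    # L2-normalise so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb)
    v = pixel_emb / np.linalg.norm(pixel_emb, axis=1, keepdims=True)
    sim = v @ a / tau  # per-pixel similarity to the audio query

    # -log p(positive pixel) averaged over the sounding-object pixels,
    # with every pixel in the frame acting as a candidate in the denominator
    log_denom = np.log(np.exp(sim).sum())
    return float(np.mean(log_denom - sim[pos_mask]))
```

With embeddings where the sounding-object pixels align with the audio, the loss is low; mislabelling the positives drives it up, which is the behaviour a contrastive objective needs to link audio to the correct visual region.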


