Leveraging Foundation models for Unsupervised Audio-Visual Segmentation

09/13/2023
by Swapnil Bhosale, et al.

Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level. Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion. This limits their scalability, since acquiring such cross-modality pixel-level labels is time-consuming and tedious. To overcome this obstacle, in this work we introduce unsupervised audio-visual segmentation, which needs neither task-specific data annotations nor model training. To tackle this newly proposed problem, we formulate a novel Cross-Modality Semantic Filtering (CMSF) approach that accurately associates the underlying audio-mask pairs by leveraging off-the-shelf multi-modal foundation models (e.g., detection [1], open-world segmentation [2] and multi-modal alignment [3]). Guiding the proposal generation by either audio or visual cues, we design two training-free variants: AT-GDINO-SAM and OWOD-BIND. Extensive experiments on the AVS-Bench dataset show that our unsupervised approach performs well in comparison to prior supervised counterparts across complex scenarios with multiple auditory objects. In particular, in situations where existing supervised AVS methods struggle with overlapping foreground objects, our models still accurately segment the overlapped auditory objects. Our code will be publicly released.
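
The abstract does not spell out how the filtering step operates. As a rough illustration only, one way a training-free filter could associate an audio clip with mask proposals is to embed both in a shared space and keep the proposals whose embeddings lie close to the audio embedding. The sketch below assumes such embeddings are already available (for example from a multi-modal alignment model); the function names, the cosine-similarity criterion, the threshold, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_proposals(
    audio_embedding: Sequence[float],
    proposal_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not a value from the paper
) -> List[int]:
    """Return indices of proposals whose similarity to the audio meets the threshold."""
    return [
        index
        for index, embedding in enumerate(proposal_embeddings)
        if cosine_similarity(audio_embedding, embedding) >= threshold
    ]


# Toy, hand-made embeddings standing in for real model outputs.
audio = [1.0, 0.0, 0.0]
proposals = [
    [0.9, 0.1, 0.0],  # nearly aligned with the audio embedding
    [0.0, 1.0, 0.0],  # orthogonal to the audio embedding
    [0.7, 0.7, 0.0],  # partially aligned
]

kept = filter_proposals(audio, proposals, threshold=0.8)
print(kept)
```

In this sketch the masks belonging to the kept indices would form the predicted segmentation. A stricter threshold keeps fewer proposals, so the choice trades missed sounding objects against silent objects being included.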


