AVSegFormer: Audio-Visual Segmentation with Transformer

07/03/2023
by Shengyi Gao, et al.

Combining audio and vision has long been a topic of interest in the multi-modal community. Recently, a new audio-visual segmentation (AVS) task was introduced, which aims to locate and segment the sounding objects in a given video. The task demands audio-driven pixel-level scene understanding for the first time, which poses significant challenges. In this paper, we propose AVSegFormer, a novel framework for AVS that leverages the transformer architecture. Specifically, we introduce audio queries and learnable queries into the transformer decoder, enabling the network to selectively attend to the visual features of interest. We also present an audio-visual mixer, which dynamically adjusts visual features by amplifying relevant spatial channels and suppressing irrelevant ones. Additionally, we devise an intermediate mask loss to strengthen the supervision of the decoder, encouraging the network to produce more accurate intermediate predictions. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.
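
To make the idea of audio-driven channel re-weighting concrete, here is a minimal sketch in plain Python. It is not the authors' mixer: the function names (`channel_gates`, `mix`), the single linear projection, and the `2 * sigmoid` gating range are all assumptions chosen for illustration. It shows only the general pattern of deriving one gate per visual channel from an audio embedding, so that gates above 1 amplify a channel and gates below 1 suppress it.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def channel_gates(audio, weights, bias):
    """Map an audio embedding to one gate per visual channel.

    Hypothetical design: a linear projection followed by 2 * sigmoid,
    so each gate lies in (0, 2). This is an assumption, not the paper's.
    """
    return [
        2.0 * sigmoid(sum(w * a for w, a in zip(row, audio)) + b)
        for row, b in zip(weights, bias)
    ]


def mix(visual, gates):
    """Scale every spatial position of channel c by gates[c].

    `visual` is a nested list with shape [channels][height][width].
    """
    return [
        [[g * v for v in row] for row in channel]
        for channel, g in zip(visual, gates)
    ]


# Toy input: a 2-dimensional audio embedding and a 2-channel, 1x2 feature map.
audio = [1.0, -1.0]
weights = [[4.0, 0.0], [0.0, 4.0]]  # illustrative projection, not learned
bias = [0.0, 0.0]
visual = [[[1.0, 1.0]], [[1.0, 1.0]]]

gates = channel_gates(audio, weights, bias)
mixed = mix(visual, gates)
```

In this toy setup the first channel's gate is above 1 and the second is well below 1, so the first channel is amplified and the second is suppressed. In a trained network the projection would be learned rather than fixed by hand.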

research
09/18/2023

Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Audio visual segmentation (AVS) aims to segment the sounding objects for...
research
09/18/2023

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

Audio-visual video segmentation (AVVS) aims to generate pixel-level maps...
research
07/25/2023

Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

The goal of the audio-visual segmentation (AVS) task is to segment the s...
research
12/22/2022

DDColor: Towards Photo-Realistic and Semantic-Aware Image Colorization via Dual Decoders

Automatic image colorization is a particularly challenging problem. Due ...
research
12/15/2021

Dense Video Captioning Using Unsupervised Semantic Information

We introduce a method to learn unsupervised semantic visual information ...
research
01/23/2020

Audiovisual SlowFast Networks for Video Recognition

We present Audiovisual SlowFast Networks, an architecture for integrated...
research
08/18/2023

Audio-Visual Glance Network for Efficient Video Recognition

Deep learning has made significant strides in video understanding tasks,...