EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation

08/08/2023
by Jiajun Chen, et al.

Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two closely related tasks; both aim to segment specific objects in video sequences according to user-provided expression prompts. However, due to the challenges of modeling representations for different modalities, contemporary methods struggle to balance interaction flexibility against high-precision localization and segmentation. In this paper, we address this problem from two perspectives: the aligned representation of audio and text, and the deep interaction among audio, text, and visual features. First, we propose a universal architecture, the Expression Prompt Collaboration Transformer (EPCFormer). Next, we propose an Expression Alignment (EA) mechanism for audio and text expressions: by applying contrastive learning to paired audio and text expressions, EPCFormer learns that audio and text expressions denoting the same object are semantically equivalent. Then, to facilitate deep interaction among audio, text, and video features, we introduce an Expression-Visual Attention (EVA) mechanism. By deeply exploiting the complementary cues between text and audio, knowledge about expression-prompted video object segmentation transfers seamlessly between the two tasks. Experiments on well-recognized benchmarks demonstrate that our universal EPCFormer attains state-of-the-art results on both tasks. The source code of EPCFormer will be made publicly available at https://github.com/lab206/EPCFormer.
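The abstract describes the EA mechanism only at a high level (contrastive learning that pulls together audio and text expressions referring to the same object). A minimal sketch of one plausible instantiation is a symmetric InfoNCE-style contrastive loss over paired audio and text expression embeddings; the function name, embedding shapes, and temperature value below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def expression_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    audio_emb, text_emb: (N, D) arrays where row i of each array is an
    embedding of the *same* referred object (the positive pair); all other
    rows serve as in-batch negatives. The temperature is an assumed value.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(a))              # positives sit on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Under this formulation, matched audio/text pairs are driven toward high cosine similarity while mismatched pairs in the batch are pushed apart, which is one standard way to realize the semantic equivalence the EA mechanism targets.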

