Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation

07/02/2023
by Meng Lan, et al.

Referring video object segmentation (RVOS) aims to segment the target object described by a language expression in a video sequence. Typical query-based methods process the video sequence in a frame-independent manner to reduce the high computational cost. However, this hurts performance, because without inter-frame interaction the model cannot capture temporal coherence or learn a spatio-temporal representation of the referred object. In addition, these methods directly adopt the raw, high-level sentence feature as the language queries for decoding the visual features; the weak correlation between visual and linguistic features makes the target information harder to decode and limits the performance of the model. In this paper, we propose a novel RVOS framework, dubbed IFIRVOS, to address these issues. Specifically, we design a plug-and-play inter-frame interaction module in the Transformer decoder to efficiently learn the spatio-temporal features of the referred object, so that the object information in the video sequence is decoded more precisely and the segmentation results are more accurate. Moreover, we devise a vision-language interaction module, placed before the multimodal Transformer, to enhance the correlation between the visual and linguistic features. This makes it easier for the language queries in the Transformer decoder to decode object information from the visual features and improves segmentation performance. Extensive experimental results on three benchmarks validate the superiority of IFIRVOS over state-of-the-art methods and the effectiveness of the proposed modules.

research
09/18/2023

Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Audio visual segmentation (AVS) aims to segment the sounding objects for...
research
07/26/2022

Multi-Attention Network for Compressed Video Referring Object Segmentation

Referring video object segmentation aims to segment the object referred ...
research
05/14/2021

Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation

Language-queried video actor segmentation aims to predict the pixel-leve...
research
03/30/2022

Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation

Referring video segmentation aims to segment the corresponding video obj...
research
06/02/2021

Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

Referring video object segmentation (RVOS) aims to segment video objects...
research
05/23/2023

Faster Video Moment Retrieval with Point-Level Supervision

Video Moment Retrieval (VMR) aims at retrieving the most relevant events...
research
10/06/2022

Video Referring Expression Comprehension via Transformer with Content-aware Query

Video Referring Expression Comprehension (REC) aims to localize a target...
