Multi-Attention Network for Compressed Video Referring Object Segmentation

07/26/2022
by Weidong Chen, et al.

Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation, which increases computation and storage requirements and ultimately slows down inference. This may hamper application in real-world scenarios with limited computing resources, such as autonomous cars and drones. To alleviate this problem, we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Besides the inherent difficulty of the video referring object segmentation task itself, obtaining discriminative representations from compressed video is also challenging. To address this problem, we propose a multi-attention network which consists of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representations from compressed data in three modalities, i.e., I-frame, Motion Vector and Residual. The query-based cross-modal Transformer first models the correlation between the linguistic and visual modalities, and the fused multi-modality features are then used to guide object queries to generate a content-aware dynamic kernel and to predict the final segmentation masks. Unlike previous works, we propose to learn just one kernel, which removes the complicated post mask-matching procedure of existing methods. Extensive experimental results on three challenging datasets show the effectiveness of our method compared against several state-of-the-art methods proposed for processing RGB data. Source code is available at: https://github.com/DexiangHong/MANet.
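
The single-kernel idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that one fused object query is mapped by a hypothetical linear head (`kernel_weight`, `kernel_bias`) to a 1x1 kernel plus an offset, which is then applied at every pixel of a visual feature map. The function names, shapes, and toy values are all assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import math

def linear(vec, weight, bias):
    """Compute W @ vec + b using plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def predict_mask(query, features, kernel_weight, kernel_bias, threshold=0.5):
    """Turn one object query into one dynamic 1x1 kernel and apply it per pixel.

    query:    D floats (a single object query; assumed already fused with text)
    features: C x H x W nested lists (visual feature map)
    kernel_weight, kernel_bias: hypothetical linear head mapping D -> C + 1
    """
    params = linear(query, kernel_weight, kernel_bias)
    kernel, offset = params[:-1], params[-1]  # C kernel weights, 1 offset
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            logit = offset + sum(kernel[c] * features[c][y][x]
                                 for c in range(channels))
            prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask

# Toy example: D = 2, C = 2, a 2 x 2 feature map.
query = [1.0, 0.0]
kernel_weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
kernel_bias = [0.0, 0.0, -0.5]
features = [
    [[1.0, 0.0], [0.0, 1.0]],  # channel 0 fires on the diagonal
    [[5.0, 5.0], [5.0, 5.0]],  # channel 1 is ignored by this kernel
]
mask = predict_mask(query, features, kernel_weight, kernel_bias)
```

Because only one kernel is produced, only one mask comes out, so in this sketch there is no step that matches several candidate masks against the expression.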


