Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples

09/05/2023
by Guanghui Li, et al.

Referring video object segmentation (RVOS), as a supervised learning task, relies on sufficient annotated data for a given scene. In more realistic scenarios, however, only minimal annotations are available for a new scene, which poses significant challenges to existing RVOS methods. With this in mind, we propose a simple yet effective model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture. The CMA module builds multimodal affinity from a few samples, so it quickly learns new semantic information and enables the model to adapt to different scenarios. Since the proposed method targets limited samples for new scenes, we generalize the problem as few-shot referring video object segmentation (FS-RVOS). To foster research in this direction, we build a new FS-RVOS benchmark based on currently available datasets. The benchmark covers a wide range of situations in order to simulate real-world scenarios as closely as possible. Extensive experiments show that our model adapts well to different scenarios with only a few samples, reaching state-of-the-art performance on the benchmark. On Mini-Ref-YouTube-VOS, our model achieves an average performance of 53.1 J and 54.8 F, which are 10% better than the baselines. Furthermore, we show impressive results of 77.7 J and 74.8 F on Mini-Ref-SAIL-VOS, which are significantly better than the baselines. Code is publicly available at https://github.com/hengliusky/Few_shot_RVOS.
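The abstract reports scores as J and F without defining them. Assuming they follow the usual video object segmentation convention, J is region similarity (the Jaccard index, i.e. intersection over union of predicted and ground-truth masks) and F is contour accuracy, both typically quoted as percentages. The sketch below illustrates only the J computation for a single frame; the function name and the toy masks are hypothetical and are not taken from the authors' code.

```python
def region_similarity(pred, gt):
    """Jaccard index (intersection over union) of two binary masks.

    Both masks are nested lists of 0/1 values with the same shape.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred_mask = [[0, 1, 1],
             [0, 1, 0]]
gt_mask = [[0, 1, 0],
           [0, 1, 1]]

j_score = region_similarity(pred_mask, gt_mask)
print(f"J = {100 * j_score:.1f}")
```

A per-video score would then be the mean of this value over frames, and a benchmark score the mean over videos; the exact averaging protocol used in the paper is not stated in the abstract.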

