Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

06/02/2021
by Chen Liang, et al.

Referring video object segmentation (RVOS) aims to segment video objects under the guidance of a natural language reference. Previous methods typically tackle RVOS by directly grounding the linguistic reference over the image lattice. Such a bottom-up strategy fails to explore object-level cues, which easily leads to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranks first in the CVPR 2021 Referring Youtube-VOS challenge.
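
The second stage described above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes each tracklet and the sentence have already been reduced to a small feature vector, uses a single self-attention pass with identity projections (no learned weights), and scores tracklets by their dot product with the attended sentence token. The names `self_attention` and `ground` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections), so nothing here is learned.
    """
    d = len(tokens[0])
    attended = []
    for q in tokens:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)]
        )
    return attended


def ground(tracklet_feats, sentence_feat):
    """Pick the tracklet that best matches the sentence.

    Tracklet tokens and the sentence token attend to each other jointly,
    so visual relations and cross-modal interaction share one pass.
    """
    mixed = self_attention(tracklet_feats + [sentence_feat])
    lang = mixed[-1]
    probs = softmax([dot(t, lang) for t in mixed[:-1]])
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy features (made up for illustration): three tracklets, one sentence.
tracklets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sentence = [0.0, 2.0, 0.0]
best, probs = ground(tracklets, sentence)
print(best, [round(p, 3) for p in probs])
```

In a full system the selected tracklet's per-frame masks would be returned as the segmentation result. Here only the selection step is shown.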

