Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation

05/14/2021
by Tianrui Hui, et al.

Language-queried video actor segmentation aims to predict the pixel-level mask of the actor performing the actions described by a natural language query in the target frames. Existing methods adopt 3D CNNs over the video clip as a general encoder to extract a mixed spatio-temporal feature for the target frame. Though 3D convolutions are amenable to recognizing which actor is performing the queried actions, they also inevitably introduce misaligned spatial information from adjacent frames, which confuses the features of the target frame and yields inaccurate segmentation. Therefore, we propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors. In the decoder, a Language-Guided Feature Selection (LGFS) module is proposed to flexibly integrate spatial and temporal features from the two encoders. We also propose a Cross-Modal Adaptive Modulation (CMAM) module to dynamically recombine spatial- and temporal-relevant linguistic features for multimodal feature interaction in each stage of the two encoders. Our method achieves new state-of-the-art performance on two popular benchmarks with less computational overhead than previous approaches.
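
The abstract does not say how LGFS combines the two feature streams, so the following is only a minimal sketch of one plausible reading: a per-channel sigmoid gate computed from the sentence feature, used to blend spatial and temporal features at a single pixel location. The function names (`language_gate`, `select_features`), the linear gate, and the convex-combination form are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Standard logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def language_gate(lang_feat: List[float],
                  weights: List[List[float]],
                  bias: List[float]) -> List[float]:
    """Hypothetical gate: one linear layer plus sigmoid over the sentence
    feature, producing one value in (0, 1) per visual channel."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, lang_feat)) + b)
        for row, b in zip(weights, bias)
    ]


def select_features(spatial: List[float],
                    temporal: List[float],
                    gate: List[float]) -> List[float]:
    """Assumed fusion rule: per-channel convex combination,
    gate * spatial + (1 - gate) * temporal."""
    return [g * s + (1.0 - g) * t for s, t, g in zip(spatial, temporal, gate)]


# Toy example: 2-dim sentence feature, 2 visual channels at one pixel.
lang_feat = [0.0, 0.0]
weights = [[0.0, 0.0], [0.0, 0.0]]   # zero weights give a neutral gate of 0.5
bias = [0.0, 0.0]

gate = language_gate(lang_feat, weights, bias)
fused = select_features([1.0, 3.0], [3.0, 1.0], gate)
```

With a neutral gate of 0.5 the fused feature is the plain average of the two streams; a gate near 1 would favor the 2D spatial encoder and a gate near 0 the 3D temporal encoder, which matches the abstract's description of language-dependent selection only in spirit.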


research
06/08/2022

Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation

Referring video object segmentation aims to predict foreground labels fo...
research
07/02/2023

Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation

Referring video object segmentation (RVOS) aims to segment the target ob...
research
03/30/2022

Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation

Referring video segmentation aims to segment the corresponding video obj...
research
03/20/2018

Actor and Action Video Segmentation from a Sentence

This paper strives for pixel-level segmentation of actors and their acti...
research
07/02/2022

Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding

Spatial-Temporal Video Grounding (STVG) is a challenging task which aims...
research
08/14/2023

Orthogonal Temporal Interpolation for Zero-Shot Video Recognition

Zero-shot video recognition (ZSVR) is a task that aims to recognize vide...
research
08/09/2023

Histogram-guided Video Colorization Structure with Spatial-Temporal Connection

Video colorization, aiming at obtaining colorful and plausible results f...
