Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation

06/08/2022
by Zihan Ding, et al.

Referring video object segmentation aims to predict foreground labels for objects referred to by natural language expressions in videos. Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatial-temporal features. However, these methods suffer from spatial misalignment or false distractors because spatial-temporal interaction is delayed and implicit, occurring only in the decoding phase. To tackle these limitations, we propose a Language-Bridged Duplex Transfer (LBDT) module, which uses language as an intermediary bridge to accomplish explicit and adaptive spatial-temporal interaction earlier, in the encoding phase. Concretely, cross-modal attention is performed among the temporal encoder, the referring words, and the spatial encoder to aggregate and transfer language-relevant motion and appearance information. In addition, we propose a Bilateral Channel Activation (BCA) module in the decoding phase to further denoise and highlight spatially and temporally consistent features via channel-wise activation. Extensive experiments show our method achieves new state-of-the-art performance on four popular benchmarks with 6.8 A2D Sentences and J-HMDB Sentences respectively, while consuming around 7x less computational overhead.
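To make the "language as a bridge" idea concrete, here is a minimal sketch in plain Python. It is not the authors' LBDT implementation: the function names (`attend`, `language_bridged_transfer`), the use of unprojected scaled dot-product attention, and the residual addition are all assumptions chosen for illustration, and the learned projections a real model would use are omitted. The sketch only shows the two-step flow the abstract describes: the referring words first aggregate language-relevant information from one encoder's features, and the other encoder's features then query those enriched words.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same vectors.

    Assumption: no learned query/key/value projections, for brevity.
    """
    dim = len(keys_values[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return out


def language_bridged_transfer(source_feats, word_feats, target_feats):
    """Hypothetical one-direction transfer: source -> words -> target."""
    # Step 1: each word aggregates language-relevant information
    # from the source encoder's features.
    bridged_words = attend(word_feats, source_feats)
    # Step 2: each target position queries the enriched words.
    transferred = attend(target_feats, bridged_words)
    # Assumption: fuse with a residual addition.
    return [
        [t + x for t, x in zip(t_vec, x_vec)]
        for t_vec, x_vec in zip(target_feats, transferred)
    ]


# Toy features: 3 temporal positions, 2 words, 2 spatial positions, dim 2.
temporal = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, -1.0]]
spatial = [[0.2, 0.1], [0.0, 0.3]]

# "Duplex" here means running the transfer in both directions.
spatial_out = language_bridged_transfer(temporal, words, spatial)
temporal_out = language_bridged_transfer(spatial, words, temporal)
```

In this sketch the output keeps the shape of the target features, so each direction could feed the next stage of its own encoder.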


