Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation

03/30/2022
by   Guang Feng, et al.

Referring video segmentation aims to segment the video object described by a language expression. To address this task, we first design a two-stream encoder that hierarchically extracts CNN-based visual features and transformer-based linguistic features, and insert a vision-language mutual guidance (VLMG) module into the encoder multiple times to promote hierarchical and progressive fusion of the multi-modal features. Compared with existing multi-modal fusion methods, this two-stream encoder takes the multi-granularity linguistic context into account and, with the help of VLMG, realizes deep interleaving between the modalities. To promote temporal alignment between frames, we further propose a language-guided multi-scale dynamic filtering (LMDF) module, which strengthens temporal coherence by using language-guided spatial-temporal features to generate a set of position-specific dynamic filters that update the features of the current frame more flexibly and effectively. Extensive experiments on four datasets verify the effectiveness of the proposed model.
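
The idea of position-specific dynamic filtering mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' LMDF module: the function names, the single linear projection that produces the kernels, the softmax normalization, the 3 x 3 kernel size, and the single scale are all assumptions made for illustration. The sketch only shows the general mechanism, in which a guidance feature map yields one small kernel per spatial position and each kernel is applied at its own location of the current-frame feature map.

```python
import numpy as np


def generate_kernels(guidance, weight):
    """Predict one flattened kernel per spatial position (hypothetical helper).

    guidance: (D, H, W) guidance features, e.g. language-guided features
    weight:   (k*k, D) projection matrix (an assumed, learned parameter)
    returns:  (H, W, k*k) kernels, softmax-normalized over the k*k taps
    """
    logits = np.einsum("kd,dhw->hwk", weight, guidance)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    expd = np.exp(logits)
    return expd / expd.sum(axis=-1, keepdims=True)


def position_specific_filter(feat, kernels, k=3):
    """Apply a different k x k kernel at every spatial position.

    feat:    (C, H, W) feature map of the current frame
    kernels: (H, W, k*k) one flattened kernel per position
    returns: (C, H, W) filtered feature map (zero padding at the borders)
    """
    _, height, width = feat.shape
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(feat.shape, dtype=float)
    for i in range(k):
        for j in range(k):
            # Shifted view of the input, weighted by the matching kernel tap.
            tap = kernels[:, :, i * k + j]
            out += padded[:, i:i + height, j:j + width] * tap
    return out
```

In this sketch the same per-position kernel is shared across all channels, which keeps the number of predicted weights small; a multi-scale variant would repeat the operation with several kernel sizes or dilations and merge the results.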


research
05/05/2021

Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation

Recently, referring image segmentation has aroused widespread interest. ...
research
06/16/2021

CMF: Cascaded Multi-model Fusion for Referring Image Segmentation

In this work, we address the task of referring image segmentation (RIS),...
research
06/18/2020

Language Guided Networks for Cross-modal Moment Retrieval

We address the challenging task of cross-modal moment retrieval, which a...
research
05/14/2021

Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation

Language-queried video actor segmentation aims to predict the pixel-leve...
research
07/25/2023

Spectrum-guided Multi-granularity Referring Video Object Segmentation

Current referring video object segmentation (R-VOS) techniques extract c...
research
07/02/2023

Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation

Referring video object segmentation (RVOS) aims to segment the target ob...
research
06/26/2023

Mutual Query Network for Multi-Modal Product Image Segmentation

Product image segmentation is vital in e-commerce. Most existing methods...
