LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation

06/14/2023
by Linfeng Yuan, et al.

Referring video object segmentation (RVOS) aims to segment the target instance referred to by a given text expression in a video clip. The text expression normally contains sophisticated descriptions of the instance's appearance, actions, and relations with others. It is therefore rather difficult for an RVOS model to capture all of these attributes in the video; in practice, the model often favours the action- and relation-related visual attributes of the instance. This can lead to incomplete or even incorrect mask predictions for the target instance. In this paper, we tackle this problem by extracting a subject-centric short text expression from the original long text expression. The short expression retains only the appearance-related information of the target instance, so we can use it to focus the model's attention on the instance's appearance. We let the model make joint predictions using both the long and short text expressions, and we introduce a long-short predictions intersection loss to align the joint predictions. Beyond this improvement on the linguistic side, we also introduce a forward-backward visual consistency loss, which uses optical flow to warp visual features between the annotated frames and their temporal neighbours and enforces consistency between them. We build our method on top of two state-of-the-art transformer-based pipelines for end-to-end training. Extensive experiments on the A2D-Sentences and JHMDB-Sentences datasets show impressive improvements from our method.
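The abstract does not give the form of the long-short predictions intersection loss, so the following is only a minimal sketch of one plausible reading: the soft intersection of the long-text and short-text mask predictions is encouraged to coincide with the long-text prediction, measured with a soft Dice overlap. The function names `soft_dice` and `long_short_intersection_loss`, the Dice formulation, and the use of flattened per-pixel probabilities are all assumptions made for illustration, not the authors' definition.

```python
def soft_dice(a, b, eps=1e-6):
    """Soft Dice overlap between two flattened probability masks."""
    inter = sum(x * y for x, y in zip(a, b))
    return (2.0 * inter + eps) / (sum(a) + sum(b) + eps)


def long_short_intersection_loss(p_long, p_short):
    """Hypothetical alignment loss (an assumption, not the paper's formula).

    p_long, p_short: flattened per-pixel foreground probabilities predicted
    from the long and the short text expression, respectively.
    The loss is low when the long-text mask lies inside the short-text mask.
    """
    if len(p_long) != len(p_short):
        raise ValueError("masks must have the same number of pixels")
    # Soft intersection: element-wise product of the two probability masks.
    inter = [x * y for x, y in zip(p_long, p_short)]
    # Penalise any part of the long-text mask missing from the intersection.
    return 1.0 - soft_dice(inter, p_long)


# Long-text mask fully contained in the short-text mask: loss close to 0.
aligned = long_short_intersection_loss([1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 0.0])
# Disjoint masks: loss close to 1.
disjoint = long_short_intersection_loss([1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0])
```

In a real training pipeline the same computation would run on tensors with gradients enabled; plain Python lists are used here only to keep the sketch self-contained.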


