Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation

09/21/2023
by   Ping Li, et al.

Referring Video Object Segmentation (RVOS) requires segmenting the object in a video referred to by a natural language query. Existing methods mainly rely on sophisticated pipelines to tackle this cross-modal task and do not explicitly model the object-level spatial context, which plays an important role in locating the referred object. Therefore, we propose an end-to-end RVOS framework built entirely upon transformers, termed Fully Transformer-Equipped Architecture (FTEA), which treats RVOS as a mask sequence learning problem and regards all objects in the video as candidate objects. Given a video clip and a text query, the encoder yields visual and textual features, and the corresponding pixel-level and word-level features are aligned in terms of semantic similarity. To capture the object-level spatial context, we develop the Stacked Transformer, which individually characterizes the visual appearance of each candidate object; each object's feature map is directly decoded into a binary mask sequence. Finally, the model finds the best matching between the mask sequences and the text query. In addition, to diversify the generated masks of the candidate objects, we impose a diversity loss on the model, capturing a more accurate mask of the referred object. Empirical studies show the superiority of the proposed method on three benchmarks: FTEA achieves 45.1% and 38.7% mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively, and 56.6 in terms of 𝒥&ℱ on Ref-YouTube-VOS (3975 videos and 7451 objects). In particular, compared to the best competing method, it has gains of 2.1% and 3.2% on the former two benchmarks, respectively, and a gain of 2.9 on the latter.
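The abstract mentions a diversity loss that pushes the candidate object masks apart, but does not spell out its form here. Below is a minimal sketch of one plausible formulation, assuming the K candidate masks are flattened per-pixel probability maps; the function name and the mean pairwise cosine-similarity penalty are illustrative assumptions, not the paper's actual definition.

```python
import numpy as np

def diversity_loss(masks):
    """Hypothetical diversity loss over K candidate masks.

    masks: (K, H*W) array of per-pixel foreground probabilities.
    Penalizes the mean pairwise cosine similarity between masks so
    that candidates cover different objects rather than collapsing
    onto a single one. Returns a scalar in [0, 1] for non-negative
    inputs (1 = all masks identical, 0 = all masks disjoint).
    """
    k = masks.shape[0]
    norms = np.linalg.norm(masks, axis=1, keepdims=True) + 1e-8
    unit = masks / norms                 # unit-normalized masks
    sim = unit @ unit.T                  # (K, K) cosine similarities
    off_diag = sim.sum() - np.trace(sim) # drop self-similarity terms
    return off_diag / (k * (k - 1))      # mean over ordered pairs
```

Minimizing this term alongside the segmentation loss would trade mask redundancy against accuracy; identical masks give a loss of 1, while non-overlapping masks give 0.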

Related research

11/29/2021 · End-to-End Referring Video Object Segmentation with Multimodal Transformers
The referring video object segmentation task (RVOS) involves segmentatio...

01/03/2022 · Language as Queries for Referring Video Object Segmentation
Referring video object segmentation (R-VOS) is an emerging cross-modal t...

11/30/2020 · End-to-End Video Instance Segmentation with Transformers
Video instance segmentation (VIS) is the task that requires simultaneous...

12/04/2022 · CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation
Referring image segmentation aims at localizing all pixels of the visual...

06/14/2023 · LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation
Referring video object segmentation (RVOS) aims to segment the target in...

03/19/2020 · Foldover Features for Dynamic Object Behavior Description in Microscopic Videos
Behavior description is conducive to the analysis of tiny objects, simil...

01/19/2017 · FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos
We propose an end-to-end learning framework for segmenting generic objec...
