Scalable Video Object Segmentation with Simplified Framework

08/19/2023
by   Qiangqiang Wu, et al.

Currently popular methods for video object segmentation (VOS) implement feature matching through several hand-crafted modules that perform feature extraction and matching separately. However, these hand-crafted designs empirically cause insufficient target interaction, which limits dynamic target-aware feature learning in VOS. To address these limitations, this paper presents a scalable Simplified VOS (SimVOS) framework that performs joint feature extraction and matching with a single transformer backbone. Specifically, SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features. This design enables SimVOS to learn better target-aware features for accurate mask prediction. More importantly, SimVOS can directly apply well-pretrained ViT backbones (e.g., MAE) to VOS, which bridges the gap between VOS and large-scale self-supervised pre-training. To achieve a better performance-speed trade-off, we further explore within-frame attention and propose a new token refinement module to improve running speed and reduce computational cost. Experimentally, SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks, i.e., DAVIS-2017 (88.0% J&F) and YouTube-VOS 2019 (84.2% J&F), without applying any synthetic video or BL30K pre-training used in previous VOS approaches.
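The idea of joint feature extraction and matching can be illustrated with a minimal sketch: reference and query tokens are concatenated and passed through one self-attention step, so the query-to-reference block of the attention map plays the role of matching while the same operation also updates the features. This is not the authors' implementation. The function name `joint_attention`, the single-head layout, the random projection weights, and the token counts are all assumptions chosen for illustration, and the reference mask encoding, within-frame attention, and token refinement mentioned in the abstract are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(ref_tokens, query_tokens, w_q, w_k, w_v):
    """One single-head self-attention step over concatenated tokens.

    Hypothetical helper: every token (reference or query) attends to
    every other token, so feature update and matching share one step.
    """
    tokens = np.concatenate([ref_tokens, query_tokens], axis=0)  # (Nr + Nq, D)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)      # (Nr + Nq, Nr + Nq)
    out = attn @ v
    n_ref = ref_tokens.shape[0]
    return out[:n_ref], out[n_ref:], attn


# Toy sizes, assumed for illustration only.
rng = np.random.default_rng(0)
dim, n_ref, n_query = 16, 6, 4
ref = rng.normal(size=(n_ref, dim))
query = rng.normal(size=(n_query, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

ref_out, query_out, attn = joint_attention(ref, query, w_q, w_k, w_v)

# Rows of query tokens, columns of reference tokens: the "matching" part.
query_to_ref = attn[n_ref:, :n_ref]
print(ref_out.shape, query_out.shape, query_to_ref.shape)
```

In a separate extract-then-match pipeline, the query features would be computed without seeing the reference at all; here the query tokens are already conditioned on the reference tokens inside the backbone step, which is the property the abstract describes as target-aware feature learning.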


