VITA: Video Instance Segmentation via Object Token Association

06/09/2022
by   Miran Heo, et al.

We introduce a novel paradigm for offline Video Instance Segmentation (VIS), based on the hypothesis that explicit object-oriented information can be a strong clue for understanding the context of the entire sequence. To this end, we propose VITA, a simple structure built on top of an off-the-shelf Transformer-based image instance segmentation model. Specifically, we use an image object detector as a means of distilling object-specific contexts into object tokens. VITA accomplishes video-level understanding by associating frame-level object tokens without using spatio-temporal backbone features. By effectively building relationships between objects using the condensed information, VITA achieves state-of-the-art results on VIS benchmarks with a ResNet-50 backbone: 49.8 AP and 45.7 AP on YouTube-VIS 2019 and 2021, respectively, and 19.6 AP on OVIS. Moreover, thanks to its object token-based structure that is disjoint from the backbone features, VITA shows several practical advantages that previous offline VIS methods have not explored: handling long and high-resolution videos with a common GPU, and freezing a frame-level detector trained on the image domain. Code will be made available at https://github.com/sukjunhwang/VITA.
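As a rough illustration of the idea of associating frame-level object tokens, the toy sketch below pools a flattened set of per-frame tokens into a few video-level tokens with single-head dot-product attention. It is not the authors' implementation: the randomly generated tokens stand in for a frame-level detector's output, and the names `cross_attend`, `make_tokens`, and all sizes (`T`, `N`, `C`, `K`, `H`, `W`) are assumptions chosen for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Single-head dot-product attention: each query pools over all tokens."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim)
            for t in tokens
        ]
        w = softmax(scores)
        pooled = [
            sum(w[j] * tokens[j][d] for j in range(len(tokens)))
            for d in range(dim)
        ]
        outputs.append(pooled)
        weights.append(w)
    return outputs, weights


def make_tokens(rng, count, dim):
    """Hypothetical helper: random vectors standing in for learned tokens."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
T, N, C = 4, 3, 8   # frames, object tokens per frame, channels (assumed toy sizes)
K = 2               # number of video-level queries (assumed)

# Stand-in for the object tokens a frame-level detector would emit per frame.
frame_tokens = [make_tokens(rng, N, C) for _ in range(T)]

# Flatten over time: the video-level stage sees only T * N tokens.
all_tokens = [tok for frame in frame_tokens for tok in frame]
video_queries = make_tokens(rng, K, C)

video_tokens, attn = cross_attend(video_queries, all_tokens)

# Element counts: token set vs. a dense per-frame feature map (H, W assumed).
H, W = 45, 80
token_elems = T * N * C
dense_elems = T * H * W * C
print(len(all_tokens), len(video_tokens), token_elems, dense_elems)
```

In this sketch the input to the association step grows with the number of object tokens per frame rather than with the spatial resolution, which is one way to read the abstract's claim about handling long and high-resolution videos.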

research
12/15/2021

SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation

In this work, we present SeqFormer, a frustratingly simple model for vid...
research
06/07/2023

RefineVIS: Video Instance Segmentation with Temporal Attention Refinement

We introduce a novel framework called RefineVIS for Video Instance Segme...
research
04/18/2022

Temporally Efficient Vision Transformer for Video Instance Segmentation

Recently vision transformer has achieved tremendous success on image-lev...
research
11/16/2022

A Generalized Framework for Video Instance Segmentation

Recently, handling long videos of complex and occluded sequences has eme...
research
03/12/2022

One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out

Many video instance segmentation (VIS) methods partition a video sequenc...
research
07/18/2023

OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation

Referring video object segmentation (RVOS) aims at segmenting an object ...
research
09/19/2022

A Simple and Powerful Global Optimization for Unsupervised Video Object Segmentation

We propose a simple, yet powerful approach for unsupervised object segme...
