Temporally Efficient Vision Transformer for Video Instance Segmentation

04/18/2022
by Shusheng Yang, et al.

Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Unlike previous transformer-based VIS methods, TeViT is nearly convolution-free: it consists of a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build a one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results while maintaining high inference speed, e.g., 46.6 AP at 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.
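
The abstract does not spell out how the messenger shift works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each frame carries a small set of "messenger" tokens and that "shift" means rolling groups of those tokens along the temporal axis so that each frame sees tokens from its neighbours. The function name `shift_messengers`, the contiguous grouping, the `offsets` values, and the wrap-around at clip boundaries are all hypothetical choices made for illustration.

```python
def shift_messengers(messengers, offsets=(-1, 0, 1)):
    """Roll groups of messenger tokens along the time axis (illustrative sketch).

    messengers: list of T frames, each a list of M tokens (any objects).
    offsets: hypothetical per-group temporal offsets; the M tokens are split
             into len(offsets) contiguous groups, one offset per group.
    Returns a new T x M nested list; the input is not modified.
    """
    num_frames = len(messengers)
    num_tokens = len(messengers[0])
    num_groups = len(offsets)

    shifted = [list(frame) for frame in messengers]
    for m in range(num_tokens):
        # Map token index m to its group, then look up that group's offset.
        offset = offsets[m * num_groups // num_tokens]
        for t in range(num_frames):
            # Frame t receives token m from frame (t - offset).
            # Wrapping around the clip is an assumption of this sketch.
            shifted[t][m] = messengers[(t - offset) % num_frames][m]
    return shifted


# Three frames (a, b, c), three messenger tokens per frame.
clip = [["a0", "a1", "a2"],
        ["b0", "b1", "b2"],
        ["c0", "c1", "c2"]]
mixed = shift_messengers(clip)
```

Under these assumptions the operation only moves tokens between frames and has no learnable weights, which is in keeping with the abstract's description of the mechanism as nearly parameter-free.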

research
12/15/2021

SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation

In this work, we present SeqFormer, a frustratingly simple model for vid...
research
06/09/2022

VITA: Video Instance Segmentation via Object Token Association

We introduce a novel paradigm for offline Video Instance Segmentation (V...
research
09/21/2023

TCOVIS: Temporally Consistent Online Video Instance Segmentation

In recent years, significant progress has been made in video instance se...
research
07/05/2022

OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers

We present OSFormer, the first one-stage transformer framework for camou...
research
11/16/2022

A Generalized Framework for Video Instance Segmentation

Recently, handling long videos of complex and occluded sequences has eme...
research
08/17/2022

Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation

We propose Video-TransUNet, a deep architecture for instance segmentatio...
research
03/12/2022

One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out

Many video instance segmentation (VIS) methods partition a video sequenc...
