InstanceFormer: An Online Video Instance Segmentation Framework

08/22/2022
by Rajat Koner, et al.

Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity caused by full spatio-temporal attention limit their use in real-life applications such as processing lengthy videos. In this paper, we propose an efficient single-stage, transformer-based online VIS framework named InstanceFormer, which is especially suitable for long and challenging videos. We propose three novel components to model short-term and long-term dependencies and temporal coherence. First, we propagate the representation, location, and semantic information of prior instances to model short-term changes. Second, we propose a novel memory cross-attention in the decoder, which allows the network to look into earlier instances within a certain temporal window. Finally, we employ a temporal contrastive loss to impose coherence in the representation of an instance across all frames. Memory attention and temporal coherence are particularly beneficial to long-range dependency modeling, including challenging scenarios like occlusion. The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets. Most importantly, InstanceFormer surpasses offline approaches on challenging and long datasets such as YouTube-VIS-2021 and OVIS. Code is available at https://github.com/rajatkoner08/InstanceFormer.
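
To make the temporal contrastive idea more concrete, here is a minimal sketch of an InfoNCE-style loss over instance embeddings from different frames. It is not the authors' implementation: the function names, the use of cosine similarity, the temperature value, and the toy vectors are all assumptions chosen for illustration. The loss is small when an instance's embedding in the current frame stays close to its embedding in an earlier frame and far from embeddings of other instances.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (illustrative, hypothetical helper).

    anchor:    embedding of an instance in the current frame
    positive:  embedding of the same instance in an earlier frame
    negatives: embeddings of other instances
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D embeddings (made up for this example).
# Case 1: the instance keeps a consistent representation across frames.
coherent = temporal_contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
# Case 2: the instance's earlier embedding looks like a different instance.
incoherent = temporal_contrastive_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
```

In this toy setup `coherent` is much smaller than `incoherent`, which is the behavior a coherence-encouraging loss should have. A real model would compute this over batches of learned embeddings with a tensor library rather than Python lists.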


research
09/21/2023

TCOVIS: Temporally Consistent Online Video Instance Segmentation

In recent years, significant progress has been made in video instance se...
research
04/04/2022

TALLFormer: Temporal Action Localization with Long-memory Transformer

Most modern approaches in temporal action localization divide this probl...
research
04/12/2023

Adaptive Human Matting for Dynamic Videos

The most recent efforts in video matting have focused on eliminating tri...
research
05/26/2023

GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation

Recent trends in Video Instance Segmentation (VIS) have seen a growing r...
research
08/15/2021

Exploring Temporal Coherence for More General Video Face Forgery Detection

Although current face manipulation techniques achieve impressive perform...
research
08/15/2023

Memory-and-Anticipation Transformer for Online Action Understanding

Most existing forecasting systems are memory-based methods, which attemp...
research
03/21/2023

3D Mitochondria Instance Segmentation with Spatio-Temporal Transformers

Accurate 3D mitochondria instance segmentation in electron microscopy (E...
