Video Instance Segmentation using Inter-Frame Communication Transformers

06/07/2021
by   Sukjun Hwang, et al.

We propose a novel end-to-end solution for video instance segmentation (VIS) based on transformers. Recently, the per-clip pipeline has shown superior performance over per-frame methods by leveraging richer information from multiple frames. However, previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communication, which limits their practicality. In this work, we propose Inter-frame Communication Transformers (IFC), which significantly reduce the overhead of passing information between frames by efficiently encoding the context within the input clip. Specifically, we propose to use concise memory tokens as a means of conveying information as well as summarizing each frame's scene. The features of each frame are enriched and correlated with other frames through the exchange of information between the precisely encoded memory tokens. We validate our method on the latest benchmark sets and achieve state-of-the-art performance (AP 44.6 on the YouTube-VIS 2019 val set using offline inference) while maintaining a considerably fast runtime (89.4 FPS). Our method can also be applied to near-online inference, processing a video in real time with only a small delay. The code will be made available.
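To give a rough sense of why routing inter-frame communication through a few memory tokens can cut overhead, the sketch below counts query-key pairs for two attention layouts. It is only an illustration under stated assumptions, not the authors' implementation: it assumes each frame attends over its own features plus its memory tokens, and that only the memory tokens of all frames attend to each other. The function names and the example sizes (5 frames, 100 feature tokens per frame, 8 memory tokens per frame) are hypothetical and do not come from the paper.

```python
def full_clip_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token in the clip attends to every other token."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def memory_token_attention_pairs(
    num_frames: int, tokens_per_frame: int, memory_tokens: int
) -> int:
    """Query-key pairs under the assumed two-step layout.

    Step 1 (assumed): within each frame, frame features and that frame's
    memory tokens attend to each other.
    Step 2 (assumed): the memory tokens of all frames attend to each other,
    which is the only place information crosses frame boundaries.
    """
    per_frame = (tokens_per_frame + memory_tokens) ** 2
    intra_frame = num_frames * per_frame
    inter_frame = (num_frames * memory_tokens) ** 2
    return intra_frame + inter_frame


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    full = full_clip_attention_pairs(num_frames=5, tokens_per_frame=100)
    mem = memory_token_attention_pairs(
        num_frames=5, tokens_per_frame=100, memory_tokens=8
    )
    print(f"full clip attention pairs:    {full}")
    print(f"memory-token attention pairs: {mem}")
    print(f"reduction factor:             {full / mem:.2f}")
```

Under these assumptions the full-clip count grows quadratically with the number of frames, while the memory-token count grows linearly in the per-frame term and quadratically only in the much smaller number of memory tokens.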

research
10/30/2022

Two-Level Temporal Relation Model for Online Video Instance Segmentation

In Video Instance Segmentation (VIS), current approaches either focus on...
research
04/13/2021

Crossover Learning for Fast Online Video Instance Segmentation

Modeling temporal visual context across frames is critical for video ins...
research
11/30/2020

End-to-End Video Instance Segmentation with Transformers

Video instance segmentation (VIS) is the task that requires simultaneous...
research
08/29/2023

NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation

Until recently, the Video Instance Segmentation (VIS) community operated...
research
07/22/2022

DeVIS: Making Deformable Transformers Work for Video Instance Segmentation

Video Instance Segmentation (VIS) jointly tackles multi-object detection...
research
12/08/2021

VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation

For online video instance segmentation (VIS), fully utilizing the inform...
research
08/18/2021

End-to-End License Plate Recognition Pipeline for Real-time Low Resource Video Based Applications

Automatic License Plate Recognition systems aim to provide an end-to-end...
