Time-Space Transformers for Video Panoptic Segmentation

10/07/2022
by   Andra Petrovai, et al.

We propose a novel solution for the task of video panoptic segmentation that simultaneously predicts pixel-level semantic and instance segmentation and generates clip-level instance tracks. Our network, named VPS-Transformer, has a hybrid architecture based on the state-of-the-art panoptic segmentation network Panoptic-DeepLab: it combines a convolutional architecture for single-frame panoptic segmentation with a novel video module based on an instantiation of the pure Transformer block. The Transformer, equipped with attention mechanisms, models spatio-temporal relations between the backbone output features of the current and past frames to produce more accurate and consistent panoptic estimates. Because the pure Transformer block introduces a large computational overhead when processing high-resolution images, we propose a few design changes for more efficient computation. We study how to aggregate information more effectively over the space-time volume, and we compare several variants of the Transformer block with different attention schemes. Extensive experiments on the Cityscapes-VPS dataset demonstrate that our best model improves the temporal consistency and video panoptic quality by a margin of 2.2%, with little extra computation.
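To make the idea of attending over a space-time volume concrete, the following is a minimal NumPy sketch of single-head attention in which queries come from the current frame's features and keys and values come from the current and past frames together. It is an illustration under stated assumptions, not the VPS-Transformer module itself: the function name `space_time_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all tensor sizes are hypothetical, and the abstract does not specify the attention scheme used in the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def space_time_attention(current, past, w_q, w_k, w_v):
    """Hypothetical single-head attention over a space-time feature volume.

    current: (H, W, C) features of the current frame.
    past:    (T, H, W, C) features of T past frames.
    Returns attended features of shape (H, W, D) and the attention weights.
    """
    h, w, c = current.shape
    # Stack current and past frames into one volume of tokens.
    volume = np.concatenate([current[None], past], axis=0)  # (T+1, H, W, C)
    tokens = volume.reshape(-1, c)                          # ((T+1)*H*W, C)

    queries = current.reshape(h * w, c) @ w_q               # (H*W, D)
    keys = tokens @ w_k                                     # ((T+1)*H*W, D)
    values = tokens @ w_v                                   # ((T+1)*H*W, D)

    # Scaled dot-product attention: each current-frame position attends
    # to every position in every frame of the volume.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)
    return (weights @ values).reshape(h, w, -1), weights


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
H, W, C, T, D = 4, 6, 16, 2, 8
current = rng.standard_normal((H, W, C))
past = rng.standard_normal((T, H, W, C))
w_q, w_k, w_v = (rng.standard_normal((C, D)) for _ in range(3))

out, weights = space_time_attention(current, past, w_q, w_k, w_v)
print(out.shape)      # (4, 6, 8)
print(weights.shape)  # (24, 72)
```

The attention matrix in this sketch has H·W rows and (T+1)·H·W columns, so its size grows quadratically with spatial resolution. That growth is the kind of overhead the abstract refers to when it motivates design changes for high-resolution inputs.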


