Video Mask Transfiner for High-Quality Video Instance Segmentation

07/28/2022
by Lei Ke, et al.

While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. Moreover, the predicted segmentations often fluctuate over time, suggesting that temporal consistency cues are neglected or not fully utilized. In this paper, we set out to tackle these issues, with the aim of achieving highly detailed and more temporally stable mask predictions for VIS. We first propose the Video Mask Transfiner (VMT) method, which leverages fine-grained high-resolution features through a highly efficient video transformer structure. Our VMT detects and groups sparse error-prone spatio-temporal regions of each tracklet in the video segment, which are then refined using both local and instance-level cues. Second, we identify the coarse boundary annotations of the popular YouTube-VIS dataset as a major limiting factor. Based on our VMT architecture, we therefore design an automated annotation refinement approach using iterative training and self-correction. To benchmark high-quality mask predictions for VIS, we introduce the HQ-YTVIS dataset, consisting of a manually re-annotated test set and our automatically refined training data. We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as on the YouTube-VIS, OVIS and BDD100K MOTS benchmarks. Experimental results clearly demonstrate the effectiveness of our method in segmenting complex and dynamic objects by capturing precise details.
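
To make the idea of "sparse error-prone spatio-temporal regions" more concrete, the following is a minimal sketch and not the method from the paper. It assumes, purely for illustration, that a pixel is error-prone when its predicted foreground probability is uncertain, and it gathers such pixels across all frames of one tracklet into a single list of (t, y, x) coordinates. The function name `select_uncertain_points` and the thresholds `low` and `high` are hypothetical; the abstract does not say how VMT actually detects these regions.

```python
import numpy as np


def select_uncertain_points(probs, low=0.3, high=0.7):
    """Collect uncertain pixels of one tracklet across a video segment.

    probs: array of shape (T, H, W) holding per-frame foreground
           probabilities for a single tracklet (assumed input format).
    Returns an (N, 3) integer array of (t, y, x) coordinates whose
    probability lies strictly between `low` and `high`.
    """
    probs = np.asarray(probs, dtype=float)
    if probs.ndim != 3:
        raise ValueError("expected an array of shape (T, H, W)")
    # Uncertainty is only a stand-in for "error-prone" in this sketch.
    uncertain = (probs > low) & (probs < high)
    # argwhere yields one row per selected pixel, grouped over all frames.
    return np.argwhere(uncertain)


# Toy tracklet: 2 frames of 4x4 pixels with a confident 2x2 object.
probs = np.zeros((2, 4, 4))
probs[:, 1:3, 1:3] = 0.95
probs[0, 1, 1] = 0.5  # uncertain pixel in frame 0
probs[1, 2, 3] = 0.4  # uncertain pixel in frame 1

points = select_uncertain_points(probs)
sparsity = len(points) / probs.size  # fraction of pixels selected
```

In this toy case only two of the 32 pixels are selected. That illustrates why restricting refinement to a sparse set of points can make it affordable to work with high-resolution features.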

