SimpleClick: Interactive Image Segmentation with Simple Vision Transformers

10/20/2022
by   Qin Liu, et al.

Click-based interactive image segmentation aims at extracting objects with a limited number of user clicks. A hierarchical backbone is the de facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to be a foundation model that can be finetuned for the downstream task without redesigning a hierarchical backbone for pretraining. Although this design is simple and has been proven effective, it has not yet been explored for interactive segmentation. To fill this gap, we propose the first plain-backbone method for interactive segmentation, termed SimpleClick due to its simplicity in architecture. With the plain backbone pretrained as a masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance without bells and whistles. Remarkably, our method achieves 4.15 NoC@90 on SBD, improving 21.8% over the previous best result. Extensive evaluation on medical images highlights the generalizability of our method. We also provide a detailed computational analysis for our method, highlighting its suitability as a practical annotation tool.
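The NoC@90 figure quoted in the abstract is the average number of clicks needed to reach an IoU of 0.90 with the ground-truth mask. The following is a minimal sketch of how such a metric can be computed, not the authors' evaluation code. The function names are hypothetical, and the cap of 20 clicks per image is an assumption based on common practice in the interactive segmentation literature.

```python
from typing import Sequence


def iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def clicks_to_reach(ious_per_click: Sequence[float],
                    threshold: float = 0.90,
                    max_clicks: int = 20) -> int:
    """Number of clicks until the IoU first reaches the threshold.

    ious_per_click[i] is the IoU after click i + 1. If the threshold is
    never reached within max_clicks, the image is charged max_clicks
    (assumed convention).
    """
    for n, value in enumerate(ious_per_click[:max_clicks], start=1):
        if value >= threshold:
            return n
    return max_clicks


def noc(all_ious: Sequence[Sequence[float]],
        threshold: float = 0.90,
        max_clicks: int = 20) -> float:
    """Mean number of clicks over a dataset (NoC@90 when threshold is 0.90)."""
    counts = [clicks_to_reach(s, threshold, max_clicks) for s in all_ious]
    return sum(counts) / len(counts)
```

Under these assumptions, a lower NoC means fewer user clicks are needed per object, so a drop in NoC@90 corresponds to less annotation effort.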


research
12/21/2021

iSegFormer: Interactive Image Segmentation with Transformers

We propose iSegFormer, a novel transformer-based approach for interactiv...
research
07/05/2023

Interactive Image Segmentation with Cross-Modality Vision Transformers

Interactive image segmentation aims to segment the target from the backg...
research
09/20/2021

EdgeFlow: Achieving Practical Interactive Segmentation with Edge-Guided Flow

High-quality training data play a key role in image segmentation tasks. ...
research
03/29/2023

Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation

Transformers have shown great success in medical image segmentation. How...
research
09/08/2021

Panoptic SegFormer

We present Panoptic SegFormer, a general framework for end-to-end panopt...
research
07/06/2023

A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task

In recent years large model trained on huge amount of cross-modality dat...
research
05/24/2023

ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers

Recently, plain vision Transformers (ViTs) have shown impressive perform...
