ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining

by Dezhi Peng, et al.

Scene text removal (STR) aims at replacing text strokes in natural scenes with visually coherent backgrounds. Recent STR approaches rely on iterative refinements or explicit text masks, resulting in higher complexity and sensitivity to the accuracy of text localization. Moreover, most existing STR methods utilize convolutional neural networks (CNNs) for feature representation while the potential of vision Transformers (ViTs) remains largely unexplored. In this paper, we propose a simple-yet-effective ViT-based text eraser, dubbed ViTEraser. Following a concise encoder-decoder framework, different types of ViTs can be easily integrated into ViTEraser to enhance the long-range dependencies and global reasoning. Specifically, the encoder hierarchically maps the input image into the hidden space through ViT blocks and patch embedding layers, while the decoder gradually upsamples the hidden features to the text-erased image with ViT blocks and patch splitting layers. As ViTEraser implicitly integrates text localization and inpainting, we propose a novel end-to-end pretraining method, termed SegMIM, which focuses the encoder and decoder on the text box segmentation and masked image modeling tasks, respectively. To verify the effectiveness of the proposed methods, we comprehensively explore the architecture, pretraining, and scalability of the ViT-based encoder-decoder for STR, which provides deep insights into the application of ViT to STR. Experimental results demonstrate that ViTEraser with SegMIM achieves state-of-the-art performance on STR by a substantial margin. Furthermore, the extended experiment on tampered scene text detection demonstrates the generality of ViTEraser to other tasks. We believe this paper can inspire more research on ViT-based STR approaches. Code will be available at
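
To make the hierarchical encoder-decoder described above more concrete, the following is a minimal sketch that only traces feature-map shapes through such a network. It is not the authors' implementation: the number of stages, the patch sizes, and the channel widths are illustrative assumptions that do not come from the abstract, and the function names (`patch_embed`, `patch_split`, `trace_eraser_shapes`) are hypothetical. ViT blocks are omitted because they are assumed to preserve the shape of their input.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def patch_embed(shape: Shape, patch: int, out_channels: int) -> Shape:
    """Encoder step: group patch x patch pixels into one token (downsampling)."""
    height, width, _ = shape
    if height % patch or width % patch:
        raise ValueError("spatial size must be divisible by the patch size")
    return (height // patch, width // patch, out_channels)


def patch_split(shape: Shape, factor: int, out_channels: int) -> Shape:
    """Decoder step: split each token into factor x factor tokens (upsampling)."""
    height, width, _ = shape
    return (height * factor, width * factor, out_channels)


def trace_eraser_shapes(
    height: int,
    width: int,
    patch_sizes: Tuple[int, ...] = (4, 2, 2, 2),          # assumed, not from the paper
    enc_channels: Tuple[int, ...] = (96, 192, 384, 768),  # assumed, not from the paper
) -> List[Tuple[str, Shape]]:
    """Return (stage name, shape) pairs for a mirrored encoder-decoder."""
    shape: Shape = (height, width, 3)
    trace = [("input", shape)]

    # Encoder: each stage lowers the resolution and widens the channels.
    for i, (patch, channels) in enumerate(zip(patch_sizes, enc_channels), 1):
        shape = patch_embed(shape, patch, channels)
        trace.append((f"encoder stage {i}", shape))

    # Decoder: mirror the encoder and end with a 3-channel text-erased image.
    dec_channels = tuple(reversed(enc_channels[:-1])) + (3,)
    for i, (factor, channels) in enumerate(zip(reversed(patch_sizes), dec_channels), 1):
        shape = patch_split(shape, factor, channels)
        trace.append((f"decoder stage {i}", shape))

    return trace


if __name__ == "__main__":
    for name, stage_shape in trace_eraser_shapes(512, 512):
        print(f"{name:16s} {stage_shape}")
```

Under these assumed settings, the decoder output has the same height, width, and channel count as the input image, which is the property an image-to-image text eraser needs.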




