MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

06/01/2022
by Pengyuan Lyu, et al.

In this paper, we present a model pretraining technique, named MaskOCR, for text recognition. Our text recognition architecture is an encoder-decoder transformer: the encoder extracts patch-level representations, and the decoder recognizes the text from those representations. Our approach pretrains both the encoder and the decoder sequentially. (i) We pretrain the encoder in a self-supervised manner over a large set of unlabeled real text images. We adopt the masked image modeling approach, which has proven effective for general images, expecting that the learned representations capture the semantics of the text. (ii) We pretrain the decoder over a large set of synthesized text images in a supervised manner, and enhance its language modeling capability by randomly masking some of the text-image patches occupied by characters before they are fed to the encoder, and accordingly masking the corresponding representations fed to the decoder. Experiments show that the proposed MaskOCR approach achieves superior results on benchmark datasets covering both Chinese and English text images.

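The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. The PyTorch snippet below is an assumption-laden illustration, not the authors' implementation: the module names (PatchEmbed, pixel_head, char_head), the patch size, masking ratios, and model dimensions are all hypothetical, and the training losses are only indicated in comments. Stage 1 masks random patches of unlabeled text-line images and trains the encoder to reconstruct them; stage 2 masks character-covering patches of synthetic labeled images and trains a light decoder to predict the character sequence.

```python
# Illustrative sketch of masked encoder-decoder pretraining for text recognition.
# All names, sizes, and masking ratios are assumptions, not MaskOCR's actual setup.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split a text-line image into non-overlapping patches and embed them."""
    def __init__(self, patch_size=(32, 8), dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, 32, 128)
        x = self.proj(x)                         # (B, dim, 1, 16)
        return x.flatten(2).transpose(1, 2)      # (B, 16, dim)


def random_mask(tokens, ratio):
    """Keep a random subset of patch tokens, dropping `ratio` of them."""
    B, N, D = tokens.shape
    keep = max(1, int(N * (1 - ratio)))
    idx = torch.rand(B, N).argsort(dim=1)[:, :keep]          # indices of kept patches
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))


dim, vocab_size, max_len = 256, 100, 25
patch_embed = PatchEmbed(dim=dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=6)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
pixel_head = nn.Linear(dim, 3 * 32 * 8)      # stage 1: regress patch pixels
char_head = nn.Linear(dim, vocab_size)       # stage 2: classify characters
char_queries = nn.Parameter(torch.randn(max_len, dim))

imgs = torch.randn(4, 3, 32, 128)            # toy batch of text-line crops
tokens = patch_embed(imgs)

# Stage 1 (self-supervised, unlabeled real images): mask a large fraction of
# patches and reconstruct them; a full MIM setup would insert mask tokens and
# compute an MSE loss on the masked positions only.
visible = random_mask(tokens, ratio=0.6)
recon = pixel_head(encoder(visible))

# Stage 2 (supervised, synthetic labeled images): mask some character-covering
# patches before the encoder, then let the decoder predict the character
# sequence from learnable queries; the loss is cross-entropy vs. the labels.
memory = encoder(random_mask(tokens, ratio=0.3))
queries = char_queries.unsqueeze(0).expand(imgs.size(0), -1, -1)
logits = char_head(decoder(queries, memory))  # (B, max_len, vocab_size)
```
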
Related research

05/13/2022  ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation
    We present ViT5, a pretrained Transformer-based encoder-decoder model fo...

04/25/2021  StegaPos: Preventing Crops and Splices with Imperceptible Positional Encodings
    We present a model for differentiating between images that are authentic...

06/21/2023  ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining
    Scene text removal (STR) aims at replacing text strokes in natural scene...

02/02/2023  Energy-Inspired Self-Supervised Pretraining for Vision Models
    Motivated by the fact that forward and backward passes of a deep network...

11/07/2022  Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining
    We present a strong object detector with encoder-decoder pretraining and...

05/22/2020  SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition
    Scene text recognition is a hot research topic in computer vision. Recen...

11/24/2021  Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages
    This paper presents a novel training method for end-to-end scene text re...
