MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

by Pengyuan Lyu, et al.

In this paper, we present MaskOCR, a model pretraining technique for text recognition. Our text recognition architecture is an encoder-decoder transformer: the encoder extracts patch-level representations, and the decoder recognizes the text from those representations. Our approach pretrains the encoder and the decoder sequentially. (i) We pretrain the encoder in a self-supervised manner over a large set of unlabeled real text images. We adopt masked image modeling, which has proven effective for general images, expecting the representations to capture semantics. (ii) We pretrain the decoder in a supervised manner over a large set of synthesized text images, and we enhance its language modeling capability by randomly masking some of the character-occupied patches of the text image input to the encoder and, accordingly, the corresponding representations input to the decoder. Experiments show that the proposed MaskOCR approach achieves superior results on benchmark datasets covering both Chinese and English text images.
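
The random patch masking mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the patch width of 4 pixels, and the mask ratio of 0.5 are all hypothetical values chosen for illustration, and the abstract does not state how masked patches are represented to the encoder. The sketch only shows how a text image's width might be split into vertical patches and how a random subset of them might be selected for masking.

```python
import random


def split_into_patches(width, patch_width):
    """Return (start, end) column ranges of vertical patches covering the width.

    Assumption: the image width is an exact multiple of patch_width.
    """
    if width % patch_width != 0:
        raise ValueError("width must be a multiple of patch_width")
    return [(x, x + patch_width) for x in range(0, width, patch_width)]


def random_patch_mask(num_patches, mask_ratio, rng):
    """Choose patches to hide uniformly at random; True means masked."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def unmasked_patches(patches, mask):
    """Keep only the patches that were not selected for masking."""
    return [p for p, m in zip(patches, mask) if not m]


# Hypothetical settings: a 128-pixel-wide text image, 4-pixel-wide patches,
# and half of the patches masked. None of these values come from the abstract.
rng = random.Random(0)
patches = split_into_patches(width=128, patch_width=4)
mask = random_patch_mask(len(patches), mask_ratio=0.5, rng=rng)
remaining = unmasked_patches(patches, mask)
```

Passing an explicit `random.Random` instance keeps the selection reproducible, which is convenient when comparing masking settings.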




ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation

We present ViT5, a pretrained Transformer-based encoder-decoder model fo...

StegaPos: Preventing Crops and Splices with Imperceptible Positional Encodings

We present a model for differentiating between images that are authentic...

Energy-Inspired Self-Supervised Pretraining for Vision Models

Motivated by the fact that forward and backward passes of a deep network...

Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

We present a strong object detector with encoder-decoder pretraining and...

Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages

This paper presents a novel training method for end-to-end scene text re...

cycle text2face: cycle text-to-face gan via transformers

Text-to-face is a subset of text-to-image that requires more complex arch...

Channel-Aware Pretraining of Joint Encoder-Decoder Self-Supervised Model for Telephonic-Speech ASR

This paper proposes a novel technique to obtain better downstream ASR pe...