Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting

by Chuhui Xue, et al.

Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which should intuitively help Optical Character Recognition (OCR) tasks given the rich visual and textual information in scene text images. However, these methods cannot cope well with OCR tasks because of the difficulty of both instance-level text encoding and image-text pair acquisition (i.e., images and the texts captured in them). This paper presents a weakly supervised pre-training method that acquires effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction between textual and visual features to learn effective scene text representations. By learning textual features, the pre-trained model can attend to texts in images with character awareness. Moreover, these designs enable learning from weakly annotated texts (i.e., partial texts in images without text bounding boxes), which greatly mitigates the data annotation constraint. Experiments over the weakly annotated images in ICDAR2019-LSVT show that our pre-trained model improves F-score by +2.5 text detection and spotting networks, respectively. In addition, the proposed method outperforms existing pre-training techniques consistently across multiple public datasets (e.g., +3.2

