Vision-Language Pre-Training for Boosting Scene Text Detectors

04/29/2022
by Sibo Song, et al.

Recently, vision-language joint representation learning has proven highly effective in various scenarios. In this paper, we adapt vision-language joint learning to scene text detection, a task that intrinsically involves interaction between vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training in order to enhance the performance of scene text detectors. To this end, we devise a pre-training architecture with an image encoder, a text encoder and a cross-modal encoder, together with three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM) and word-in-image prediction (WIP). The pre-trained model produces more informative representations with richer semantics, which can readily benefit existing scene text detectors (such as EAST and PSENet) in the downstream text detection task. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm significantly improves the performance of various representative text detectors and outperforms previous pre-training approaches. The code and pre-trained models will be publicly released.
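
The abstract names image-text contrastive learning (ITC) as one of the pretext tasks but does not give its formulation. As a rough illustration only, the sketch below shows a generic symmetric contrastive (InfoNCE-style) loss over a batch of paired image and text embeddings. The function names, the temperature value of 0.07 and the use of cosine similarity are assumptions made for this example, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k]).

    Hypothetical helper for illustration: the temperature and the exact
    loss form are assumptions, not the paper's published settings.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text: each image should pick out its own text among all texts.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image: the same objective on the transposed similarity matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", itc_loss(images, matched))
    print("swapped pairs:", itc_loss(images, swapped))
```

In this toy setup the loss is close to zero when each image embedding lines up with its own text embedding and grows large when the pairs are swapped, which is the behaviour a contrastive pretext task relies on to pull matching image and text representations together.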

research
03/14/2021

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

Vision-language pre-training (VLP) on large-scale image-text pairs has r...
research
06/27/2022

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

Multi-modal document pre-trained models have proven to be very effective...
research
06/30/2020

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph

We propose a knowledge-enhanced approach, ERNIE-ViL, to learn joint repr...
research
03/18/2022

Graph-Text Multi-Modal Pre-training for Medical Representation Learning

As the volume of Electronic Health Records (EHR) sharply grows, there ha...
research
03/08/2022

Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting

Recently, Vision-Language Pre-training (VLP) techniques have greatly ben...
research
05/18/2023

CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training

Improving text representation has attracted much attention to achieve ex...
research
07/04/2022

Explore Faster Localization Learning For Scene Text Detection

Generally pre-training and long-time training computation are necessary ...
