Turning a CLIP Model into a Scene Text Detector

02/28/2023
by Wenwen Yu, et al.

The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown great potential in various downstream tasks by leveraging its pretrained vision and language knowledge. Scene text, which contains rich textual and visual information, has an inherent connection with a model like CLIP. Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection. In contrast to these works, this paper proposes a new method, termed TCM, focusing on Turning the CLIP Model directly into a text detector without the pretraining process. We demonstrate the advantages of the proposed TCM as follows: (1) The underlying principle of our framework can be applied to improve existing scene text detectors. (2) It facilitates the few-shot training capability of existing methods, e.g., by using 10% of the labeled data, we significantly improve the performance of the baseline method by an average of 22% in F-measure on 4 benchmarks. (3) By turning the CLIP model into existing scene text detection methods, we further achieve promising domain adaptation ability. The code will be publicly released at https://github.com/wenwenyu/TCM.
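The core intuition behind adapting CLIP for detection is that its joint vision-language embedding space lets per-patch image features be scored against a language prompt describing text regions. The sketch below illustrates that idea in minimal form; the function name, shapes, prompt, and temperature are illustrative assumptions standing in for real CLIP encoder outputs, not the TCM implementation.

```python
import numpy as np

def text_region_heatmap(patch_embeds, prompt_embed, temperature=0.07):
    """Score image patches against a text-prompt embedding (CLIP-style).

    patch_embeds: (H*W, D) per-patch visual features from a CLIP-like image encoder.
    prompt_embed: (D,) embedding of a prompt such as "an image of text".
    Returns per-patch scores in (0, 1), higher = more text-like.
    """
    # L2-normalize so the dot product is cosine similarity, as in CLIP
    p = patch_embeds / np.linalg.norm(patch_embeds, axis=-1, keepdims=True)
    t = prompt_embed / np.linalg.norm(prompt_embed)
    sims = p @ t
    # Temperature-scaled sigmoid turns similarities into a soft text/no-text map
    return 1.0 / (1.0 + np.exp(-sims / temperature))

# Toy usage with random features standing in for real encoder outputs
rng = np.random.default_rng(0)
patches = rng.standard_normal((14 * 14, 512))  # 14x14 grid of 512-d patch features
prompt = rng.standard_normal(512)
heatmap = text_region_heatmap(patches, prompt).reshape(14, 14)
print(heatmap.shape)  # (14, 14)
```

A downstream detection head would then refine such a coarse heatmap into instance-level text boxes or polygons.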


