Turning a CLIP Model into a Scene Text Spotter

08/21/2023
by Wenwen Yu, et al.

We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks, transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image- and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings, thereby refining text regions. Our Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt generation, enabling offline computations and improving performance. FastTCM-CR50 offers several advantages: 1) It can enhance existing text detectors and spotters, improving performance by an average of 1.7 [...], respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an average improvement of 0.2 [...], along with a 48.5 [...]. 3) [...] training capabilities. Utilizing only 10 [...] improves performance by an average of 26.5 [...] spotting tasks, respectively. 4) It consistently enhances performance on out-of-distribution text detection and spotting datasets, particularly the NightTime-ArT subset from ICDAR2019-ArT and the DOTA dataset for oriented object detection. The code is available at https://github.com/wenwenyu/TCM.
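To make the idea of matching image embeddings against a text embedding more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `cosine_similarity` and `matching_score_map`, the use of cosine similarity followed by a sigmoid, and the temperature value of 0.07 are all assumptions chosen for illustration. The sketch only shows how a per-pixel score map could be derived from one text embedding and a grid of pixel embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_score_map(pixel_embeddings, text_embedding, temperature=0.07):
    """Hypothetical helper: score each pixel embedding against one text embedding.

    pixel_embeddings is a 2D grid (list of rows) of embedding vectors.
    The temperature is an assumed value, not one taken from the paper.
    Returns a grid of scores in (0, 1); higher means closer to the text embedding.
    """
    score_map = []
    for row in pixel_embeddings:
        score_row = []
        for pixel in row:
            similarity = cosine_similarity(pixel, text_embedding)
            # Sigmoid of the temperature-scaled similarity.
            score_row.append(1.0 / (1.0 + math.exp(-similarity / temperature)))
        score_map.append(score_row)
    return score_map


if __name__ == "__main__":
    # Toy 2x2 "feature map" with 2-dimensional embeddings.
    pixels = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[-1.0, 0.0], [2.0, 0.0]],
    ]
    text = [1.0, 0.0]
    for row in matching_score_map(pixels, text):
        print(row)
```

In this toy setup, pixels whose embeddings point in the same direction as the text embedding score close to 1, orthogonal ones score 0.5, and opposite ones score close to 0. A real backbone would compute such a map over learned features and feed it to a downstream detector.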


