CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

05/23/2023
by   Shuai Zhao, et al.

Pre-trained vision-language (VL) models are the de facto foundation models for various downstream tasks. However, this trend has not extended to the field of scene text recognition (STR), despite the potential of CLIP to serve as a powerful scene text reader. CLIP can robustly identify both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. With these merits, we introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. CLIP4STR achieves new state-of-the-art performance on 11 STR benchmarks. Additionally, we provide a comprehensive empirical study to enhance the understanding of how CLIP adapts to STR. We believe our method establishes a simple but strong baseline for future STR research with VL models.
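The dual predict-and-refine decoding scheme can be illustrated with a toy sketch: a visual branch decodes an initial string from per-position character scores, and a cross-modal branch refines it using textual knowledge. The encoders, the lexicon-overlap refinement rule, and all names below are hypothetical stand-ins for illustration, not the authors' implementation (which uses CLIP's learned encoders and an autoregressive decoder).

```python
# Toy sketch of a dual predict-and-refine decoding scheme.
# Hypothetical stand-ins for CLIP4STR's branches; not the paper's code.

def visual_branch(char_scores):
    """Visual branch: greedily pick the top-scoring character per position."""
    return "".join(max(pos, key=pos.get) for pos in char_scores)

def cross_modal_branch(initial_prediction, lexicon):
    """Cross-modal branch (toy): refine the visual guess with text
    knowledge, here a simple lexicon lookup by character overlap."""
    if initial_prediction in lexicon:
        return initial_prediction
    def overlap(word):
        return sum(a == b for a, b in zip(word, initial_prediction))
    return max(lexicon, key=overlap)

# Per-position character scores, as a visual encoder might emit them.
char_scores = [
    {"h": 0.9, "n": 0.1},
    {"e": 0.8},
    {"1": 0.6, "l": 0.4},  # visually ambiguous: "1" vs "l"
    {"l": 0.9},
    {"o": 0.7},
]
lexicon = ["hello", "world"]

initial = visual_branch(char_scores)             # visually plausible but wrong
refined = cross_modal_branch(initial, lexicon)   # fixed by text semantics
print(initial, "->", refined)                    # he1lo -> hello
```

The key design point survives even in this toy form: the visual branch alone can be fooled by visually similar glyphs ("1" vs "l"), while the second pass exploits language priors to resolve the ambiguity.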

