Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition

05/09/2023
by   Boqiang Zhang, et al.

Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task. However, because they lack perception of linguistic knowledge and information, recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized in this paper as the linguistic insensitive drift (LID) problem; (2) the visual features are suboptimal for recognition in some vision-missing cases (e.g., occlusion). To address these issues, we propose a Linguistic Perception Vision model (LPV), which explores the linguistic capability of the vision model for accurate text recognition. To alleviate the LID problem, we introduce a Cascade Position Attention (CPA) mechanism that obtains high-quality, accurate attention maps through step-wise optimization and linguistic information mining. Furthermore, a Global Linguistic Reconstruction Module (GLRM) is proposed to improve the representation of visual features by perceiving linguistic information in the visual space, which gradually converts visual features into semantically rich ones during the cascade process. Unlike previous methods, our method obtains SOTA results while keeping complexity low (92.4% accuracy with only 8.11M parameters). Code is available at https://github.com/CyrilSterling/LPV.
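As a rough illustration of the cascade idea described above, the following is a minimal sketch and not the authors' implementation. It assumes that each character position has a query vector, that every stage computes scaled dot-product attention over the visual tokens, and that the next stage's query is the positional query plus the previous stage's attended feature. That update rule, the function names, and the dimensions are all assumptions made for illustration; the GLRM is omitted entirely.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, features):
    """Scaled dot-product attention: one query per character position."""
    d = len(features[0])
    attn_maps, glimpses = [], []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d)
                  for f in features]
        weights = softmax(scores)
        # Weighted sum of visual tokens for this character position.
        glimpse = [sum(w * f[k] for w, f in zip(weights, features))
                   for k in range(d)]
        attn_maps.append(weights)
        glimpses.append(glimpse)
    return attn_maps, glimpses


def cascade_position_attention(pos_queries, features, num_stages=3):
    """Hypothetical cascade: each stage refines the queries used by the next."""
    queries = pos_queries
    all_maps = []
    glimpses = []
    for _ in range(num_stages):
        attn_maps, glimpses = attend(queries, features)
        all_maps.append(attn_maps)
        # Assumed update rule: positional query + previous attended feature.
        queries = [[p + g for p, g in zip(pq, gl)]
                   for pq, gl in zip(pos_queries, glimpses)]
    return all_maps, glimpses


random.seed(0)
# 16 visual tokens and 5 character slots, both of dimension 8 (toy sizes).
features = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
pos_queries = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
maps, glimpses = cascade_position_attention(pos_queries, features)
```

Here `maps` holds one set of attention maps per stage, so the maps from successive stages can be compared to see how the queries shift once they carry information from earlier predictions.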


