TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance

11/16/2021
by   Yue Tao, et al.

Scene text recognition (STR) is an important bridge between images and text and has attracted abundant research attention. While convolutional neural networks (CNNs) have achieved remarkable progress on this task, most existing works need an extra context modeling module to help the CNN capture global dependencies, compensating for its inductive bias and strengthening the relationships between text features. Recently, the transformer has been proposed as a promising network for global context modeling through its self-attention mechanism, but one of its main shortcomings when applied to recognition is efficiency. We propose a 1-D split to address the complexity challenge and replace the CNN with a transformer encoder, reducing the need for a context modeling module. Furthermore, recent methods use a frozen initial embedding to guide the decoder in decoding the features to text, which leads to a loss of accuracy. We propose instead a learnable initial embedding, learned from the transformer encoder, that adapts to different input images. Bringing these together, we introduce a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG), composed of three stages: transformation, feature extraction, and prediction. Extensive experiments show that our approach achieves state-of-the-art performance on text recognition benchmarks.
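The efficiency argument for a 1-D split can be illustrated with a small sketch. The abstract does not give the split's details, so everything concrete here is an assumption: the 32x100 input size, the strip width of 4, and the helper names `split_1d` and `split_2d` are hypothetical and are not taken from the paper. The sketch only shows that cutting an image into full-height vertical strips yields far fewer tokens than a square-patch grid, which matters because self-attention cost grows with the square of the token count.

```python
import numpy as np


def split_1d(image, strip_width):
    """Cut an (H, W, C) image into full-height vertical strips, one token each.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    h, w, c = image.shape
    if w % strip_width != 0:
        raise ValueError("width must be divisible by strip_width")
    n = w // strip_width
    # (H, W, C) -> (H, n, strip_width, C) -> (n, H, strip_width, C)
    strips = image.reshape(h, n, strip_width, c).transpose(1, 0, 2, 3)
    # Flatten each strip into a single token vector.
    return strips.reshape(n, h * strip_width * c)


def split_2d(image, patch):
    """Cut an (H, W, C) image into square patches, as in a 2-D patch grid."""
    h, w, c = image.shape
    if h % patch != 0 or w % patch != 0:
        raise ValueError("height and width must be divisible by patch")
    gh, gw = h // patch, w // patch
    # (H, W, C) -> (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch * patch * c)


# Assumed input size and strip width, chosen only for the comparison.
image = np.zeros((32, 100, 3), dtype=np.float32)
tokens_1d = split_1d(image, 4)
tokens_2d = split_2d(image, 4)

# Number of token pairs that one self-attention layer has to score.
pairs_1d = tokens_1d.shape[0] ** 2
pairs_2d = tokens_2d.shape[0] ** 2
print(tokens_1d.shape, tokens_2d.shape, pairs_1d, pairs_2d)
```

Under these assumed sizes the 1-D split produces 25 tokens against 200 for the 4x4 grid, so each attention layer scores 625 token pairs rather than 40,000. The trade-off is that each token is a longer vector covering a whole column of the image.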


research
09/13/2022

DMTNet: Dynamic Multi-scale Network for Dual-pixel Images Defocus Deblurring with Transformer

Recent works achieve excellent results in defocus deblurring task based ...
research
03/01/2023

The style transformer with common knowledge optimization for image-text retrieval

Image-text retrieval which associates different modalities has drawn bro...
research
03/28/2023

X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance

Text-driven 3D stylization is a complex and crucial task in the fields o...
research
05/27/2020

SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition

Arbitrary text appearance poses a great challenge in scene text recognit...
research
09/28/2022

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

Multimodal transformer exhibits high capacity and flexibility to align i...
research
11/09/2022

Pure Transformer with Integrated Experts for Scene Text Recognition

Scene text recognition (STR) involves the task of reading text in croppe...
research
09/23/2020

Hamming OCR: A Locality Sensitive Hashing Neural Network for Scene Text Recognition

Recently, inspired by Transformer, self-attention-based scene text recog...
