Vision Transformer for Fast and Efficient Scene Text Recognition

05/18/2021
by   Rowel Atienza, et al.

Scene text recognition (STR) enables computers to read text in natural scenes such as object labels, road signs and instructions. STR helps machines make informed decisions such as what object to pick, which direction to go, and what the next step is. In the body of work on STR, the focus has always been on recognition accuracy. There is little emphasis placed on speed and computational efficiency, which are equally important, especially for energy-constrained mobile machines. In this paper we propose ViTSTR, an STR with a simple single-stage model architecture built on a compute- and parameter-efficient vision transformer (ViT). On a comparable strong baseline method such as TRBA with accuracy of 84.3% [...] accuracy of 82.6% [...] 43.4% [...] achieves 80.3% [...] requiring only 10.9% [...] augmentation, our base ViTSTR outperforms TRBA at 85.2% ([...] augmentation) at 2.3x the speed but requires 73.2% [...] more FLOPS. In terms of trade-offs, nearly all ViTSTR configurations are at or near the frontiers to maximize accuracy, speed and computational efficiency all at the same time.
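The claim that configurations sit "at or near the frontiers" can be read as a Pareto-frontier statement: a configuration is on the frontier if no other configuration is at least as good on accuracy, speed and compute while being strictly better on at least one of them. The following is a minimal sketch of that check, not the authors' evaluation code. The `Config` type, the model names and every number in `candidates` are hypothetical placeholders chosen for illustration, not results from the paper.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # higher is better (percent)
    speed: float     # higher is better (e.g. images per second)
    gflops: float    # lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least_as_good = (
        a.accuracy >= b.accuracy and a.speed >= b.speed and a.gflops <= b.gflops
    )
    strictly_better = (
        a.accuracy > b.accuracy or a.speed > b.speed or a.gflops < b.gflops
    )
    return at_least_as_good and strictly_better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Placeholder values for illustration only; these are NOT results from the paper.
candidates = [
    Config("model_a", accuracy=80.0, speed=100.0, gflops=1.0),
    Config("model_b", accuracy=85.0, speed=60.0, gflops=4.0),
    Config("model_c", accuracy=79.0, speed=90.0, gflops=2.0),  # dominated by model_a
    Config("model_d", accuracy=84.0, speed=40.0, gflops=5.0),  # dominated by model_b
]

frontier_names = [c.name for c in pareto_frontier(candidates)]
print(frontier_names)  # ['model_a', 'model_b']
```

Under this reading, a small fast model and a large accurate model can both be on the frontier, since neither beats the other on every axis.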


