Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition

02/28/2023
by   Xueming Yan, et al.

While vision transformers have substantially improved performance on image-based tasks, little work has applied transformers to multilingual scene text recognition, owing to the complex visual appearance of multilingual text. To fill this gap, this paper proposes an augmented transformer architecture with n-grams embedding and cross-language rectification (TANGER). TANGER consists of a primary transformer with single-patch embeddings of visual images and a supplementary transformer with adaptive n-grams embeddings. The supplementary transformer aims to flexibly explore potential correlations between neighbouring visual patches, which is essential for feature extraction from multilingual scene texts. Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring. Extensive comparative studies are conducted on four widely used benchmark datasets as well as a new multilingual scene text dataset containing Indonesian, English, and Chinese text collected from tourism scenes in Indonesia. Our experimental results demonstrate that TANGER considerably outperforms state-of-the-art methods, especially in handling complex multilingual scene texts.
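
The abstract does not spell out how the adaptive n-grams embeddings are computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that an n-gram embedding is the average of n neighbouring patch vectors and that "adaptive" means mixing several values of n with softmax weights. The names `ngram_pool`, `adaptive_ngram_embedding`, and `gate_scores` are hypothetical and exist purely for illustration.

```python
import math


def ngram_pool(patches, n):
    """Average each run of n neighbouring patch vectors (stride 1).

    Assumption: windows that run past the end reuse the last patch,
    so the output has as many vectors as the input.
    """
    length = len(patches)
    dim = len(patches[0])
    pooled = []
    for i in range(length):
        window = [patches[min(i + k, length - 1)] for k in range(n)]
        pooled.append([sum(vec[d] for vec in window) / n for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_ngram_embedding(patches, gate_scores):
    """Mix 1-gram, 2-gram, ... pooled embeddings with softmax weights.

    gate_scores[j] is a hypothetical score for n = j + 1; in a trained
    model these would be learned, here they are supplied by the caller.
    """
    weights = softmax(gate_scores)
    pooled = [ngram_pool(patches, n) for n in range(1, len(gate_scores) + 1)]
    length, dim = len(patches), len(patches[0])
    return [
        [sum(w * p[i][d] for w, p in zip(weights, pooled)) for d in range(dim)]
        for i in range(length)
    ]


# Three toy 2-dimensional "patch embeddings".
patches = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
mixed = adaptive_ngram_embedding(patches, gate_scores=[0.0, 0.0])
```

With equal gate scores, the result is the midpoint between the single-patch sequence and its 2-gram average. In this reading, the single-patch sequence would feed the primary transformer and the mixed sequence the supplementary one; how TANGER actually combines the two streams is described in the full paper rather than here.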


research
05/18/2023

mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences

We present our work on developing a multilingual, efficient text-to-text...
research
09/14/2022

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Effective scaling and a flexible task interface enable large language mo...
research
08/24/2021

Are the Multilingual Models Better? Improving Czech Sentiment with Transformers

In this paper, we aim at improving Czech sentiment with transformer-base...
research
11/09/2022

Masked Vision-Language Transformers for Scene Text Recognition

Scene text recognition (STR) enables computers to recognize and read the...
research
08/08/2023

Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

Most existing cross-modal retrieval methods employ two-stream encoders w...
research
05/31/2023

FEED PETs: Further Experimentation and Expansion on the Disambiguation of Potentially Euphemistic Terms

Transformers have been shown to work well for the task of English euphem...
research
07/23/2014

Joint Energy-based Detection and Classificationon of Multilingual Text Lines

This paper proposes a new hierarchical MDL-based model for a joint detec...
