DeepAI AI Chat
Log In Sign Up

On Vocabulary Reliance in Scene Text Recognition

by   Zhaoyi Wan, et al.
University of Rochester

The pursuit of high performance on public benchmarks has been the driving force for research in scene text recognition, and notable progress has been achieved. However, a close investigation reveals a startling fact that the state-of-the-art methods perform well on images with words within vocabulary but generalize poorly to images with words outside vocabulary. We call this phenomenon "vocabulary reliance". In this paper, we establish an analytical framework to conduct an in-depth study on the problem of vocabulary reliance in scene text recognition. Key findings include: (1) Vocabulary reliance is ubiquitous, i.e., all existing algorithms more or less exhibit such characteristic; (2) Attention-based decoders prove weak in generalizing to words outside vocabulary and segmentation-based decoders perform well in utilizing visual features; (3) Context modeling is highly coupled with the prediction layers. These findings provide new insights and can benefit future research in scene text recognition. Furthermore, we propose a simple yet effective mutual learning strategy to allow models of two families (attention-based and segmentation-based) to learn collaboratively. This remedy alleviates the problem of vocabulary reliance and improves the overall scene text recognition performance.


1st Place Solution to ECCV 2022 Challenge on Out of Vocabulary Scene Text Understanding: Cropped Word Recognition

This report presents our winner solution to ECCV 2022 challenge on Out-o...

Fine-Grained Address Segmentation for Attention-Based Variable-Degree Prefetching

Machine learning algorithms have shown potential to improve prefetching ...

Decoupling Visual-Semantic Feature Learning for Robust Scene Text Recognition

Semantic information has been proved effective in scene text recognition...

Vision-Language Adaptive Mutual Decoder for OOV-STR

Recent works have shown huge success of deep learning models for common ...

Improving Structured Text Recognition with Regular Expression Biasing

We study the problem of recognizing structured text, i.e. text that foll...

Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Detecting actions in untrimmed videos should not be limited to a small, ...