
Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition

by   Ayan Kumar Bhunia, et al.

Although text recognition has evolved significantly over the years, state-of-the-art (SOTA) models still struggle in in-the-wild scenarios because of complex backgrounds, varying fonts, uncontrolled illumination, distortions and other artefacts. This is because such models depend solely on visual information for text recognition and therefore lack semantic reasoning capabilities. In this paper, we argue that semantic information plays a role complementary to visual information. More specifically, we exploit semantic information by proposing a multi-stage, multi-scale attentional decoder that performs joint visual-semantic reasoning. Our novelty lies in the intuition that, for text recognition, the prediction should be refined in a stage-wise manner. Our key contribution is therefore the design of a stage-wise unrolling attentional decoder in which the non-differentiability introduced by discretely predicted character labels must be bypassed for end-to-end training. While the first stage predicts using visual features, subsequent stages refine that prediction using joint visual-semantic information. Additionally, we introduce multi-scale 2D attention, along with dense and residual connections between stages, to handle varying character sizes, improve performance and speed up convergence during training. Experimental results show that our approach outperforms existing SOTA methods by a considerable margin.
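
The abstract does not say how the non-differentiability of discrete character predictions is bypassed. One widely used option is a Gumbel-softmax relaxation, in which each stage passes a soft, nearly one-hot distribution over characters to the next stage instead of a hard label. The sketch below is only an illustration under that assumption: the function names, the toy logits and the embedding table are hypothetical and are not taken from the paper.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Sample a soft distribution over character classes (assumed relaxation)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(probs, embedding_table):
    """Probability-weighted mix of character embeddings.

    A later stage could consume this vector as its semantic input; unlike an
    argmax lookup, it is a smooth function of the probabilities.
    """
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy example: 3 character classes, 2-dimensional embeddings (hypothetical values).
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
probs = gumbel_softmax(logits, temperature=0.5, rng=rng)
semantic_input = soft_embedding(probs, embeddings)
```

Lower temperatures push `probs` closer to a one-hot vector, so the soft embedding approaches the embedding of the single most likely character while remaining differentiable with respect to the logits.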




Towards Accurate Scene Text Recognition with Semantic Reasoning Networks

Scene text image contains two levels of contents: visual texture and sem...

Decoupling Visual-Semantic Feature Learning for Robust Scene Text Recognition

Semantic information has been proved effective in scene text recognition...

SPGNet: Semantic Prediction Guidance for Scene Parsing

Multi-scale context module and single-stage encoder-decoder structure ar...

Enhancing Indic Handwritten Text Recognition Using Global Semantic Information

Handwritten Text Recognition (HTR) is more interesting and challenging t...

Visual Semantic Information Pursuit: A Survey

Visual semantic information comprises two important parts: the meaning o...

Multi-scale Self-calibrated Network for Image Light Source Transfer

Image light source transfer (LLST), as the most challenging task in the ...

Accessibility and Trajectory-Based Text Characterization

Several complex systems are characterized by presenting intricate charac...