A Feasible Framework for Arbitrary-Shaped Scene Text Recognition

by   Jinjin Zhang, et al.

Deep learning based methods have achieved surprising progress in Scene Text Recognition (STR), one of the classic problems in computer vision. In this paper, we propose a feasible framework for multi-lingual arbitrary-shaped STR, including instance segmentation based text detection and a language model based attention mechanism for text recognition. Our STR algorithm not only recognizes Latin and Non-Latin characters, but also supports arbitrary-shaped text recognition. Our method won the championship on the Scene Text Spotting Task (Latin Only, Latin and Chinese) of the ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text Competition. Code is available at https://github.com/zhang0jhon/AttentionOCR.


1 Introduction

Deep learning has brought a significant revolution in computer vision and machine learning in recent years. Since AlexNet won the championship on ImageNet image classification in 2012, an increasing number of researchers have paid attention to deep learning and its wide applications, such as computer vision, natural language processing and speech recognition. With the rapid development of deep learning, tremendous improvement has been achieved in various research areas. For instance, Batch Normalization [19] and ResNet [16] make it possible to train deeper neural networks stably and alleviate the vanishing and exploding gradient problems. The R-CNN based object detection series [13, 12, 40] significantly improves mean Average Precision (mAP), and Mask R-CNN [15] extends R-CNN based detection methods to instance segmentation. Attention mechanisms learned in a semi-supervised manner have demonstrated their effectiveness in both computer vision [35, 55, 18] and natural language processing [2, 31, 52]. Moreover, Neural Architecture Search (NAS) [58, 27] can automatically search for network architectures in various areas, leading to models such as EfficientNet [49] and EfficientDet [50].

Figure 1: Challenges in STR. Images from ICDAR2019-ArT [5] show the STR challenges in real-world scenes including various languages and shapes.

STR, one of the most extensively used techniques, benefits a lot from the boom of deep learning. Nevertheless, there are still several problems in realistic scenarios, as shown in Fig. 1. Firstly, various text regions including horizontal, vertical and curved text require a high-quality detector to handle arbitrary-shaped text detection. Secondly, it is difficult for a general model to recognize text regions with a variety of languages and shapes.

Figure 2: STR architecture. Our algorithm first conducts text segmentation with Cascade Mask R-CNN and then recognizes the text region with an attention based sequence transduction method.

In this paper, we deem STR a cross-domain problem including both object detection and language model based sequence transduction instead of simple text detection and classification, which benefits from the recent achievements in computer vision and natural language processing. Instead of individually concentrating on scene text detection or text recognition, we propose a feasible framework for multi-lingual arbitrary-shaped scene text spotting [5]. Our algorithm adopts a general instance segmentation method for robust arbitrary-shaped text detection, and simultaneously takes context information into consideration for text recognition by an attention based Word2vec [33] method. Furthermore, word embedding based arbitrary-shaped text recognition guarantees the convergence of the end-to-end soft attention mechanism via a weakly supervised method. In brief, we present a universal and robust proposal for real-world STR by combining instance segmentation and attention based sequence transduction methods.

The main contributions of this paper consist of three aspects as follows:

1. We propose a feasible framework for STR that is capable of multi-lingual arbitrary-shaped text detection and recognition, and the results in several STR datasets demonstrate the effectiveness and robustness of our algorithm.

2. We propose an attention based text recognition method which is able to recognize irregular text in Latin or Non-Latin characters with a single model and which can be combined with language model based methods.

3. Our algorithm is easy to extend with state-of-the-art attention mechanisms for high-level applications such as text based visual question answering, semantic parsing, etc.

2 Related Work

In this section, we present the recent work relevant to STR, including object detection and attention mechanisms for alignment in sequence transduction. Our STR framework is designed with reference to the previous work introduced below, aiming to handle STR in a general way which combines object detection and attention mechanism techniques.

2.1 Text Detection

Object detection is one of the fundamental research areas in computer vision and has attracted extensive attention with its remarkable progress in the last several years. There have been many efficient object detector architectures, such as one-stage detectors like YOLO [37, 38, 39], SSD [28] and RetinaNet [26], as well as anchor-free detectors [23, 9, 51]. Nevertheless, the majority of one-stage detectors suffer from worse performance on small objects and lower precision compared with two-stage object detectors. Two-stage detectors perform better on realistic scenes in spite of higher computational requirements. Further improved architectures have been proposed to achieve robust and accurate results, like FPN [25, 11], DCN [6, 57], SNIP [43, 44], Cascade R-CNN [3], etc.

Regarding text detection in STR, multi-scale and irregular text regions are the principal problems appearing in real-world scenes. EAST [56] and FOTS [29], proposed for text detection, are still limited on arbitrary-shaped text. Mask TextSpotter [32] adopts a Mask R-CNN based instance segmentation method for end-to-end trainable scene text spotting but needs character-level segmentation. In this paper, we utilize Cascade R-CNN based irregular text region segmentation for accurate text localization, which assists the subsequent text recognition stage.

2.2 Text Recognition

Connectionist Temporal Classification (CTC) [14] is a widely used method in text recognition that does not require knowing the alignment between the input and the output. Convolutional Recurrent Neural Network (CRNN), one of the most well-known methods in text recognition, is a combination of Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and CTC loss for image-based sequence recognition tasks. However, CTC based methods are incapable of handling multi-oriented text because the feature is sliced along a single orientation. With the advent of attention mechanisms in deep learning, [55] demonstrates the effectiveness of the attention based method for alignment between visual features and word embeddings in image captioning.

Attention mechanisms were introduced into deep learning in 2014, including the reinforcement learning based Recurrent Attention Model (RAM) [35] in computer vision and the end-to-end trainable Bahdanau Attention (Additive Attention) [2], which solves the context alignment problem existing in Seq2Seq [46]. Luong Attention (Multiplicative Attention) [31] summarizes and extends Bahdanau Attention to a general attention mechanism formulated in 3 steps: a score function for the similarity metric, an alignment function for the attention weight, and a context function for the aligned context feature. [52] proposes the Transformer architecture based solely on a multi-head attention mechanism, motivated by self-attention representation learning [4] as well as end-to-end memory networks [45], and makes the Transformer parallelizable in the same way as CNN based methods like ByteNet [20] and ConvS2S [10].

Inspired by the practicable end-to-end soft attention mechanism, we propose an attention based text recognition method for arbitrary-shaped text regions segmented by Cascade Mask R-CNN, which aligns the visual context with the corresponding character embedding in a semi-supervised optimization method.

3 Architecture

As illustrated in Fig. 2, with the Cascade Mask R-CNN [3, 15] based instance segmentation method, we first extract the text region from the image and then feed the masked image to the attention based text recognition module for sequence classification.

3.1 Cascade Mask R-CNN with Text-Aware RPN

We adopt the Cascade R-CNN [3] based instance segmentation method for text detection, but text detection still differs slightly from general object detection. Because the majority of text regions are long narrow rectangles, it is inappropriate to apply a conventional anchor-based two-stage object detector to text detection directly.

With the default Faster R-CNN [40] positive-anchor criterion, which requires an Intersection over Union (IOU) higher than 0.7, there are few positive anchors for text due to unmatched shapes, which may make the Region Proposal Network (RPN) difficult to converge during training. It is essential to redesign the prior anchor sampling strategy with proper anchor ratios and scales so that more anchors match the ground truth as positives, namely True Positives (TP). Furthermore, inspired by recent architectures which demonstrate the effectiveness of $1 \times n$ and $n \times 1$ kernels, including InceptionV3 [48] and TextBoxes [24, 34], an inception text-aware RPN architecture as shown in Fig. 3 is proposed for rich and robust text feature extraction. Cascade R-CNN, a multi-stage extension of the two-stage R-CNN object detection framework, is exploited to obtain high quality object detection, which can effectively reject close False Positives (FP) and improve the detection performance. Finally, the mask branch as in Mask R-CNN [15] in the text detection module predicts the final masked arbitrary-shaped text regions for subsequent scene text recognition.

Figure 3: Inception text-aware RPN architecture. The architecture is adopted for text detection inspired from inception structure.

Regarding Cascade Mask R-CNN for text segmentation, we optimize the model by minimizing the multi-task loss function as below:

$$L = L_{rpn} + L_{cascade} + L_{mask} + \lambda L_{reg}$$

$$L_{cascade} = \sum_{i=1}^{N} \left[ L_{cls}(y_i, y_i^*) + L_{loc}(t_i, t_i^*) \right]$$

where $N$ is the number of cascade stages, $y_i$ is the label logits and $y_i^*$ is the ground-truth one-hot labels, $t_i$ is the estimated bounding box transformation parameters and $t_i^*$ is the ground-truth. $L_{rpn}$ and $L_{mask}$ are the same as in Mask R-CNN [15]. $L_{cascade}$ is the summation of the multi-stage $L_{cls}$ and $L_{loc}$ with increasing IOU thresholds, which represent the cross-entropy loss and the smoothed $L_1$ loss in Fast R-CNN [12] respectively. $\lambda$ is the weight decay factor and $L_{reg}$ represents the regularization loss.
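The way the individual terms combine can be sketched in plain Python. This is only an illustration under stated assumptions: the function names and the per-stage input layout are hypothetical, each stage is shown for a single sample, and the RPN, mask and regularization terms are passed in as precomputed scalars rather than derived from a real network.

```python
import math

def cross_entropy(logits, one_hot):
    """Softmax cross-entropy for one sample (classification term)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, one_hot))

def smooth_l1(pred, target):
    """Smoothed L1 loss summed over box transformation parameters."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def multi_task_loss(l_rpn, stages, l_mask, l_reg, weight_decay=1e-4):
    """Combine RPN, cascade, mask and regularization terms.

    stages: one (logits, one_hot, box_pred, box_target) tuple per cascade
    stage (hypothetical layout chosen for this sketch).
    """
    l_cascade = sum(cross_entropy(y, y_gt) + smooth_l1(t, t_gt)
                    for y, y_gt, t, t_gt in stages)
    return l_rpn + l_cascade + l_mask + weight_decay * l_reg
```

In a real detector each stage would use its own IOU threshold to assign labels before these per-stage terms are computed; that assignment step is outside the scope of this sketch.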

3.2 Attention based Cross-Modal Alignment and Sequence Classification

We regard text recognition as a cross-domain representation learning and alignment problem that involves both computer vision and sequence transduction, rather than as simple image classification. The reason is that text in realistic scenes always carries linguistic meaning. Namely, scene text recognition should benefit from language model based natural language processing methods.

We aim to propose a universal sequence classification framework that is capable of integrating multi-lingual arbitrary-shaped text recognition in a single recognition model. Long Short-term Memory (LSTM) [17], designed for sequence modeling, is adopted for text sequence recognition in our framework. Furthermore, for the purpose of irregular text recognition, we propose a visual attention based method, following the Bahdanau Attention Mechanism [2], which learns the cross-modal alignment between the visual feature and the LSTM hidden state for classification in a semi-supervised way.

With the global word embedding matrix $E$ and the CNN feature map $V$, we can model sequence classification as a soft end-to-end attention based alignment problem between the visual context feature, computed from $V$ with attention weight $\alpha$, and the corresponding embedding in $E$ at each time step $t$. Let $C = (c_1, \dots, c_T)$ be the context features and let $e = (e_1, \dots, e_T)$ be the embeddings of the $T$ time steps in the target string. Using the chain rule, the conditional probability of the sequence $P(e \mid C)$ can be decomposed as follows:

$$P(e \mid C) = \prod_{t=1}^{T} P(e_t \mid \bar{e}_{t-1}, h_{t-1}, c_t)$$

where $T$ represents the maximum sequence length with the End of Sequence (EOS) character proposed in machine translation and $\bar{e}_{t-1}$ is the expectation of the previous embedding. Simultaneously, $h_{t-1}$ is the previous LSTM hidden state and $c_t$ is the context vector calculated according to the following attention mechanism formulas:

$$\alpha_{t,i} = \frac{\exp(s_{t,i})}{\sum_{j=1}^{H \times W} \exp(s_{t,j})}, \qquad c_t = \sum_{i=1}^{H \times W} \alpha_{t,i} \, v_i$$

where $v_i$ is the feature vector at position $i$ of $V$, the initial LSTM hidden state $h_0$ is computed from $\bar{V}$ representing the global feature, and the LSTM architecture is the same as that in image captioning [55]. In addition, we choose the Bahdanau Attention Mechanism [2] as our default, formulated as follows:

$$s_{t,i} = w^\top \tanh(U_h h_{t-1} + U_v v_i), \qquad i = 1, \dots, H \times W$$

where $H$ and $W$ are the CNN feature map height and width respectively.
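A single additive (Bahdanau-style) attention step over the flattened feature map can be sketched as follows. This is a minimal pure-Python illustration, not the released implementation: the function name, the parameter names `U_h`, `U_v` and `w`, and the list-based tensor layout are all assumptions made for readability.

```python
import math

def additive_attention(h_prev, features, U_h, U_v, w):
    """One additive attention step.

    h_prev:   previous LSTM hidden state (length d_h)
    features: H*W flattened CNN feature vectors, each of length d_v
    U_h, U_v: projection matrices (d_a x d_h and d_a x d_v)
    w:        scoring vector (length d_a)
    Returns (attention weights, context vector).
    """
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    proj_h = matvec(U_h, h_prev)
    scores = []
    for v in features:
        proj_v = matvec(U_v, v)
        scores.append(sum(wk * math.tanh(a + b)
                          for wk, a, b in zip(w, proj_h, proj_v)))

    # Softmax over all H*W positions, shifted by the max for stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]

    # Context vector: attention-weighted sum of the feature vectors.
    d_v = len(features[0])
    context = [sum(a * v[k] for a, v in zip(alpha, features))
               for k in range(d_v)]
    return alpha, context
```

In the recognition model this step would be repeated at every time step, with the resulting context vector fed to the LSTM together with the previous embedding.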

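The final optimization function, namely the Masked Cross Entropy (MCE) loss for dynamic sequence length, is formulated over the $T$ time steps as below:

$$L_{MCE} = -\sum_{t=1}^{T} m_t \, {y_t^*}^\top \log \mathrm{softmax}(y_t)$$

where $y_t$ is the logits calculated from the previous embedding $\bar{e}_{t-1}$, the current context feature $c_t$ and the LSTM hidden state $h_t$ as below:

$$y_t = f(\bar{e}_{t-1}, c_t, h_t)$$

with $f$ denoting the output layer. $y_t^*$ is the one-hot ground-truth sequence label and $m_t$ is the corresponding sequence mask for dynamic sequence labels, as follows:

$$m_t = \begin{cases} 1, & t \le l \\ 0, & t > l \end{cases}$$

where $l$ is the length of the ground-truth sequence including the EOS character.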
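A masked cross entropy of this kind can be sketched in a few lines of Python. The function name and the use of integer class indices in place of one-hot vectors are assumptions of this sketch, and the loss is returned as a plain sum over valid steps; an implementation might instead average over the number of valid steps.

```python
import math

def masked_cross_entropy(logits, labels, mask):
    """Sum of per-step softmax cross-entropy, ignoring masked-out steps.

    logits: T lists of class scores
    labels: T integer class indices (equivalent to one-hot targets)
    mask:   T values in {0, 1}; 0 marks padded steps
    """
    total = 0.0
    for y_t, label, m_t in zip(logits, labels, mask):
        mx = max(y_t)
        log_z = mx + math.log(sum(math.exp(x - mx) for x in y_t))
        total += m_t * (log_z - y_t[label])
    return total
```

Because padded steps are multiplied by zero, whatever the model predicts after the end of the sequence has no effect on the loss or its gradient.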


4 Experiment

Figure 4: Attention mechanism visualization. The results show the correct attention location in sequence and demonstrate the effectiveness of attention mechanism based arbitrary-shaped text recognition.

In this section, we present our instance segmentation based text detection and attention based text recognition, and describe in detail how the models are trained. We are not able to build an end-to-end STR framework due to the limited memory of the 1080Ti GPU. Our STR algorithm is implemented based on TensorFlow [1] and its high-level API Tensorpack [53], which combines speed and flexibility.

4.1 Text Segmentation

We follow the default protocol in Mask R-CNN and Cascade R-CNN with a ResNet101-FPN backbone, apart from some parameters modified for text detection. We use an RPN batch size of 128 and a batch size of 256 for the ROIAlign head, half of the default Mask R-CNN values, because text detection yields fewer TP anchors. Note that for the purpose of data augmentation, we conduct random affine transformations to generate more text images, including rotation, translation, shear, resize and crop transforms.

Stochastic Gradient Descent (SGD) with momentum is adopted as the default text segmentation optimizer, with a learning rate of 0.01 and momentum of 0.9. We use a weight decay of 0.0001 and train on 8 GPUs with 1 image per GPU for 360k iterations, with the learning rate decreased by a factor of 10 at the 240k and 320k iterations. We set the maximum image size to 1600 so that small text regions are detected accurately. Training images are resized for multi-scale training with their scale (shorter edge) in the range of 640 to 1280 pixels. A detection model pre-trained on ICDAR2017-MLT [36] and ICDAR2017-RCTW [42] is used for weight initialization. Furthermore, RPN anchor ratios of [1/9, 1/4, 1/2, 1, 2, 4, 9] are adopted for extremely irregular text detection. Then the mask branch predicts the final segmented text region for subsequent text recognition.
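The effect of the wider ratio set on anchor matching can be illustrated with a small calculation. The sketch below is not taken from the released code: the helper names, the single anchor scale of 96 and the 288 x 32 ground-truth box are hypothetical, the ratio is assumed to mean height over width, and the comparison set [1/2, 1, 2] is the common Faster R-CNN default.

```python
import math

def make_anchor(scale, ratio):
    """(width, height) of an anchor with area scale**2 and height/width == ratio."""
    return scale / math.sqrt(ratio), scale * math.sqrt(ratio)

def centered_iou(box_a, box_b):
    """IoU of two (width, height) boxes that share the same center."""
    inter = min(box_a[0], box_b[0]) * min(box_a[1], box_b[1])
    union = box_a[0] * box_a[1] + box_b[0] * box_b[1] - inter
    return inter / union

def best_iou(gt_box, ratios, scale):
    """Best IoU between a ground-truth box and the anchors at one scale."""
    return max(centered_iou(gt_box, make_anchor(scale, r)) for r in ratios)

default_ratios = [1/2, 1, 2]
text_ratios = [1/9, 1/4, 1/2, 1, 2, 4, 9]
gt = (288.0, 32.0)  # hypothetical long, narrow text line (width, height)

print("default ratios:", round(best_iou(gt, default_ratios, 96), 3))
print("text ratios:   ", round(best_iou(gt, text_ratios, 96), 3))
```

With the default ratios, no anchor reaches the 0.7 IOU threshold for this box even when perfectly centered, whereas the 1/9 ratio matches it almost exactly, which is the motivation for the extended ratio set.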

4.2 Sequence Classification

Figure 5: Arbitrary-oriented text detection and recognition.
Figure 6: STR visualization. The results prove that our framework is instrumental for multi-lingual arbitrary-shaped STR and simultaneously robust for realistic scenes.

In this part, we exploit InceptionV4 [47] as our text recognition backbone, which is appropriate for fine-grained, distinguishable text feature extraction. Furthermore, following Google's Neural Machine Translation (GNMT) [54, 30] and image captioning methods [55], we model text recognition as sequence transduction between the visual feature and the sequence embedding by a Bahdanau Attention based LSTM. In addition, EOS and a sequence mask are adopted for sequence padding to transform variable-length sequences to a fixed length.

For the purpose of universal multi-lingual arbitrary-shaped text recognition, we conduct random affine transformation as well as random color jittering on the text image and its corresponding polygons, and then crop and resize the image so that its longer side is 256 pixels. Finally, we randomly translate and flip the augmented image and pad the irregular text region with zeros to a fixed size. The fixed maximum sequence length, namely the LSTM sequence length, is set to 26 (27 including EOS). Compared with a conventional LSTM, our LSTM takes as input not only the previous hidden state and cell state, but also the current attention feature and the previous embedding. The LSTM size is 512 and the embedding matrix shape is $5435 \times 256$, representing an embedding size of 256 and 5435 characters including Chinese characters, English letters, digits and other symbols.

We train our recognition model via Adam [21], an adaptive learning rate optimizer, with a learning rate of 0.0001. We use a weight decay of 0.00001 and train with a batch size of 24 images on 1 GPU for 460k iterations, with the learning rate decreased by a factor of 10 at the 320k and 400k iterations. An InceptionV4 model pre-trained on ImageNet is loaded for initialization, and multiple datasets including LSVT, ArT, ReCTS, COCO-Text and ICDAR2017 are used to train the multi-lingual irregular text recognition model. Meanwhile, the padded fixed-length sequences with corresponding sequence masks, as in GNMT, make it possible to compute the masked cross entropy loss for dynamic sequence length.
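The padding step that produces fixed-length labels and their masks can be sketched as follows. The function name, the choice of the EOS id as the padding value and the error handling are assumptions of this sketch; only the fixed length of 27 (26 characters plus EOS) comes from the text above.

```python
def pad_with_eos(char_ids, eos_id, max_len=27):
    """Append EOS, pad to max_len, and build the matching 0/1 sequence mask.

    The mask is 1 for every real character and for the first EOS,
    and 0 for the padding positions that follow.
    """
    if len(char_ids) + 1 > max_len:
        raise ValueError("sequence longer than max_len - 1 characters")
    n_valid = len(char_ids) + 1  # characters plus one EOS
    labels = list(char_ids) + [eos_id] * (max_len - len(char_ids))
    mask = [1] * n_valid + [0] * (max_len - n_valid)
    return labels, mask

# Example with hypothetical character ids and a short fixed length.
labels, mask = pad_with_eos([5, 6, 7], eos_id=0, max_len=6)
print(labels)
print(mask)
```

Keeping the first EOS inside the mask lets the model learn when to stop, while the remaining padded positions contribute nothing to the loss.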

As shown in Fig. 4, our algorithm is capable of learning the alignment and classifying the sequence in a semi-supervised manner. Meanwhile, Fig. 5 demonstrates that our model is able to handle arbitrary-oriented text recognition.

4.3 Results

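| Recall | Precision | Hmean | 1-N.E.D |
|--------|-----------|-------|---------|
| 49.29% | 56.03% | 52.45% | 53.86% |

Table 1: STR Results on ICDAR2019-ArT (Latin Only).

| Recall | Precision | Hmean | 1-N.E.D |
|--------|-----------|-------|---------|
| 47.98% | 52.56% | 50.17% | 54.91% |

Table 2: STR Results on ICDAR2019-ArT (Latin and Chinese).

| Recall | Precision | Hmean | 1-N.E.D |
|--------|-----------|-------|---------|
| 53.23% | 59.37% | 56.13% | 63.16% |

Table 3: STR Results on ICDAR2019-LSVT.

| Recall | Precision | Hmean | 1-N.E.D |
|--------|-----------|-------|---------|
| 93.62% | 87.22% | 90.30% | 76.60% |

Table 4: STR Results on ICDAR2019-ReCTS.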

Fig. 6 shows the text detection and recognition results. As reported in [5], our algorithm achieves the championship in the task of scene text spotting with H-mean scores of 52.45% on Latin Only and 50.17% on Latin and Chinese. The Normalized Edit Distance metric (specifically 1-N.E.D), which is treated as the official ranking metric, is formulated as below:

$$\text{1-N.E.D} = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{D(s_i, \hat{s}_i)}{\max(|s_i|, |\hat{s}_i|)}$$

where $D$ stands for the Levenshtein Distance, $s_i$ and $\hat{s}_i$ denote the predicted text line in string and the corresponding ground truth in the regions, and $N$ is the number of such pairs. Note that the results are obtained with a single model. Numeric results on Latin Only and on Latin and Chinese are listed in Table 1 and Table 2, respectively. Moreover, we also achieve competitive results on ICDAR2019-LSVT in Table 3 and ICDAR2019-ReCTS in Table 4, which demonstrate the effectiveness and robustness of our STR algorithm. Note that the detection model is trained on each dataset separately, starting from the detection model pre-trained on ICDAR2017-MLT and ICDAR2017-RCTW, while the recognition model is the same one as mentioned in Section 4.2.
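The two reported metrics can be reproduced from their definitions with a short script. This is a sketch only: the function names are illustrative, H-mean is computed as the harmonic mean of precision and recall, and pairing predictions with ground truths is assumed to have been done beforehand.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def one_minus_ned(predictions, ground_truths):
    """1 - mean normalized edit distance over paired strings."""
    total = 0.0
    for s, s_hat in zip(predictions, ground_truths):
        longest = max(len(s), len(s_hat))
        if longest > 0:
            total += levenshtein(s, s_hat) / longest
    return 1.0 - total / len(predictions)

def hmean(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

Applying `hmean` to the precision and recall values in Tables 1 to 4 gives results that agree with the listed H-mean values to within rounding.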

5 Conclusion and Future Work

Figure 7: Failure case. The failure case indicates the necessity of end-to-end STR, in which recognition might contribute to detection.

This paper proposes a universal framework for STR that combines object detection with an attention based language model. The competition results in ICDAR2019, including LSVT, ArT and ReCTS, demonstrate the effectiveness and robustness of our algorithm. In addition, it is convenient to extend the current algorithm by replacing the detection or attention mechanism with better architectures, such as the fast and accurate detector EfficientDet [50] or the Transformer [52] based self-attention mechanism. Moreover, it is feasible to synthesize text images from natural language processing corpora for data augmentation, which would also help the attention-based language model.

In the future, it is imperative to build an end-to-end differentiable STR algorithm covering both detection and recognition, which requires GPUs with large memory like the V100. It is essential to eliminate detection failure cases such as the one illustrated in Fig. 7 by using semantic information. Language model based methods like BERT [8], which take the context of whole sentences into consideration instead of merely the previous word, should be beneficial in our framework. Furthermore, visual question answering or semantic analysis modules can be integrated with the framework for text based high-level semantic applications.