Scene text recognition
Deep learning based methods have achieved remarkable progress in Scene Text Recognition (STR), one of the classic problems in computer vision. In this paper, we propose a feasible framework for multi-lingual arbitrary-shaped STR, including instance segmentation based text detection and a language model based attention mechanism for text recognition. Our STR algorithm not only recognizes Latin and non-Latin characters, but also supports arbitrary-shaped text recognition. Our method wins the championship on the Scene Text Spotting Task (Latin Only, Latin and Chinese) of the ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text competition. Code is available at https://github.com/zhang0jhon/AttentionOCR.
Deep learning has brought a significant revolution in computer vision and machine learning in recent years. Since AlexNet won the championship on ImageNet image classification in 2012, an increasing number of researchers have paid attention to deep learning and its wide applications, such as computer vision, natural language processing and speech recognition. With the rapid development of deep learning, tremendous improvement has been achieved in various research areas, especially in machine learning and artificial intelligence. For instance, Batch Normalization and ResNet make it possible to train deeper neural networks stably and alleviate the gradient vanishing and exploding problems. The R-CNN based object detection series [13, 12, 40] significantly improves mean Average Precision (mAP), and Mask R-CNN extends R-CNN based detection methods to instance segmentation. The semi-supervised attention mechanism in deep learning has demonstrated its effectiveness in both computer vision [35, 55, 18] and natural language processing [2, 31, 52]. Moreover, Neural Architecture Search (NAS) [58, 27] can automatically find the optimal network in various areas, yielding architectures such as EfficientNet and EfficientDet.
STR, one of the most widely used techniques, benefits greatly from the boom of deep learning. Nevertheless, several problems remain in realistic scenarios, as shown in Fig. 1. Firstly, the variety of text regions, including horizontal, vertical and curved text, requires a high-quality detector that can handle arbitrary-shaped text. Secondly, it is difficult for a single general model to recognize text regions across a variety of languages and shapes.
In this paper, we treat STR as a cross-domain problem involving both object detection and language model based sequence transduction, rather than simple text detection and classification, thereby benefiting from recent achievements in both computer vision and natural language processing. Instead of concentrating on scene text detection or text recognition individually, we propose a feasible framework for multi-lingual arbitrary-shaped scene text spotting. Our algorithm adopts a general instance segmentation method for robust arbitrary-shaped text detection, and simultaneously takes context information into consideration for text recognition via an attention based Word2vec method. Furthermore, word embedding based arbitrary-shaped text recognition guarantees the convergence of the end-to-end soft attention mechanism via a weakly supervised method. In brief, we present a universal and robust proposal for real-world STR by combining instance segmentation and attention based sequence transduction methods.
The main contributions of this paper consist of three aspects:
1. We propose a feasible framework for STR that is capable of multi-lingual arbitrary-shaped text detection and recognition; results on several STR datasets demonstrate the effectiveness and robustness of our algorithm.
2. We propose an attention based text recognition method which recognizes irregular text in Latin or non-Latin characters with a single model and combines with language model based methods.
3. Our algorithm is easy to extend with state-of-the-art attention mechanisms for high-level applications such as text based visual question answering, semantic parsing, etc.
In this section, we present recent work relevant to STR, covering object detection and attention mechanisms for alignment in sequence transduction. Our STR framework is designed with reference to the previous work introduced below, aiming to handle STR in a general way that combines object detection and attention mechanism techniques.
Object detection is one of the fundamental research areas in computer vision and has attracted extensive attention, with remarkable progress in the last several years. Many efficient object detector architectures exist, such as one-stage detectors like YOLO [37, 38, 39], SSD and RetinaNet, as well as anchor-free detectors [23, 9, 51]. Nevertheless, the majority of one-stage detectors suffer from worse performance on small objects and lower precision compared with two-stage object detectors. Two-stage detectors perform better on realistic scenes, in spite of their higher computational requirements. Further improved architectures have been proposed to achieve robust and accurate results, like FPN [25, 11], DCN [6, 57], SNIP [43, 44], Cascade R-CNN, etc.
Regarding text detection in STR, multi-scale and irregular text regions are the principal problems in real-world scenes. EAST and FOTS, proposed for text detection, are still limited on arbitrary-shaped text. Mask TextSpotter adopts a Mask R-CNN based instance segmentation method for end-to-end trainable scene text spotting, but needs character-level segmentation. In this paper, we utilize Cascade R-CNN based irregular text region segmentation for accurate text localization, which assists the subsequent text recognition stage.
Connectionist Temporal Classification (CTC) is a widely used method in text recognition when the alignment between the input and the output is unknown. The Convolutional Recurrent Neural Network (CRNN), one of the best-known methods in text recognition, combines a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) and the CTC loss for image-based sequence recognition tasks. However, CTC based methods are incapable of handling multi-oriented text because of their single-oriented feature slices. With the advent of attention mechanisms in deep learning, work on image captioning has demonstrated the effectiveness of attention based alignment between visual features and word embeddings.
Inspired by the practicable end-to-end soft attention mechanism, we propose an attention based text recognition method for arbitrary-shaped text regions segmented by Cascade Mask R-CNN, which aligns visual context with the corresponding character embedding in a semi-supervised optimization manner.
As illustrated in Fig. 2, with a Cascade Mask R-CNN [3, 15] based instance segmentation method, we first extract the text region from the image and then feed the masked image to the attention based text recognition module for sequence classification.
We adopt a Cascade R-CNN based instance segmentation method for text detection, though it differs slightly from that in generic object detection. Because the majority of text regions are long, narrow rectangles, it is inappropriate to apply a conventional anchor-based two-stage object detector to text detection directly.
With the Faster R-CNN default positive-anchor parameter of an Intersection over Union (IoU) threshold higher than 0.7, there are few positive anchors for text due to unmatched shapes, which may hinder convergence of the Region Proposal Network (RPN) during training. It is therefore essential to redesign the prior anchor sampling strategy with proper anchor ratios and scales to obtain more matched ground-truth positive anchors, namely True Positives (TP). Furthermore, inspired by recent architectures that demonstrate the effectiveness of asymmetric convolution kernels, including InceptionV3 and TextBoxes [24, 34], an inception text-aware RPN architecture, as shown in Fig. 3, is proposed for rich and robust text feature extraction. Cascade R-CNN, a multi-stage extension of the two-stage R-CNN object detection framework, is exploited to obtain high-quality detections, effectively rejecting close False Positives (FP) and improving detection performance. Finally, the mask branch, as in Mask R-CNN, predicts the final masked arbitrary-shaped text regions for the subsequent scene text recognition.
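The multi-branch idea behind the inception text-aware RPN head can be sketched at shape level as follows. This is a minimal NumPy stand-in, not the paper's implementation: the branch count, kernel sizes (1x7 and 7x1) and channel widths are illustrative assumptions, and the naive convolution replaces a real framework layer.

```python
import numpy as np

def conv2d_same(x, kh, kw, out_ch, rng):
    """Naive 'same'-padded 2D convolution (NHWC layout) with random
    weights; a shape-level stand-in for a real conv layer."""
    n, h, w, c = x.shape
    wgt = rng.standard_normal((kh, kw, c, out_ch)) * 0.01
    pad_h, pad_w = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (pad_h, pad_h), (pad_w, pad_w), (0, 0)))
    out = np.zeros((n, h, w, out_ch))
    for i in range(h):
        for j in range(w):
            patch = xp[:, i:i + kh, j:j + kw, :]  # (n, kh, kw, c)
            out[:, i, j, :] = np.tensordot(patch, wgt,
                                           axes=([1, 2, 3], [0, 1, 2]))
    return out

def inception_text_rpn_head(feat, rng):
    """Multi-branch head: a square kernel plus wide (1xk) and tall (kx1)
    kernels suited to long text lines, concatenated channel-wise before
    the RPN classification/regression predictions."""
    b_sq   = conv2d_same(feat, 3, 3, 64, rng)   # generic texture
    b_wide = conv2d_same(feat, 1, 7, 64, rng)   # long horizontal text
    b_tall = conv2d_same(feat, 7, 1, 64, rng)   # vertical text
    return np.concatenate([b_sq, b_wide, b_tall], axis=-1)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((1, 8, 8, 16))  # toy backbone feature map
out = inception_text_rpn_head(fmap, rng)
print(out.shape)  # (1, 8, 8, 192)
```

The asymmetric branches keep the receptive field elongated in one direction, which matches the long, narrow shape of most text regions.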
Regarding Cascade Mask R-CNN for text segmentation, we optimize the model by minimizing the multi-task loss function below:

L = \sum_{n=1}^{N} \left[ L_{cls}(p_n, y) + L_{box}(t_n, t^{*}) \right] + L_{mask} + \lambda L_{reg},

where N is the number of cascade stages, p_n denotes the label logits and y the ground-truth one-hot labels, and t_n denotes the estimated bounding box transformation parameters and t^{*} the ground-truth ones. The RPN loss and L_{mask} are the same as in Mask R-CNN. The summation of the multi-stage L_{cls} and L_{box} terms with increasing IoU thresholds uses the cross-entropy loss and the smoothed L1 loss from Fast R-CNN, respectively. \lambda is the weight decay factor and L_{reg} represents the regularization loss.
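How the per-stage terms combine into the total loss can be sketched as follows; the stage losses and the weight-decay value are illustrative numbers, not results from the paper.

```python
def cascade_mask_loss(cls_losses, box_losses, mask_loss, reg_loss,
                      weight_decay=1e-4):
    """Total multi-task loss: per-stage classification + box terms
    summed over the cascade stages, plus the mask loss and a
    weight-decay regularization term."""
    assert len(cls_losses) == len(box_losses)
    det = sum(c + b for c, b in zip(cls_losses, box_losses))
    return det + mask_loss + weight_decay * reg_loss

# Three cascade stages (e.g. trained with increasing IoU thresholds);
# all numeric values below are made up for illustration.
total = cascade_mask_loss(cls_losses=[0.9, 0.6, 0.4],
                          box_losses=[0.5, 0.3, 0.2],
                          mask_loss=0.7, reg_loss=100.0)
print(round(total, 2))  # 3.61
```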
Regarding text recognition, we consider it a cross-domain representation learning and alignment problem involving computer vision and sequence transduction, rather than simple image classification, because text in realistic scenes always carries linguistic significance. In other words, scene text recognition should benefit from language model based natural language processing methods.
We aim to propose a universal sequence classification framework that integrates multi-lingual arbitrary-shaped text recognition into a single recognition model. Long Short-Term Memory (LSTM), well suited to sequence alignment, is adopted for text sequence recognition in our framework. Furthermore, for the purpose of irregular text recognition, we propose a visual attention based method, following the Bahdanau attention mechanism, which learns cross-modal alignment between visual features and the LSTM hidden state for classification in a semi-supervised way.
With the global word embedding matrix E and the CNN feature map F = \{f_1, \dots, f_{HW}\}, we can model sequence classification as a soft end-to-end attention based alignment problem between the visual context feature c_t, pooled from F by the attention weights \alpha_t, and the corresponding embedding e_t in E at each time step t. Let (c_1, \dots, c_T) be the context features and (e_1, \dots, e_T) the embeddings of the target sequence s; the sequence probability p(s) can be decomposed as:

p(s) = \prod_{t=1}^{T} p(s_t \mid c_t, h_{t-1}, e_{t-1}),

where T represents the maximum sequence length including the End of Sequence (EOS) character proposed in machine translation, e_{t-1} is the embedding of the previous character, h_{t-1} is the previous LSTM hidden state, and c_t is the context vector calculated by the attention mechanism:

h_t = \mathrm{LSTM}(h_{t-1}, c_t, e_{t-1}),

where the initial LSTM hidden state h_0 is computed from the mean of F, representing the global feature, and the LSTM architecture is the same as that in image captioning. In addition, we choose the Bahdanau attention mechanism as our default, formulated as:

u_{t,i} = v^{\top} \tanh(W_h h_{t-1} + W_f f_i), \quad
\alpha_{t,i} = \frac{\exp(u_{t,i})}{\sum_{j=1}^{HW} \exp(u_{t,j})}, \quad
c_t = \sum_{i=1}^{HW} \alpha_{t,i} f_i,

where H and W are the CNN feature map height and width, respectively.
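A single Bahdanau attention step over a flattened feature map can be sketched in NumPy as follows; all dimensions and weight initializations are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_step(h_prev, feats, Wh, Wf, v):
    """One attention step: score each spatial feature against the
    previous LSTM hidden state, normalize the scores into weights,
    and pool a context vector from the feature map."""
    scores = np.tanh(h_prev @ Wh + feats @ Wf) @ v  # (H*W,)
    alpha = softmax(scores)                          # attention weights
    context = alpha @ feats                          # (feat_dim,)
    return context, alpha

rng = np.random.default_rng(1)
H, W, d_feat, d_hid, d_att = 4, 4, 32, 64, 48
feats = rng.standard_normal((H * W, d_feat))  # flattened CNN feature map
h_prev = rng.standard_normal(d_hid)           # previous LSTM hidden state
Wh = rng.standard_normal((d_hid, d_att)) * 0.1
Wf = rng.standard_normal((d_feat, d_att)) * 0.1
v = rng.standard_normal(d_att) * 0.1

ctx, alpha = bahdanau_step(h_prev, feats, Wh, Wf, v)
print(ctx.shape, round(alpha.sum(), 6))  # (32,) 1.0
```

At each decoding step the context vector ctx would be fed, together with the previous embedding, into the LSTM cell.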
The final optimization function, namely the Masked Cross Entropy (MCE) loss for dynamic sequence length, is formulated as below:

L_{MCE} = -\frac{1}{\sum_{t=1}^{T} m_t} \sum_{t=1}^{T} m_t \, y_t^{\top} \log \mathrm{softmax}(o_t),

where o_t is the logit vector calculated from the previous embedding e_{t-1}, the current context feature c_t and the LSTM hidden state h_t as below:

o_t = W_o [h_t; c_t; e_{t-1}] + b_o,

y_t is the one-hot ground-truth label at step t, and m_t is the corresponding sequence mask for dynamic sequence lengths:

m_t = \begin{cases} 1 & t \le |s| \\ 0 & \text{otherwise.} \end{cases}
In this section, we present our instance segmentation based text detection, our attention based text recognition, and how to train the models in detail. We are not able to build a fully end-to-end STR framework due to the limited memory of 1080Ti GPUs. Our STR algorithm is implemented with TensorFlow and its high-level API Tensorpack, which offers both speed and flexibility.
We follow the default protocol of Mask R-CNN and Cascade R-CNN with a ResNet101-FPN backbone, except for some parameters modified for text detection. We use an RPN batch size of 128 and a batch size of 256 for the RoIAlign head, half of the default Mask R-CNN values due to the fewer TP anchors in text detection. Note that, for data augmentation, we conduct random affine transformations, including rotation, translation, shearing, resizing and cropping, to generate more text images.
Stochastic Gradient Descent (SGD) with momentum is adopted as the default optimizer for text segmentation, with a learning rate of 0.01 and a momentum of 0.9. We use a weight decay of 0.0001 and train on 8 GPUs with 1 image per GPU for 360k iterations, decreasing the learning rate by a factor of 10 at the 240k-th and 320k-th iterations. We set the maximum image size to 1600 for accurate small text regions. Training images are resized for multi-scale training with their scale (shorter edge) in the range of 640 to 1280 pixels. A detection model pre-trained on ICDAR2017-MLT and ICDAR2017-RCTW is used for weight initialization. Furthermore, RPN anchor ratios of [1/9, 1/4, 1/2, 1, 2, 4, 9] are adopted for extremely irregular text detection. The mask branch then predicts the final segmented text region for the subsequent text recognition.
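Generating anchor shapes for ratios as extreme as 1/9 and 9 can be sketched as follows; the base size, the single scale, and the w/h ratio convention are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def make_anchors(base_size, ratios, scales):
    """Generate (w, h) anchor shapes with a fixed area per scale:
    w / h = ratio and w * h = (base_size * scale) ** 2."""
    anchors = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in ratios:
            w = np.sqrt(area * r)
            h = np.sqrt(area / r)
            anchors.append((w, h))
    return anchors

ratios = [1/9, 1/4, 1/2, 1, 2, 4, 9]
anchors = make_anchors(base_size=16, ratios=ratios, scales=[1])
for w, h in anchors:
    print(round(w, 1), round(h, 1))  # tall anchors first, wide last
```

A ratio of 1/9 yields a very tall anchor (roughly 5.3 x 48 at base size 16) and 9 a very wide one, covering vertical and long horizontal text lines that the default 1/2 to 2 range would miss.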
In this part, we exploit InceptionV4 as our text recognition backbone, which is appropriate for fine-grained, distinguishable text feature extraction. Furthermore, following Google's Neural Machine Translation (GNMT) [54, 30] and image captioning methods, we model text recognition as sequence transduction between visual features and sequence embeddings via a Bahdanau attention based LSTM. In addition, EOS and sequence masks are adopted for sequence padding, transforming variable-length sequences to a fixed length.
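The EOS-and-mask padding can be sketched as follows; the character ids and pad value are hypothetical.

```python
def pad_with_eos(seq_ids, max_len, eos_id, pad_id=0):
    """Append EOS, pad to a fixed length, and build the matching 0/1
    mask consumed by the masked cross-entropy loss."""
    s = seq_ids + [eos_id]
    assert len(s) <= max_len, "sequence longer than LSTM unroll length"
    mask = [1] * len(s) + [0] * (max_len - len(s))
    s = s + [pad_id] * (max_len - len(s))
    return s, mask

ids, mask = pad_with_eos([17, 42, 8], max_len=8, eos_id=1)
print(ids)   # [17, 42, 8, 1, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Every training sequence thus has the same length, so a static LSTM unroll can be used while the mask keeps the loss exact for shorter texts.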
For the purpose of universal multi-lingual arbitrary-shaped text recognition, we conduct random affine transformations as well as random color jittering on the text image and its corresponding polygons, and then crop and resize the image with a longer side of 256 pixels. Finally, we randomly translate and flip the augmented image and zero-pad the irregular text region to a fixed size. The fixed maximum sequence length, namely the LSTM sequence length, is set to 26 (27 in fact, including EOS). Compared with a conventional LSTM, our LSTM takes not only the previous hidden state and cell state, but also the current attention feature and the previous embedding as input. The LSTM size is 512 and the embedding matrix has shape 5435x256, representing an embedding size of 256 for 5435 characters including Chinese characters, English letters, digits and other symbols.
We train our recognition model with Adam, an adaptive learning rate optimizer, with a learning rate of 0.0001. We use a weight decay of 0.00001 and train with a batch size of 24 images on 1 GPU for 460k iterations, decreasing the learning rate by a factor of 10 at the 320k-th and 400k-th iterations. An InceptionV4 model pre-trained on ImageNet is loaded for initialization, and multiple datasets including LSVT, ArT, ReCTS, COCO-Text and ICDAR2017 are used to train the multi-lingual irregular text recognition model. Meanwhile, the fixed-length padded sequences with corresponding sequence masks, as in GNMT, make it possible to compute the masked cross-entropy loss for dynamic sequence lengths.
Fig. 6 shows the text detection and recognition results. As reported in the competition, our algorithm achieves the championship on the scene text spotting task with an H-mean score of 52.45% on Latin and 50.17% on Latin and Chinese. The Normalized Edit Distance metric (1-N.E.D specifically), treated as the official ranking metric, is formulated as below:

\mathrm{1\text{-}N.E.D} = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{D(s_i, \hat{s}_i)}{\max(|s_i|, |\hat{s}_i|)},

where D stands for the Levenshtein distance, and s_i and \hat{s}_i denote the predicted text line and the corresponding ground truth of region i. Note that the results are obtained with a single model. Numeric results on Latin Only, and on Latin and Chinese, are listed in Table 1 and Table 2, respectively. Moreover, we also achieve competitive results on ICDAR2019-LSVT in Table 3 and ICDAR2019-ReCTS in Table 4, which demonstrates the effectiveness and robustness of our STR algorithm. Note that the detection model is trained on the different datasets separately, with the pre-trained detection model from ICDAR2017-MLT and ICDAR2017-RCTW, while the recognition model is the same one mentioned in Section 4.2.
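For matched prediction/ground-truth pairs, the 1-N.E.D score can be computed as follows (the official protocol additionally handles region matching, which this sketch omits).

```python
def levenshtein(a, b):
    """Edit distance via a rolling-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def one_minus_ned(preds, gts):
    """1 - N.E.D: edit distance normalized by the longer string,
    averaged over all (prediction, ground-truth) pairs."""
    total = sum(levenshtein(p, g) / max(len(p), len(g), 1)
                for p, g in zip(preds, gts))
    return 1.0 - total / len(preds)

score = one_minus_ned(["hello", "world"], ["hallo", "world"])
print(round(score, 2))  # 0.9
```

Unlike exact-match accuracy, this metric gives partial credit: one wrong character in a five-character word costs 0.2 for that region instead of the full point.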
This paper proposes a universal framework for STR combining object detection with an attention based language model. The competition results on ICDAR2019, including LSVT, ArT and ReCTS, demonstrate the effectiveness and robustness of our algorithm. In addition, it is convenient to extend the current algorithm with state-of-the-art methods, e.g. replacing the detector or attention mechanism with better architectures such as the fast and accurate detector EfficientDet or Transformer based self-attention. Moreover, it is feasible to synthesize text images from natural language processing corpora for data augmentation, which is helpful for the attention based language model.
In the future, it is imperative to build an end-to-end differentiable STR algorithm with both detection and recognition, which requires GPUs with large memory such as the V100. It is also essential to eliminate detection failure cases, as illustrated in Fig. 7, with semantic information. Language model based methods like BERT, which take the context of the whole sentence into consideration instead of merely the previous word, should be beneficial in our framework. Furthermore, visual question answering or semantic analysis modules can be integrated into the framework for text based high-level semantic applications.