Scene Text Recognition from Two-Dimensional Perspective

Inspired by speech recognition, recent state-of-the-art algorithms mostly treat scene text recognition as a sequence prediction problem. Though achieving excellent performance, these methods usually neglect an important fact: text in images is actually distributed in two-dimensional space. This is quite different from speech, which is essentially a one-dimensional signal. In principle, directly compressing features of text into a one-dimensional form may lose useful information and introduce extra noise. In this paper, we approach scene text recognition from a two-dimensional perspective. A simple yet effective model, called Character Attention Fully Convolutional Network (CA-FCN), is devised for recognizing text of arbitrary shapes. Scene text recognition is realized with a semantic segmentation network, where an attention mechanism for characters is adopted. Combined with a word formation module, CA-FCN can simultaneously recognize the script and predict the position of each character. Experiments demonstrate that the proposed algorithm outperforms previous methods on both regular and irregular text datasets. Moreover, it is shown to be more robust to imprecise localizations in the text detection phase, which are very common in practice.



Scene text recognition has been an active research field in computer vision because it is a critical element of many real-world applications, such as street sign reading for driverless vehicles, human-computer interaction, assistive technologies for the blind, and guide board recognition [Rong, Yi, and Tian2016]. Compared with the mature field of document recognition, scene text recognition is still a challenging task due to large variations in text shapes, fonts, colors, backgrounds, etc.

Benefiting from the development of deep learning, most recent works [Shi, Bai, and Yao2017, Shi et al.2016] convert scene text recognition into sequence recognition, which greatly simplifies the problem and leads to strong performance on regular text. As shown in Fig. 1(a), they first encode the input image into a feature sequence and then apply decoders such as RNN [Hochreiter and Schmidhuber1997] and CTC [Graves et al.2006] to decode the target sequence. These methods produce good results when the text in the image is horizontal or nearly horizontal. However, unlike speech, text in scene images is essentially distributed in a two-dimensional space. For example, the characters can be scattered, arbitrarily oriented, or even arranged along curves, as shown in Fig. 1. In these cases, roughly encoding the images into one-dimensional sequences may lose key information or introduce undesired noise. [Shi et al.2016] tried to alleviate this problem by adopting a Spatial Transformer Network (STN) [Jaderberg et al.2015b] to rectify the shape of the text. Nevertheless, [Shi et al.2016] still used a sequence-based model, so the effect of the rectification is limited.

As discussed above, the limitations of sequence-based methods are mainly caused by the mismatch between the one-dimensional distribution of feature sequences and the two-dimensional distribution of text in scene images. To overcome these limitations, we tackle the scene text recognition problem from a new and natural perspective. We propose to directly predict the text recognition result in a two-dimensional space instead of a one-dimensional sequence. Inspired by FCN [Long, Shelhamer, and Darrell2015], a Character Attention Fully Convolutional Network (CA-FCN) is proposed to predict the characters at pixel level. The final word, as well as the location of each character, can then be obtained by a word formation module, as shown in Fig. 1(b). In this way, the procedures of compressing and slicing the features, which are widely used in sequence-based methods, are avoided. Profiting from the higher-dimensional perspective, the proposed method is much more robust than previous sequence-based methods in terms of text shapes, background noise, and imprecise localizations from the detection stage. Our method needs character-level annotations. However, these annotations require no manual labor because only public synthetic data is used for training, where the annotations are easy to obtain.

The contributions of this paper can be summarized as follows: (1) A totally different perspective for recognizing scene text is proposed. Different from recent works which treat text recognition as a sequence recognition problem in one-dimensional space, we propose to solve the problem in two-dimensional space. (2) We devise a character attention FCN for scene text recognition. To the best of our knowledge, this is the first use of a fully convolutional network [Long, Shelhamer, and Darrell2015] for the scene text recognition problem. It can deal with images of arbitrary height and width and can naturally recognize text in various shapes, including but not limited to oriented and curved text. (3) The proposed method achieves state-of-the-art performance on regular datasets and outperforms the existing methods by a large margin on irregular datasets. (4) We investigate the network’s robustness to imprecise localization in the text detection phase for the first time. This problem is important in real-world applications but was previously ignored. Experiments show that the proposed method is more robust to imprecise localization (see Sec. Ablation study).

Figure 1: Illustration of text recognition in one-dimensional and two-dimensional spaces. (a) shows the recognition procedures of sequence-based methods. (b) presents the proposed segmentation-based method. Different colors mean different character classes.

Related Work

Traditionally, scene text recognition systems first detect each character, using binarization or sliding-window operations, and then recognize these characters as a word. Binarization-based methods, such as Extremal Regions [Novikova et al.2012] and Niblack’s adaptive binarization [Bissacco et al.2013], find character pixels after binarization. However, text in natural scene images may have varying backgrounds, fonts, colors, or uneven illumination, which binarization-based methods can hardly handle. Sliding-window methods use a multi-scale sliding-window strategy to localize characters directly from the text image, such as Random Ferns [Wang, Babenko, and Belongie2011], Integer Programming [Smith, Feild, and Learned-Miller2011], and Convolutional Neural Networks (CNNs) [Jaderberg, Vedaldi, and Zisserman2014]. For the word recognition stage, common methods integrate contextual information with character classification scores, such as Pictorial Structure models, Bayesian inference, and Conditional Random Fields (CRFs), which are employed in [Wang, Babenko, and Belongie2011, Weinman, Learned-Miller, and Hanson2009, Mishra, Alahari, and Jawahar2012a, Mishra, Alahari, and Jawahar2012b, Shi et al.2013].

Inspired by speech recognition, recent works designed an encoder-decoder framework, where text in images is encoded into feature sequences and then decoded as characters. With the development of deep neural networks, convolutional features are extracted at the encoder stage, an RNN or CNN is applied to decode these features, and CTC is then used to form the final word. This framework was proposed by [Shi, Bai, and Yao2017]. Later they also developed an attention-based STN for rectifying text distortion, which is useful for recognizing curved scene text [Shi et al.2016]. Based on this framework, subsequent works [He et al.2016, Wu et al.2016, Liu, Chen, and Wong2018] also focus on irregular scene text.

The encoder-decoder framework has dominated current text recognition works. Many systems based on this framework have achieved state-of-the-art performance. However, text in scene images is distributed in a two-dimensional space, which is different from speech. The encoder-decoder framework treats it as a one-dimensional sequence, which causes problems. For example, compressing a text image into a feature sequence may lose key information and add extra noise, especially when the text is curved or seriously distorted.

A few works have tried to address some drawbacks of the encoder-decoder framework. For example, [Bai et al.2018] found that, under the attention-based encoder-decoder framework, the misalignment between the ground truth strings and the attention’s output sequences of probability distributions, which is caused by missing or superfluous characters, confuses and misleads the training process. To handle this problem, they proposed a method called edit probability, whose loss considers not only the probability distribution but also the possible occurrences of missing or superfluous characters.

[Cheng et al.2018] aimed to handle oriented text and observed that it is hard for the current encoder-decoder framework to capture the deep features of oriented text. To solve this problem, they encode the input image into four feature sequences of four directions to extract scene text features in those directions.

In this paper, we consider text recognition from the two-dimensional perspective and design a character attention FCN for the text recognition problem, which naturally avoids those disadvantages of the encoder-decoder framework. For example, the character attention FCN can extract deep features of irregular text directly, and its pixel-level output does not involve the misalignment problem. It obtains high accuracy on both regular and irregular text, and is also more robust to imprecise localization in the text detection phase.

Figure 2: Illustration of the CA-FCN. The blue feature maps on the left are inherited from the VGG-16 backbone; the yellow feature maps on the right are extra layers. H, W mean the height and width of the input image; C is the number of classes.



The whole architecture of our proposed method consists of two parts. The first part is a Character Attention FCN (CA-FCN), which predicts the characters at pixel level. The second part is a word formation module, which groups and arranges the pixels to form the final word result.

Character attention FCN

The architecture of CA-FCN is basically a fully convolutional network, as shown in Fig. 2. We use VGG-16 as the backbone while dropping the fully connected layers and removing its pooling layers of stage-4 and stage-5. Besides, a pyramid-like structure [Lin et al.2017] is adopted to handle varying scales of characters. The final output is a map with $C$ channels whose spatial size is determined by the height $H$ and width $W$ of the input image, where $C$ is the number of classes including character categories and background. CA-FCN predicts characters in a two-dimensional space and thus can handle text of various shapes.

Character attention module

The attention module plays an important role in our network. Natural scene text recognition suffers from complex backgrounds, shadows, irrelevant symbols and so on. Moreover, characters in natural images are usually crowded and can hardly be separated. To deal with those problems, inspired by [Wang et al.2017], we propose a character attention module to highlight the foreground characters and weaken the background, as well as to separate adjacent characters, as illustrated in Fig. 2. An attention module is appended to each output layer of VGG-16. The low-level attention modules mainly focus on appearance, such as edge, color, and texture, while the high-level modules can extract more semantic information.

The character attention module takes an input feature map $F_i$ and produces an output feature map $F_o$ by element-wise multiplication ($\otimes$) with an attention map $A$. The attention map is generated by two convolutional layers and a two-class (characters and background) soft-max function, where 0 represents background and 1 indicates characters. The attention map is broadcast to the same shape as $F_i$ to achieve element-wise multiplication. Compared with [Wang et al.2017], our character attention module uses a simpler network structure, profiting from the character supervision. The effectiveness of the character attention module is discussed in Sec. Ablation study by experiments.
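The broadcasting step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `character_attention` is hypothetical, the two-class scores are assumed to come from the two convolutional layers, and the residual form `feat * (1 + att)` is an assumption borrowed from the residual attention idea of [Wang et al.2017], not a formula confirmed by the text here.

```python
import numpy as np

def character_attention(feat, att_logits):
    """Apply a character attention map to a feature map.

    feat:       (H, W, K) input feature map.
    att_logits: (H, W, 2) two-class scores, channel 0 = background,
                channel 1 = characters (assumed layout).
    """
    # Numerically stable soft-max over the two classes.
    z = att_logits - att_logits.max(axis=-1, keepdims=True)
    prob = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Keep the character probability as a single-channel attention map.
    att = prob[..., 1:2]  # shape (H, W, 1)
    # Broadcast over the K channels. The "1 +" residual term is an assumption.
    return feat * (1.0 + att)
```

With this form, features at confident character pixels are roughly doubled while background features pass through almost unchanged.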

Figure 3: Illustration of our deformable convolution. (a) normal convolution; (b) deformable convolution with convolution. The green boxes indicate convolutional kernels. The yellow boxes mean the regions covered by receptive fields. The receptive fields out of the image are clipped.

Deformable convolution

As shown in Fig. 2, deformable convolution [Dai et al.2017] is applied in stage-4 and stage-5. The deformable convolution learns offsets of the convolution kernel, which provides more flexible receptive fields for the character prediction, and it is followed by a further convolution. Fig. 3 gives a toy comparison of normal convolution and this deformable variant, together with their receptive fields. The image in Fig. 3 is an expanded text image in which more background is included. Most of the training images are cropped with tight bounding boxes, so the fixed receptive field of a normal convolution almost always contains a lot of character information, and the network tends to predict the extra background as a character. With the deformable convolution and the following convolution, the receptive field is more flexible and the extra background can be predicted correctly. Note that extra background is very common in real-world applications because detection results may be inaccurate. Thus, robustness on expanded text images is important. The effectiveness of the deformable convolution is discussed in Sec. Ablation study by experiments.


Label generation

Let $b = (x_{min}, y_{min}, x_{max}, y_{max})$ be the original bounding box of a character, expressed as the minimum axis-aligned rectangle that covers the character. The ground truth character region is obtained by shrinking $b$ with a shrink ratio $r$. We shrink the character regions because adjacent characters tend to overlap without shrinking. The shrink process reduces the difficulty of the word formation. Separate values of $r$ are used for the attention supervision and the final output supervision.
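A minimal sketch of the shrinking step, assuming the box is shrunk about its center so that width and height are both scaled by the shrink ratio. The function name `shrink_box` and the ratio used in the usage line are hypothetical; the ratios actually used for the two supervisions are not reproduced here.

```python
def shrink_box(box, r):
    """Shrink an axis-aligned box about its center by ratio r (0 < r <= 1).

    box: (x_min, y_min, x_max, y_max) of the original character box.
    Returns the shrunk box in the same format.
    """
    x_min, y_min, x_max, y_max = box
    w = x_max - x_min
    h = y_max - y_min
    # The center stays fixed; only the extent is scaled.
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    return (cx - w * r / 2.0, cy - h * r / 2.0,
            cx + w * r / 2.0, cy + h * r / 2.0)

# Illustrative ratio only (an assumption, not the paper's setting).
print(shrink_box((0, 0, 10, 20), 0.5))  # (2.5, 5.0, 7.5, 15.0)
```

A smaller ratio leaves wider gaps between neighbouring character regions, which is what makes the later grouping step easier.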

Figure 4: Illustration of ground truth generation. (a) Original bounding boxes; (b) Ground truth for character attention; (c) Ground truth for character prediction, where different colors represent different character classes.

Loss function

The loss function is a weighted sum of the character prediction loss function $L_p$ and the character attention loss functions $L_a^s$, where $s$ indicates the index of the stages, as shown in Fig. 2, and the weight $\alpha$ is set empirically.
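One form consistent with this description is written out below. It is a reconstruction under assumptions: the symbols $L_p$, $L_a^s$ and $\alpha$ are names chosen here, a single weight is assumed to be shared by all stages, and the range of the stage index $s$ is left open because it is not recoverable from the text.

```latex
L = L_p + \alpha \sum_{s} L_a^{s}
```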

The final output of the CA-FCN is a map with $C$ channels whose spatial size is determined by the height $H$ and width $W$ of the input image; $C$ is the number of classes, including character classes and background. Let $X_{i,j,c}$ be one element of the output map, where $i$ and $j$ index the spatial position and $c \in \{0, 1, \dots, C-1\}$ indexes the class, and let $Y_{i,j} \in \{0, 1, \dots, C-1\}$ be the corresponding class label. The prediction loss $L_p$ is computed per pixel from $X$ and $Y$, with each pixel weighted by $w_{i,j}$. The weight depends on $N_{neg}$, the number of background pixels.

The character attention loss function $L_a^s$ is a binary cross-entropy loss function which takes all character labels as 1 and the background label as 0. It is computed over the $H_s \times W_s$ positions of the feature map in the corresponding stage, where $H_s$ and $W_s$ are the height and width of that feature map.
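The two losses can be sketched in NumPy as follows. Several choices here are assumptions rather than the exact formulation: the prediction loss is taken to be a soft-max cross-entropy, class 0 is taken to be background, character pixels are weighted by the ratio of background to character pixels while background pixels get weight 1, and both losses are normalised by a plain mean. All function names are hypothetical.

```python
import numpy as np

def log_softmax(x):
    # Stable log soft-max over the last (class) axis.
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def pixel_weights(labels):
    # Assumed scheme: background weight 1, character weight N_neg / N_pos.
    n_neg = int((labels == 0).sum())
    n_pos = labels.size - n_neg
    w = np.ones(labels.shape, dtype=float)
    if n_pos > 0:
        w[labels > 0] = n_neg / n_pos
    return w

def prediction_loss(logits, labels):
    """logits: (H, W, C) scores; labels: (H, W) int class ids, 0 = background."""
    logp = log_softmax(logits)
    picked = np.take_along_axis(logp, labels[..., None], axis=-1)[..., 0]
    return float(-(pixel_weights(labels) * picked).mean())

def attention_loss(att_logits, labels):
    """att_logits: (H, W, 2); characters are labelled 1 and background 0."""
    binary = (labels > 0).astype(int)
    logp = log_softmax(att_logits)
    picked = np.take_along_axis(logp, binary[..., None], axis=-1)[..., 0]
    return float(-picked.mean())
```

In a full training setup the attention labels would first be resized to the feature map resolution of each stage before `attention_loss` is evaluated; that step is omitted here.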

Word formation module

The word formation module converts the two-dimensional character maps predicted by CA-FCN into a character sequence. As shown in Fig. 5, we first transform the character prediction map into a binary map with a threshold to extract the corresponding character regions; then, we calculate the average value of each region for every class and assign the class with the largest average value to the region; finally, the word is formed by sorting the regions from left to right. In this way, both the word and the location of each character are produced. The word formation module assumes that characters are roughly ordered from left to right, which may not hold in certain scenarios. However, if necessary, a learnable component can be plugged into CA-FCN. The word formation module is simple yet effective, with only one hyper-parameter (the threshold used to form the binary map), which is fixed for all experiments.
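The three steps above (threshold, per-region class vote, left-to-right sort) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: channel 0 is assumed to be background, the binary map is assumed to threshold the foreground probability `1 - prob[..., 0]`, regions are 4-connected components, and the default `threshold=0.5` and the name `form_word` are hypothetical.

```python
from collections import deque

import numpy as np

def form_word(prob, charset, threshold=0.5):
    """prob: (H, W, C) per-pixel class probabilities, channel 0 = background.
    charset: string with one character per channel 1..C-1.
    Returns (word, boxes), boxes as (x_min, y_min, x_max, y_max)."""
    fg = (1.0 - prob[..., 0]) > threshold  # step 1: binary character map
    h, w = fg.shape
    seen = np.zeros((h, w), dtype=bool)
    regions = []
    for y in range(h):
        for x in range(w):
            if not fg[y, x] or seen[y, x]:
                continue
            # Flood fill with 4-connectivity to collect one region.
            queue, pixels = deque([(y, x)]), []
            seen[y, x] = True
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and fg[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            ys, xs = zip(*pixels)
            # Step 2: average each character class over the region, take the best.
            mean_scores = prob[list(ys), list(xs), 1:].mean(axis=0)
            char = charset[int(mean_scores.argmax())]
            box = (min(xs), min(ys), max(xs), max(ys))
            regions.append((float(np.mean(xs)), char, box))
    # Step 3: order regions by horizontal center, left to right.
    regions.sort(key=lambda r: r[0])
    return "".join(r[1] for r in regions), [r[2] for r in regions]
```

Sorting by horizontal center is the simplest reading of "from left to right"; strongly vertical or circular text would need a different ordering rule.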

Figure 5: Illustration of the word formation module.
Figure 6: Visualization of character prediction maps on IIIT and CUTE. The character prediction map generated by the CA-FCN is visualized with colors.



Our proposed CA-FCN is purely trained on the synthetic datasets without real-world images. The trained model, without further fine-tuning, was evaluated on 4 benchmarks including regular and irregular text datasets.

SynthText is a synthetic text dataset proposed in [Gupta, Vedaldi, and Zisserman2016]. It contains 800,000 training images which are aimed at text detection. We crop them based on their word bounding boxes, which yields about 7 million images for text recognition. These images come with character-level annotations.

IIIT5k-Words (IIIT) [Mishra, Alahari, and Jawahar2012b] consists of 3000 test images collected from the web. It provides two lexicons for each image in the dataset, which contain 50 words and 1000 words respectively.

Street View Text (SVT) [Wang, Babenko, and Belongie2011] comes from the Google Street View. The test set consists of 647 images. It is challenging due to its low resolution and noises. A 50-word lexicon is given for each image.

ICDAR 2013 (IC13) [Karatzas et al.2013] contains 1015 images and no lexicon is provided. We remove images that contain non-alphanumeric characters or have fewer than three characters, following previous works.

CUTE [Risnumawan et al.2014] is a dataset consisting of 288 images with a lot of curved text. It is challenging because the shapes vary hugely. No lexicon is provided.

Implementation details


Since our network is fully convolutional, there is no restriction on the size of input images. We adopt multi-scale training to make our model more robust: the input images are randomly resized to one of three sizes. Besides, data augmentation is applied during training, including random rotation within a fixed angle range and random changes of hue, brightness, contrast and blur. We use Adam [Kingma and Ba2014] for optimization; the learning rate is decreased from its initial value at epoch 3 and again at epoch 4. The model is trained for about 5 epochs in total. The character classes cover the alphabet, digits, a special character which represents characters outside the alphabet and digits, and background.


At runtime, images are resized to $H_t \times W_t$, where $H_t$ is fixed and $W_t$ is calculated from the height $H$ and width $W$ of the original image.
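A natural way to compute the target width is to keep the aspect ratio of the original image. The sketch below assumes exactly that; the function name `target_size`, the rounding rule, and the height used in the usage line are assumptions, not the paper's settings.

```python
def target_size(h, w, target_h):
    """Return (target_h, target_w) that preserves the aspect ratio w / h."""
    # Round to the nearest pixel and never return a zero width.
    target_w = max(1, round(w * target_h / h))
    return target_h, target_w

# Illustrative height only (an assumption).
print(target_size(32, 100, 64))  # (64, 200)
```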

Speed is measured on the IC13 dataset with a batch size of 1, timing the CA-FCN and the word formation module separately per image. Higher speed can be achieved if the batch size increases. We test our method with a single Titan Xp GPU.

Method IIIT(50) IIIT(1k) IIIT(0) SVT(50) SVT(0) IC13(0) CUTE(0)
[Wang, Babenko, and Belongie2011] - - - 57.0 - - -
[Mishra, Alahari, and Jawahar2012a] 64.1 57.5 - 73.2 - - -
[Wang et al.2012] - - - 70.0 - - -
[Almazán et al.2014] 91.2 82.1 - 89.2 - - -
[Yao et al.2014] 80.2 69.3 - 75.9 - - -
[Rodríguez-Serrano, Gordo, and Perronnin2015] 76.1 57.4 - 70.0 - - -
[Jaderberg, Vedaldi, and Zisserman2014] - - - 86.1 - - -
[Su and Lu2014] - - - 83.0 - - -
[Gordo2015] 93.3 86.6 - 91.8 - - -
[Jaderberg et al.2016] 97.1 92.7 - 95.4 80.7 90.8 -
[Jaderberg et al.2015a] 95.5 89.6 - 93.2 71.7 81.8 -
[Shi, Bai, and Yao2017] 97.8 95.0 81.2 97.5 82.7 89.6 -
[Shi et al.2016] 96.2 93.8 81.9 95.5 81.9 88.6 59.2
[Lee and Osindero2016] 96.8 94.4 78.4 96.3 80.7 90.0 -
[Wang and Hu2017] 98.0 95.6 80.8 96.3 81.5 - -
[Yang et al.2017] 97.8 96.1 - 95.2 - - 69.3
[Cheng et al.2017] 99.3 97.5 87.4 97.1 85.9 93.3 -
[Cheng et al.2018] 99.6 98.1 87.0 96.0 82.8 - 76.8
[Bai et al.2018] 99.5 97.9 88.3 96.6 87.5 94.4 -
Ours 99.8 98.9 92.0 98.5 82.1 91.4 78.1
Ours+data 99.8 98.8 91.9 98.8 86.4 91.5 79.9
Table 1: Results across different methods and datasets. “50” and “1k” indicate the sizes of the lexicons. “0” means no lexicon. “data” indicates using extra synthetic data to fine-tune the model.

Performances on benchmarks

We evaluate our method on several benchmarks to demonstrate its advantages. Some results on IIIT and CUTE are visualized in Fig. 6. As can be seen, our proposed method can handle various shapes of text.

Quantitative results are listed in Tab. 1. Compared to previous methods, our proposed method achieves state-of-the-art performance on most of those benchmarks. More specifically, “Ours” outperforms the previous state-of-the-art by 3.7 percentage points on IIIT without lexicons (92.0 vs. 88.3). On the irregular text dataset CUTE, “Ours” achieves an improvement of 1.3 percentage points (78.1 vs. 76.8). Note that no extra training data for curved text is included to achieve this performance. Comparable results are also achieved on the other datasets, including SVT and IC13.

The training data of [Cheng et al.2017] consist of two synthetic datasets, Synth90k [Jaderberg et al.2014] and SynthText [Gupta, Vedaldi, and Zisserman2016]. The former is generated according to a large lexicon which contains the lexicons of SVT and ICDAR, while the latter uses a normal corpus, where the distribution of words is not balanced. To compare fairly with [Cheng et al.2017], we also generate 4 million extra synthetic images using the algorithm of SynthText with the lexicon used in Synth90k. As shown in Tab. 1, after fine-tuning with the extra data, “Ours+data” also outperforms [Cheng et al.2017] on SVT.

[Bai et al.2018] improves [Cheng et al.2017, Shi et al.2016] by solving their misalignment problem and achieves excellent results in regular text recognition. However, it may fail on irregular text benchmarks such as CUTE due to its one-dimensional perspective. Moreover, we argue that our method can be further improved if the idea of [Bai et al.2018] is well adapted to our word formation module. Nevertheless, our method outperforms [Bai et al.2018] on most of the benchmarks in Tab. 1, especially on IIIT and CUTE.

[Cheng et al.2018] focuses on dealing with arbitrarily-oriented text by adaptively introducing four one-dimensional feature sequences with different directions. Our method is superior in recognizing text of irregular shapes such as curved text. As shown in Tab. 1, “Ours” outperforms [Cheng et al.2018] on all benchmarks except SVT without a lexicon (82.1 vs. 82.8), where “Ours+data” reaches 86.4.

Figure 7: Visualization of the character prediction maps on expanded datasets. Red: wrong results; Green: correct results.
Methods IIIT(ac) IIIT-p(ac) IIIT-p(gap) IIIT-p(ratio) IIIT-r-p(ac) IIIT-r-p(gap) IIIT-r-p(ratio) IC13(ac) IC13-ex(ac) IC13-ex(gap) IC13-ex(ratio) IC13-r-ex(ac) IC13-r-ex(gap) IC13-r-ex(ratio)
CRNN 81.2 76.0 -5.2 6.4% 72.4 -8.8 10.8% 89.6 81.9 -7.7 8.6% 76.7 -12.9 14.4%
ACSM 85.4 79.1 -6.3 7.4% 74.9 -10.5 12.3% 88.0 81.2 -6.8 7.7% 70.0 -18.0 20.5%
baseline 90.5 87.0 -3.5 3.9% 85.7 -4.8 5.3% 90.5 83.2 -7.3 8.1% 82.3 -8.2 9.1%
baseline + attention 91.0 86.7 -4.3 4.7% 85.7 -5.3 5.8% 90.1 85.6 -4.5 5.0% 83.0 -7.1 7.9%
baseline + deform 91.4 87.6 -3.8 4.2% 86.7 -4.7 5.1% 91.1 87.4 -3.7 4.1% 84.2 -6.9 7.6%
baseline + attention + deform 92.0 89.3 -2.7 2.9% 87.6 -4.4 4.8% 91.4 87.2 -4.2 4.6% 83.8 -7.6 8.3%
Table 2: Experimental results on expanded datasets. “ac”: accuracy; “gap”: the accuracy difference relative to the original dataset; “ratio”: the decrease relative to the accuracy on the original dataset.

Ablation study

Scene text recognition usually follows scene text detection, whose results may not be as accurate as expected. Thus, the performance of text spotting systems in real-world applications is significantly affected by the robustness of text recognition algorithms on expanded images. We conduct experiments with expanded datasets to show the effect of text bounding box variance on recognition and to demonstrate the robustness of our method.

For the datasets which have the original background, such as IC13, we expand their bounding boxes and then crop them from the original images. If no extra background is provided like IIIT, padding by repeating the border pixels is applied to these images. The expanded datasets are described below:

IIIT-p: Padding the images in IIIT with extra height vertically and extra width horizontally by repeating the border pixels.

IIIT-r-p: Separately stretching the four vertexes of the images in IIIT by a random scale, bounded by a fraction of the height and width respectively. Border pixels are repeated to fill the quadrilateral images, and the images are then transformed back to axis-aligned rectangles.

IC13-ex: Expanding the bounding boxes of the images in IC13 to larger rectangles with extra height and width before cropping.

IC13-r-ex: Expanding the bounding boxes of the images in IC13 randomly, up to a maximum fraction of the width and height, to form expanded quadrilateral images. The pixels in the axis-aligned circumscribed rectangles of those quadrilaterals are then cropped.
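The simplest of these expansions, padding by repeating border pixels as in IIIT-p, can be sketched with `numpy.pad` in `edge` mode. The function name `pad_by_border`, the convention that each ratio is the total extra size split evenly between the two sides, and the ratios in the usage line are all assumptions for illustration.

```python
import numpy as np

def pad_by_border(image, ratio_h, ratio_w):
    """Pad an (H, W) or (H, W, 3) image by repeating its border pixels.

    ratio_h / ratio_w: total extra height / width as a fraction of the
    original size, split evenly between the two opposite sides.
    """
    h, w = image.shape[:2]
    ph = int(round(h * ratio_h / 2))
    pw = int(round(w * ratio_w / 2))
    # Pad only the two spatial axes; leave any channel axis untouched.
    pad = [(ph, ph), (pw, pw)] + [(0, 0)] * (image.ndim - 2)
    return np.pad(image, pad, mode="edge")

# Illustrative ratios only (assumptions).
padded = pad_by_border(np.zeros((4, 8, 3)), 0.5, 0.5)
print(padded.shape)  # (6, 12, 3)
```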

We compare our method with two representative sequence-based models, CRNN [Shi, Bai, and Yao2017] and the Attention Convolutional Sequence Model (ACSM) [Gao et al.2017]. The CRNN model is provided by its authors, and the model of [Gao et al.2017] is re-implemented by ourselves with the same training data as ours. Qualitative results of the three methods are visualized in Fig. 7. As can be observed, the sequence-based models usually predict extra characters if the images are expanded, while CA-FCN is stable and robust.

The quantitative results are listed in Tab. 2. Compared to the sequence-based models, our proposed method is more robust on these expanded datasets. For example, on the IIIT-p dataset, the gap ratio of CRNN is 6.4% while ours is only 2.9%. Note that even though our accuracy on the standard datasets is higher, our gaps are still much smaller than those of CRNN. As shown in Tab. 2, both the deformable module and the attention module can improve the performance, and the former also contributes to the robustness of the model. This indicates the effectiveness of the deformable convolution and the character attention module.
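The "gap" and "ratio" columns of Tab. 2 follow directly from the two accuracies, as the short check below illustrates. The function name `gap_and_ratio` is hypothetical; the input numbers are taken from the CRNN and full-model rows of Tab. 2.

```python
def gap_and_ratio(acc_original, acc_expanded):
    """Return (gap, ratio) as reported in Tab. 2, rounded to one decimal.

    gap:   accuracy on the expanded set minus accuracy on the original set.
    ratio: the drop as a percentage of the original accuracy.
    """
    gap = acc_expanded - acc_original
    ratio = -gap / acc_original * 100.0
    return round(gap, 1), round(ratio, 1)

print(gap_and_ratio(81.2, 76.0))  # CRNN on IIIT-p: (-5.2, 6.4)
print(gap_and_ratio(92.0, 89.3))  # full model on IIIT-p: (-2.7, 2.9)
```

Because the ratio is normalised by the original accuracy, it allows a fair comparison between models that start from different baselines.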

There are several possible reasons why our proposed method is more robust than sequence-based models on expanded images. Sequence-based models take a one-dimensional perspective, which makes it hard to tolerate extra background because background noise is easily encoded into the feature sequence. In contrast, our method predicts characters in a two-dimensional space, where both characters and background are prediction targets, so the extra background is less likely to mislead the prediction of the characters. In addition, our model is based on FCN, which is translation invariant.


In this paper, we have presented a method called Character Attention FCN (CA-FCN) for scene text recognition, which models the problem in a two-dimensional fashion. By performing character classification at each pixel location, the algorithm can effectively recognize irregular as well as regular text instances. Experiments show that the proposed model outperforms existing methods on datasets with regular and irregular text. We also analyzed the impact of imprecise text localization on the performance of text recognition algorithms, and showed that our method is much more robust. For future research, we will make the word formation module learnable and build an end-to-end text spotting system.


  • [Almazán et al.2014] Almazán, J.; Gordo, A.; Fornés, A.; and Valveny, E. 2014. Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell. 36(12):2552–2566.
  • [Bai et al.2018] Bai, F.; Cheng, Z.; Niu, Y.; Pu, S.; and Zhou, S. 2018. Edit probability for scene text recognition. In Proc. CVPR.
  • [Bissacco et al.2013] Bissacco, A.; Cummins, M.; Netzer, Y.; and Neven, H. 2013. PhotoOCR: Reading text in uncontrolled conditions. In Proc. ICCV.
  • [Cheng et al.2017] Cheng, Z.; Bai, F.; Xu, Y.; Zheng, G.; Pu, S.; and Zhou, S. 2017. Focusing attention: Towards accurate text recognition in natural images. In Proc. ICCV, 5086–5094.
  • [Cheng et al.2018] Cheng, Z.; Xu, Y.; Bai, F.; Niu, Y.; Pu, S.; and Zhou, S. 2018. AON: Towards arbitrarily-oriented text recognition. In Proc. CVPR, 5571–5579.
  • [Dai et al.2017] Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; and Wei, Y. 2017. Deformable convolutional networks. In Proc. ICCV, 764–773.
  • [Gao et al.2017] Gao, Y.; Chen, Y.; Wang, J.; and Lu, H. 2017. Reading scene text with attention convolutional sequence modeling. CoRR abs/1709.04303.
  • [Gordo2015] Gordo, A. 2015. Supervised mid-level features for word image representation. In Proc. CVPR.
  • [Graves et al.2006] Graves, A.; Fernández, S.; Gomez, F. J.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML, 369–376.
  • [Gupta, Vedaldi, and Zisserman2016] Gupta, A.; Vedaldi, A.; and Zisserman, A. 2016. Synthetic data for text localisation in natural images. In Proc. CVPR.
  • [He et al.2016] He, P.; Huang, W.; Qiao, Y.; Loy, C. C.; and Tang, X. 2016. Reading scene text in deep convolutional sequences. In Proc. AAAI.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • [Jaderberg et al.2014] Jaderberg, M.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Synthetic data and artificial neural networks for natural scene text recognition. CoRR abs/1406.2227.
  • [Jaderberg et al.2015a] Jaderberg, M.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2015a. Deep structured output learning for unconstrained text recognition. In Proc. ICLR.
  • [Jaderberg et al.2015b] Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015b. Spatial transformer networks. In Proc. NIPS, 2017–2025.
  • [Jaderberg et al.2016] Jaderberg, M.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2016. Reading text in the wild with convolutional neural networks. IJCV 116(1):1–20.
  • [Jaderberg, Vedaldi, and Zisserman2014] Jaderberg, M.; Vedaldi, A.; and Zisserman, A. 2014. Deep features for text spotting. In Proc. ECCV.
  • [Karatzas et al.2013] Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; i Bigorda, L. G.; Mestre, S. R.; Mas, J.; Mota, D. F.; Almazan, J. A.; and de las Heras, L. P. 2013. ICDAR 2013 robust reading competition. In Proc. ICDAR, 1484–1493.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
  • [Lee and Osindero2016] Lee, C., and Osindero, S. 2016. Recursive recurrent nets with attention modeling for OCR in the wild. In Proc. CVPR, 2231–2239.
  • [Lin et al.2017] Lin, T.; Dollár, P.; Girshick, R. B.; He, K.; Hariharan, B.; and Belongie, S. J. 2017. Feature pyramid networks for object detection. In Proc. CVPR, 936–944.
  • [Liu, Chen, and Wong2018] Liu, W.; Chen, C.; and Wong, K. K. 2018. Char-net: A character-aware neural network for distorted scene text recognition. In Proc. AAAI.
  • [Long, Shelhamer, and Darrell2015] Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 3431–3440.
  • [Mishra, Alahari, and Jawahar2012a] Mishra, A.; Alahari, K.; and Jawahar, C. V. 2012a. Scene text recognition using higher order language priors. In Proc. BMVC.
  • [Mishra, Alahari, and Jawahar2012b] Mishra, A.; Alahari, K.; and Jawahar, C. V. 2012b. Top-down and bottom-up cues for scene text recognition. In Proc. CVPR.
  • [Novikova et al.2012] Novikova, T.; Barinova, O.; Kohli, P.; and Lempitsky, V. S. 2012. Large-lexicon attribute-consistent text recognition in natural images. In Proc. ECCV.
  • [Risnumawan et al.2014] Risnumawan, A.; Shivakumara, P.; Chan, C. S.; and Tan, C. L. 2014. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl. 41(18):8027–8048.
  • [Rodríguez-Serrano, Gordo, and Perronnin2015] Rodríguez-Serrano, J. A.; Gordo, A.; and Perronnin, F. 2015. Label embedding: A frugal baseline for text recognition. IJCV 113(3):193–207.
  • [Rong, Yi, and Tian2016] Rong, X.; Yi, C.; and Tian, Y. 2016. Recognizing text-based traffic guide panels with cascaded localization network. In Proc. ECCV Workshops, 109–121.
  • [Shi, Bai, and Yao2017] Shi, B.; Bai, X.; and Yao, C. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11):2298–2304.
  • [Shi et al.2013] Shi, C.; Wang, C.; Xiao, B.; Zhang, Y.; Gao, S.; and Zhang, Z. 2013. Scene text recognition using part-based tree-structured character detection. In Proc. CVPR.
  • [Shi et al.2016] Shi, B.; Wang, X.; Lyu, P.; Yao, C.; and Bai, X. 2016. Robust scene text recognition with automatic rectification. In Proc. CVPR.
  • [Smith, Feild, and Learned-Miller2011] Smith, D. L.; Feild, J. L.; and Learned-Miller, E. G. 2011. Enforcing similarity constraints with integer programming for better scene text recognition. In Proc. CVPR.
  • [Su and Lu2014] Su, B., and Lu, S. 2014. Accurate scene text recognition based on recurrent neural network. In Proc. ACCV.
  • [Wang and Hu2017] Wang, J., and Hu, X. 2017. Gated recurrent convolution neural network for OCR. In Proc. NIPS, 334–343.
  • [Wang, Babenko, and Belongie2011] Wang, K.; Babenko, B.; and Belongie, S. J. 2011. End-to-end scene text recognition. In Proc. ICCV.
  • [Wang et al.2012] Wang, T.; Wu, D. J.; Coates, A.; and Ng, A. Y. 2012. End-to-end text recognition with convolutional neural networks. In Proc. ICPR.
  • [Wang et al.2017] Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; and Tang, X. 2017. Residual attention network for image classification. In Proc. CVPR, 6450–6458.
  • [Weinman, Learned-Miller, and Hanson2009] Weinman, J. J.; Learned-Miller, E. G.; and Hanson, A. R. 2009. Scene text recognition using similarity and a lexicon with sparse belief propagation. IEEE Trans. Pattern Anal. Mach. Intell. 31(10):1733–1746.
  • [Wu et al.2016] Wu, R.; Yang, S.; Leng, D.; Luo, Z.; and Wang, Y. 2016. Random projected convolutional feature for scene text recognition. In Proc. ICFHR.
  • [Yang et al.2017] Yang, X.; He, D.; Zhou, Z.; Kifer, D.; and Giles, C. L. 2017. Learning to read irregular text with attention mechanisms. In Proc. IJCAI, 3280–3286.
  • [Yao et al.2014] Yao, C.; Bai, X.; Shi, B.; and Liu, W. 2014. Strokelets: A learned multi-scale representation for scene text recognition. In Proc. CVPR.