TextSR: Content-Aware Text Super-Resolution Guided by Recognition

09/16/2019 · Wenjia Wang et al.

Scene text recognition has witnessed rapid development with the advance of convolutional neural networks. Nonetheless, most of the previous methods may not work well in recognizing low-resolution text, which is often seen in natural scene images. An intuitive solution is to introduce super-resolution techniques as pre-processing. However, conventional super-resolution methods in the literature mainly focus on reconstructing the detailed texture of natural images, which typically does not work well for text due to its unique characteristics. To tackle these problems, in this work we propose a content-aware text super-resolution network that generates the information desired for text recognition. In particular, we design an end-to-end network that can perform super-resolution and text recognition simultaneously. Different from previous super-resolution methods, we use the loss of text recognition as a Text Perceptual Loss to guide the training of the super-resolution network, so that it pays more attention to the text content rather than the irrelevant background area. Extensive experiments on several challenging benchmarks demonstrate the effectiveness of our proposed method in restoring a sharp high-resolution image from a small blurred one, and show that it clearly boosts the performance of the text recognizer. To our knowledge, this is the first work focusing on text super-resolution. Code will be released at https://github.com/xieenze/TextSR.


Introduction

Scene text recognition is a fundamental and important task in computer vision, since it is usually a key step towards many downstream text-related applications, including document retrieval, card recognition, and many other Natural Language Processing (NLP) related applications. Text recognition has achieved remarkable success due to the development of Convolutional Neural Networks (CNNs) and text detection [46, 44, 45]. Many accurate and efficient methods have been proposed for constrained scenarios (e.g., text in scanned copies or web images). Recent works focus on text in natural scenes, which is much more challenging due to the high diversity of text in blur, orientation, shape and resolution. A thorough survey of recent advances in text recognition can be found in [28].

Modern text recognizers have achieved impressive results on clear text images. However, their performance drops sharply when recognizing text blurred by low resolution or camera shake. The main difficulty in recognizing blurred text is the lack of detailed information. Super-resolution [24] is a plausible way to tackle this problem. However, traditional super-resolution methods aim to reconstruct the detailed texture of natural images, which is not applicable to blurred text. Since scene texts appear with arbitrary poses, illumination and blur, super-resolution on scene text images is much more challenging than on natural images. Therefore, we need a content-aware text super-resolution network to generate clear, sharp and identifiable text images for recognition.

To deal with these problems, we propose a content-aware text super-resolution network (TextSR), which combines a super-resolution network and a text recognition network. TextSR is an end-to-end network in which the results of the text recognition network feed back to guide the training of the super-resolution network. Under the guidance of the text recognition network, the super-resolution network focuses on refining the text region and thus generates clear, sharp and identifiable text images. As shown in Fig. 1, there are three main components in our network: a generator, a discriminator and a text recognizer. In the generator, a super-resolution network is used to up-sample small blurred text to a fine scale for recognition. Compared with bilinear and bicubic interpolation, the generator can partly reduce artifacts and improve the quality of up-sampled images with large up-scaling factors. In the discriminator, a classification network is applied to distinguish the high-resolution image from the generated super-resolution image for adversarial training. Nevertheless, even with such a sophisticated generator and discriminator, up-sampled images are unsatisfactory, usually blurred and lacking fine details, due to the lack of text content awareness. Therefore, we introduce a novel Text Perceptual Loss (TPL) to make the generator produce identifiable and clear text images. The TPL is provided by the text recognizer to guide the generator toward producing texts that are easier to recognize.

The contributions of this work are therefore three-fold: (1) We introduce a super-resolution method to facilitate scene text recognition, especially for small blurred text. (2) We propose a novel Text Perceptual Loss to make the generator aware of the text content and produce recognition-friendly information. (3) We demonstrate the effectiveness of our proposed method on several challenging public benchmarks, where it outperforms strong baseline approaches.

Figure 1: The overall architecture of TextSR. It contains three components. The first row is the generator for super-resolution. The second row is the discriminator that distinguishes whether an image is high-resolution or generated super-resolution. The last row is the recognizer and the TPL, which help optimize the generator to produce content-aware text images.

Related Work

Super-Resolution Super-resolution aims to output a plausible high-resolution image that is consistent with a given low-resolution image. Classical approaches, such as bilinear, bicubic or hand-designed filtering, leverage the insight that neighbouring pixels usually exhibit similar colours, and generate the output by interpolating between the colours of neighbouring pixels according to a predefined formula. Data-driven approaches make use of training data to assist with generating high-resolution predictions and directly copy pixels from the most similar patches in the training dataset [7]. In contrast, optimization-based approaches formulate the problem as a sparse recovery problem, where the dictionary is learned from data [12]. In the deep learning era, super-resolution is simply treated as a regression problem, where the input is the low-resolution image and the target output is the high-resolution image [5]. A deep neural network is trained on the input and target output pairs to minimize some distance metric between the prediction and the ground truth. Subsequently, it was shown that enabling the network to learn the up-scaling filters directly can further increase performance in terms of both accuracy and speed [6].

Generative Adversarial Network Generative adversarial networks (GANs) [8] introduce an adversarial loss to generate realistic-looking images from random noise and achieve impressive results in many generation tasks, such as image generation, image editing, representation learning, image annotation and character transfer [15]. GANs were first applied to super-resolution by SRGAN [24], which obtains promising results on natural images. However, the high-resolution images directly generated by SRGAN lack the fine details desired for the recognition task. Therefore, we put forward a content-aware super-resolution pipeline to recover information that is friendly for text recognition.

Text Recognition The task of text recognition is to recognize character sequences from cropped word image patches. With the rapid development of deep learning, a large number of effective frameworks have emerged for text recognition. Early work [20] adopts either a bottom-up fashion, which first detects individual characters and then integrates them into a word, or a top-down manner, which treats the word image patch as a whole and recognizes it as a multi-class image classification problem. Considering that scene text generally appears as a character sequence, subsequent works regard it as a sequence recognition problem and employ Recurrent Neural Networks (RNNs) to model the sequential features. For instance, the Connectionist Temporal Classification (CTC) loss [10] is often combined with the RNN outputs to calculate the conditional probability between the predicted sequences and the target [26, 27]. Recently, an increasing number of recognition approaches based on the attention mechanism have achieved significant improvements [4, 30]. Among them, ASTER [38] rectifies oriented or curved text with a Spatial Transformer Network (STN) [19] and then performs recognition using an attentional sequence-to-sequence model. In this work, we choose ASTER as our baseline.

Method

In this section, we present our proposed method in detail. First, we give a brief description of SRGAN. Then, we present the overall architecture of TextSR, as shown in Fig. 1, together with our novel Text Perceptual Loss.

SRGAN

The goal of super-resolution is to train a function that estimates, for a given low-resolution input image, its corresponding high-resolution counterpart. To achieve this, a generator network $G_{\theta_G}$, parametrized by $\theta_G$, is optimized with a super-resolution specific loss function $l^{SR}$ on training images $I^{LR}_n$ and corresponding $I^{HR}_n$, $n = 1, \dots, N$:

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N} l^{SR}\big(G_{\theta_G}(I^{LR}_n),\, I^{HR}_n\big). \qquad (1)$$

SRGAN defines $l^{SR}$ as the weighted sum of a content loss and an adversarial loss. Instead of employing a pixel-wise loss, it uses the perceptual loss [21] as the content loss. The adversarial loss is implemented by a discriminator network $D_{\theta_D}$, parametrized by $\theta_D$, which is optimized in an alternating manner along with $G_{\theta_G}$ to solve the adversarial min-max problem:

$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I^{HR} \sim p_{\mathrm{train}}(I^{HR})}\big[\log D_{\theta_D}(I^{HR})\big] + \mathbb{E}_{I^{LR} \sim p_{G}(I^{LR})}\big[\log\big(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))\big)\big]. \qquad (2)$$

The traditional way of leveraging SRGAN to help the task of text recognition is to separately train a generator that transforms the low-resolution image into a high-resolution one under the guidance of the adversarial loss. However, such a conventional generator may focus on reconstructing the detailed texture of natural images, which is not applicable to text content. A more effective text super-resolution network needs a content-aware generator that produces clear, sharp and identifiable text images for recognition, rather than more details of the irrelevant background area. Therefore, we propose our method, TextSR.
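To make the alternating optimization in Eq. (2) concrete, the following is a minimal PyTorch-style sketch of one adversarial training step. The `generator` and `discriminator` modules, their optimizers, and the binary cross-entropy formulation are assumptions for illustration, not the authors' released code; the content-loss term of $l^{SR}$ is omitted here.

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, opt_g, opt_d, lr_imgs, hr_imgs):
    """One alternating update of D and G for the min-max game in Eq. (2).

    generator/discriminator are assumed nn.Module instances whose discriminator
    returns raw logits; opt_g/opt_d are their optimizers.
    """
    # --- Discriminator update: push D(HR) -> 1 and D(G(LR)) -> 0 ---
    sr_imgs = generator(lr_imgs).detach()          # stop gradients flowing into G
    d_real = discriminator(hr_imgs)
    d_fake = discriminator(sr_imgs)
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Generator update: push D(G(LR)) -> 1 (adversarial part of l^SR only) ---
    sr_imgs = generator(lr_imgs)
    d_fake = discriminator(sr_imgs)
    loss_g_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g_adv.backward(); opt_g.step()
    return loss_d.item(), loss_g_adv.item()
```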

Network Architecture

First, our generator network up-samples a low-resolution image and outputs a super-resolution image. Second, a discriminator is used to distinguish whether an image is SR or HR. Furthermore, we add a text recognition branch to guide the generator to produce content-aware images.

Generator network. As shown in Fig. 1, we adopt a deep CNN architecture that has shown effectiveness for image super-resolution in [24]. There are two deconvolutional layers in the network, and each consists of learned kernels that up-sample its input by a factor of 2. We use batch normalization and the rectified linear unit (ReLU) activation after each convolutional layer except the last one. The generator network therefore up-samples a low-resolution image and outputs a 4× super-resolution image.
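A minimal PyTorch sketch of such a generator is given below; the residual body, channel widths and kernel sizes are our assumptions in the spirit of SRGAN, not the exact architecture used in the paper.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.body(x)

class SRGenerator(nn.Module):
    """LR text image -> 4x SR image via two learned 2x up-sampling stages."""
    def __init__(self, num_blocks=8, channels=64):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, channels, 9, padding=4), nn.ReLU(inplace=True))
        self.body = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        # Two deconvolutional (transposed conv) stages, each up-sampling by 2x.
        self.up = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.tail = nn.Conv2d(channels, 3, 9, padding=4)  # no BN/ReLU after the last layer

    def forward(self, x):
        return self.tail(self.up(self.body(self.head(x))))
```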

Discriminator network. We employ the same architecture as in [24] for our backbone network in the discriminator, as shown in Fig. 1. The input is the super-resolution image or HR image, and the output is the probability of the input being an HR image.

Recognition Network. To show the effectiveness of our method, we adopt ASTER [38] as our base recognition network. ASTER is a state-of-the-art text recognizer composed of a text rectification network and a text recognition network. The text rectification network rectifies the character arrangement in the input image using the Thin-Plate-Spline transformation [3], a non-rigid deformation that rearranges irregular text into a horizontal layout.

Based on the rectified text image, the text recognition network directly predicts the sequence of characters through sequence-to-sequence translation. The text recognition network consists of two parts: an encoder and a decoder. The encoder extracts features from the rectified text image. It consists of residual blocks [14], after which the feature map is converted into a feature sequence by splitting it along its row axis. Two layers of Bidirectional LSTM (BLSTM) [11] capture long-range dependencies in both directions. Each layer consists of a pair of LSTMs whose outputs are concatenated and linearly projected before entering the next layer. The decoder is an attentional LSTM that recognizes 94 character classes, including digits, upper-case and lower-case letters, and 32 ASCII punctuation marks. At each time step, it predicts either a character or an end-of-sequence symbol (EOS). The loss of text recognition is defined as:

$$\mathcal{L}_{R} = -\sum_{t=1}^{T} \log p\big(y_t \mid I^{SR}\big), \qquad (3)$$

where $T$ is the length of the symbol sequence, $I^{SR}$ is the generated super-resolution image, $y = (y_1, \dots, y_T)$ is the ground-truth text represented by a character sequence, and $p(y_t \mid I^{SR})$ is the corresponding output probability of the decoder at step $t$.
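For reference, a minimal sketch of how such a sequence cross-entropy could be computed from decoder logits is shown below; the `(B, T, C)` logits interface and the padding convention are assumptions for illustration, not ASTER's actual API.

```python
import torch
import torch.nn.functional as F

def recognition_loss(decoder_logits, target_ids, pad_id=0):
    """Sequence cross-entropy corresponding to Eq. (3).

    decoder_logits: (B, T, C) per-step scores over character classes (+ EOS).
    target_ids:     (B, T) ground-truth character indices, padded with pad_id.
    """
    B, T, C = decoder_logits.shape
    # Average of -log p(y_t | I^SR) over all non-padding steps.
    return F.cross_entropy(decoder_logits.reshape(B * T, C),
                           target_ids.reshape(B * T),
                           ignore_index=pad_id)
```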

Text Perceptual Loss

Training the generator network with only the adversarial loss from the discriminator leads the generator to focus on reconstructing the detailed texture of images without understanding the content of the text image. This works for natural images, but it is not practical for text images, because content matters more than texture in text images. Here, we introduce the Text Perceptual Loss (TPL), inspired by the perceptual loss that is widely used in super-resolution and other low-level vision tasks. The perceptual loss uses a pre-trained VGG [39] network and measures the similarity between the feature maps of super-resolution images and original images. The perceptual loss can make the network understand the general content of an image, since VGG is pre-trained on ImageNet, which contains 1000 kinds of objects.
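For reference, the standard VGG-based perceptual loss can be sketched as follows; the choice of torchvision's VGG-19 and of the feature layer is an assumption for illustration, not a detail specified in the paper.

```python
import torch.nn as nn
import torchvision.models as models

class VGGPerceptualLoss(nn.Module):
    """L2 distance between fixed VGG-19 features of SR and HR images."""
    def __init__(self, num_layers=36):  # up to relu5_4 in VGG-19 (layer choice assumed)
        super().__init__()
        vgg = models.vgg19(pretrained=True).features[:num_layers].eval()
        for p in vgg.parameters():
            p.requires_grad = False      # VGG stays frozen; only the generator is trained
        self.vgg = vgg
        self.criterion = nn.MSELoss()

    def forward(self, sr, hr):
        return self.criterion(self.vgg(sr), self.vgg(hr))
```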

To this end, we carefully design the TPL by back-propagating the loss of text recognition into the training of the generator network. Specifically, the super-resolution images produced by the generator are directly fed into the text recognizer. As a result, the generator is trained to minimize the recognition loss on training images $I^{LR}_n$ and the corresponding text character sequences $y_n$:

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_{R}\big(G_{\theta_G}(I^{LR}_n),\, y_n\big). \qquad (4)$$

The TPL helps the generator produce more content-realistic images, which are friendlier to the text recognizer. There are three ways to supervise the generator with the TPL:

  • End-to-end training of ASTER and the generator simultaneously from random initialization.

  • Training ASTER first, then end-to-end training of ASTER and the generator simultaneously.

  • Training ASTER first, then training the generator while freezing the parameters of ASTER.

Our experiments find that the three schemes result in very similar performance, so we only report the results of the third approach (sketched below).
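A minimal sketch of that third scheme is shown below, assuming a hypothetical `recognizer` that is already trained and returns per-step logits, and reusing the `recognition_loss` helper sketched earlier; only the generator receives gradients from the Text Perceptual Loss.

```python
def tpl_step(generator, recognizer, opt_g, lr_imgs, target_ids,
             recognition_loss, lambda_tpl=1.0):
    """Train the generator with the Text Perceptual Loss while ASTER is frozen."""
    recognizer.eval()
    for p in recognizer.parameters():
        p.requires_grad = False          # freeze the recognizer's weights

    sr_imgs = generator(lr_imgs)         # SR images remain in the autograd graph
    logits = recognizer(sr_imgs)         # (B, T, C) decoder scores
    loss_tpl = lambda_tpl * recognition_loss(logits, target_ids)

    opt_g.zero_grad()
    loss_tpl.backward()                  # gradients flow back into the generator only
    opt_g.step()
    return loss_tpl.item()
```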

During training, we prepare LR-HR image pairs and train the three sub-nets along the path shown by the blue lines in Fig. 1. During inference, we select the images whose size is smaller than 128×32, feed them into the generator, and finally recognize the restored images with ASTER. The inference path is shown by the red lines in Fig. 1.
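The inference rule can be sketched as follows; `generator` and `recognizer` are hypothetical callables, and the width/height comparison against 128×32 is our reading of the selection criterion.

```python
def recognize(image, generator, recognizer, max_w=128, max_h=32):
    """Super-resolve only small inputs, then run the recognizer (red path in Fig. 1)."""
    h, w = image.shape[-2:]                  # image: (C, H, W) tensor
    if w < max_w and h < max_h:              # small/blurred text: restore it first
        image = generator(image.unsqueeze(0)).squeeze(0)
    return recognizer(image.unsqueeze(0))    # predicted character sequence
```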

Figure 2: Super-resolution results of our TextSR compared with BICUBIC and SRGAN. The images generated by TextSR restore the barely recognizable inputs precisely in terms of detailed textures and shapes, which remarkably boosts recognition accuracy.

Experiments

We evaluate the proposed method on several challenging public benchmarks and compare its performance with other state-of-the-art methods.

Datasets

The proposed model is trained on a combination of a synthetic dataset and the training set of ICDAR, without fine-tuning on other datasets. The model is tested on 7 standard datasets to evaluate its general recognition performance.

Synth90k is the synthetic text dataset proposed in [16]. The dataset contains 9 million images. Words are rendered onto natural images with random transformations and effects. All of the images in this dataset are taken for training.

SynthText is the synthetic text dataset proposed in [13]. It is proposed for text detection. We crop the words using the groundtruth word bounding boxes.

IIIT5k-Words (IIIT5k) [31] contains 3000 test images. Each image is associated with a short, 50-word lexicon and a long, 1000-word lexicon.

Street View Text (SVT) [41] is collected from the Google Street View. The test set contains 647 images of cropped words. Each image is associated with a 50-word lexicon.

ICDAR 2003 (IC03) [29] contains 860 cropped word images after filtering, discarding words that contain non-alphanumeric characters or have fewer than three characters. Each image has a 50-word lexicon defined in [41].

ICDAR 2013 (IC13) [23] inherits most images from IC03 and extends it with new images. The dataset is filtered by removing words that contain non-alphanumeric characters. The dataset contains 1015 images. No lexicon is provided.

ICDAR 2015 Incidental Text (IC15) [22] contains a lot of irregular text. Testing images are obtained by cropping the words using the groundtruth word bounding boxes.

SVT-Perspective (SVTP) is proposed in [33] for evaluating the performance of recognizing perspective text. Images in SVTP are picked from the side-view images in Google Street View. The dataset consists of 639 cropped images for testing.

CUTE80 (CUTE) is proposed in [34] for the curved text. It contains 80 high-resolution images taken in natural scenes. We crop the annotated words and obtain a test set of 288 images.

Implementation Details

During training, we set the trade-off weights of all losses to 1. We use the Adam optimizer with momentum term 0.9. The generator network and discriminator network are trained from scratch; the weights in each layer are initialized from a zero-mean Gaussian distribution with standard deviation 0.02, and biases are initialized to 0. We use the ASTER recognizer from the open-source implementation at https://github.com/ayumiymk/aster.pytorch, whose performance is slightly different from the original paper.

The model is trained with batches of 256 examples for 50k iterations on 4 Tesla M40 GPUs. For the super-resolution task, we use SynthText as our training data and filter out the images whose size is smaller than 128×32; in this way we use only 1.29 million images from SynthText. The low-resolution images are 4× down-sampled from the original images. The learning rate is set to 1.0 initially and decayed to 0.1 and 0.01 at steps 30k and 40k, respectively.
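The weight initialization and learning-rate schedule above could look like the following sketch; the helper function and its interface are hypothetical and simply restate the listed hyper-parameters with standard PyTorch utilities.

```python
import torch
import torch.nn as nn

def build_optimizers(generator: nn.Module, discriminator: nn.Module):
    """Weight init and LR schedule following the settings listed above."""
    def init_weights(m):
        if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)   # zero-mean Gaussian, std 0.02
            if m.bias is not None:
                nn.init.zeros_(m.bias)                       # biases initialized to 0
    generator.apply(init_weights)
    discriminator.apply(init_weights)

    opt_g = torch.optim.Adam(generator.parameters(), lr=1.0, betas=(0.9, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1.0, betas=(0.9, 0.999))
    # Decay 1.0 -> 0.1 -> 0.01 at 30k and 40k iterations.
    sched_g = torch.optim.lr_scheduler.MultiStepLR(opt_g, milestones=[30000, 40000], gamma=0.1)
    sched_d = torch.optim.lr_scheduler.MultiStepLR(opt_d, milestones=[30000, 40000], gamma=0.1)
    return opt_g, opt_d, sched_g, sched_d
```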

Dataset   Metric   BICUBIC   SRGAN    TextSR   Improve
IC13      PSNR     18.65     23.77    25.28    +1.51
IC13      SSIM     0.7304    0.8505   0.8823   +0.0318
IC15      PSNR     19.56     24.40    25.13    +0.73
IC15      SSIM     0.7738    0.8884   0.8993   +0.0109
IC03      PSNR     18.61     23.42    24.35    +0.93
IC03      SSIM     0.7299    0.8490   0.8765   +0.0275
SVT       PSNR     21.51     26.39    27.74    +1.35
SVT       SSIM     0.7932    0.8948   0.9140   +0.0192
SVTP      PSNR     21.93     25.80    25.83    +0.03
SVTP      SSIM     0.8063    0.8866   0.8915   +0.0049
IIIT5K    PSNR     16.48     21.01    21.39    +0.38
IIIT5K    SSIM     0.7361    0.8480   0.8648   +0.0168
CUTE      PSNR     15.56     19.85    21.98    +2.13
CUTE      SSIM     0.6547    0.8013   0.8561   +0.0548
Table 1: Comparison of BICUBIC, SRGAN and TextSR (PSNR in dB, SSIM; the Improve column is TextSR minus SRGAN). Our TextSR clearly surpasses SRGAN in both PSNR and SSIM on all datasets.

Ablation Study

We first show the importance of content awareness for super-resolution image quality by comparing with directly using SRGAN [24]. Then, we compare our proposed method with the baseline text recognizer ASTER [38] to prove the effectiveness of applying our super-resolution strategy to the task of text recognition. Moreover, we visualize activation heatmaps of the generator to show its response on text areas.

Figure 3: The intermediate heatmap visualizations of SRGAN and TextSR on a street-view image. Our TextSR has a higher response than SRGAN in text areas.

Effect on super-resolution. To demonstrate the impact on super-resolution image quality, we carry out experiments on 7 datasets to evaluate SRGAN and TextSR. Details are shown in Table 1. Our TextSR clearly surpasses SRGAN in PSNR and SSIM on all datasets. Moreover, as seen in Fig. 2, TextSR indeed generates sharper high-resolution text images than SRGAN. We attribute this to the supervision of the TPL, under which the generator can indeed understand the content of the text images.

Effect on text recognition. To demonstrate the effect of the content-aware TPL, we set up three experiments as shown in Table 2. We down-sample the original images to five different scales and then use three ways to up-sample them. The first is the baseline without super-resolution (bicubic up-sampling), the second uses SRGAN, and the last is equipped with our content-aware TPL. Their performance is evaluated on the ICDAR 2013 test set. Although SRGAN can partly improve the performance compared to BICUBIC, our TextSR greatly boosts the performance on very tiny images. In particular, for 20×5 images, TextSR surpasses SRGAN by a remarkable 22.8%.

Size      128×32   64×16   32×8    24×6    20×5
BICUBIC   90.5     89.7    63.9    30.8    16.1
SRGAN     90.6     90.1    72.7    44.9    20.0
TextSR    91.3     90.5    83.1    62.6    42.8
Improve   +0.7     +0.4    +10.4   +17.7   +22.8
Table 2: Text recognition accuracy with bicubic up-sampling (no SR), SRGAN and TextSR on the IC13 dataset. As the input size decreases, TextSR impressively boosts performance on very tiny text images.

Moreover, we show the intermediate activation heatmaps, taken from the penultimate layer of the generator. To compare the response on text areas and irrelevant background more comprehensively, we feed a street-view picture into the generator and visualize its activation heatmap, as shown in Fig. 3. With the guidance of the TPL, TextSR responds more strongly on text areas than SRGAN, which explains why TextSR generates clearer, sharper and more identifiable text images.

Comparison to State-of-the-Art

Figure 4: Super-resolution results on scene text detection images without fine-tuning our TextSR. The first image is from the SVT detection dataset and the second image was captured by us on the street. Our TextSR has a strong ability to restore English text, but it fails to restore the Chinese text because the model has not been trained with Chinese data.

We compare the performance of our model with other state-of-the-art models. Some datasets provide lexicons for constraining recognition outputs. When a lexicon is given, we simply replace the predicted word with the nearest lexicon word under the metric of edit distance.
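As an illustration, this lexicon constraint can be sketched with a standard Levenshtein distance as below; it is a generic sketch, not the authors' implementation, and the case-folding choice is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def constrain_to_lexicon(prediction: str, lexicon: list) -> str:
    """Replace the prediction with the nearest lexicon word under edit distance."""
    return min(lexicon, key=lambda w: edit_distance(prediction.lower(), w.lower()))
```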

Method | ConvNet, Data | IIIT5K (0) | IIIT5K (50) | SVT (0) | IC03 (0) | IC13 (0) | IC15 (0) | SVTP (0) | CUTE (0)
Wang, Babenko, and Belongie 2011b[42] - - 57.0 - - - - - -
Mishra, Alahari, and Jawahar 2012b[32] - - 73.2 - - - - - -
Wang et al. 2012 [43] - - 70.0 - - - -
Bissacco et al. 2013 [2] - - - - - 87.6 - - -
Almazan et al. 2014 [1] - - 89.2 - - - -
Yao et al. 2014 [48] - - 75.9 - - - -
Rodriguez-Serrano, Gordo, and Perronnin 2015 [35] - - 70.0 - - - -
Jaderberg, Vedaldi, and Zisserman 2014 [20] - - 86.1 - - - -
Su and Lu 2014 [40] - - 83.0 - - - -
Gordo 2015 [9] - - 91.8 - - - - - -
Jaderberg et al. 2016 [18] VGG,90k - 95.4 80.7 93.1 90.8 - - -
Jaderberg et al. 2015a [17] VGG,90k - 93.2 71.7 89.6 81.8 - - -
Shi, Bai, and Yao 2016 [36] VGG,90k 81.2 97.5 82.7 91.9 89.6 - - -
Shi et al. 2016 [37] VGG,90k 81.9 95.5 81.9 90.1 88.6 - 71.8 59.2
Lee and Osindero 2016 [25] VGG,90k 78.4 96.3 80.7 88.7 90.0 - - -
Yang et al. 2017 [47] VGG,Private - 95.2 - - - - 75.8 69.3
Cheng et al. 2017 [4] ResNet,(90k, ST) 87.4 97.1 85.9 94.2 93.3 70.6 - -
ASTER [38] ResNet,(90k, ST) 93.4 99.2 93.6 94.5 91.8 76.1 78.5 79.5
ASTER (ReIm) ResNet,(90k, ST) 92.4 97.5 86.1 93.1 90.5 74.8 76.7 78.8
ASTER (ReIm) + TextSR ResNet,(90k, ST) 92.5 98.0 87.2 93.2 91.3 75.6 77.4 78.9
ASTER (ReIm) ResNet,(90k, ST, FT) 95.6 99.4 95.1 96.5 92.1 77.5 86.7 85.4
ASTER (ReIm) + TextSR ResNet,(90k, ST, FT) 95.6 99.4 95.1 96.5 92.4 79.0 87.1 87.2
Table 3: Performance across a number of methods and datasets. ReIm is our re-implemented ASTER. "(50)" denotes recognition with a 50-word lexicon and "(0)" means no lexicon. ST and 90k denote the SynthText and Synth90k datasets. "Private" means private training data was used. FT means fine-tuning on real datasets.

Table 3 compares the recognition accuracy of a number of methods. Our re-implementation of the ASTER text recognizer differs slightly from the original paper. Equipped with TextSR, it achieves better performance on all datasets, showing the effectiveness of our proposed method.

To further test whether TextSR can improve the performance of a stronger text recognizer, we fine-tune ASTER on real datasets to obtain a more advanced model. TextSR still boosts its performance; in particular, the improvement on IC15 reaches 1.5%. We attribute this to the fact that IC15 contains a large amount of small text images, a problem that TextSR largely solves.

Extension on text detection images.

We extend our TextSR to text detection images. This is more challenging because these images contain much more irrelevant background. The visualization results are shown in Fig. 4. In the first row, our TextSR automatically super-resolves the text areas without any detection.

To further evaluate the robustness of TextSR, we randomly choose an image captured on the street with a mobile phone. Interestingly, TextSR successfully restores the English areas but fails on the Chinese areas. This is clearly because the model is not trained with Chinese data; adding Chinese data to the training process would immediately improve the restoration of Chinese text.

Conclusion

This work addresses the problem of small text recognition with content-aware super-resolution. To our knowledge, this is the first work attempting to solve small-text recognition with super-resolution methods. We introduce a novel Text Perceptual Loss (TPL) to help the generator restore the content in text images. Compared to standard super-resolution methods, the proposed method pays more attention to generating information about the text itself rather than the texture of the background area. It shows superior performance on various text recognition benchmarks. In the future, we will focus on reconstructing text from incomplete text images.

References

  • [1] J. Almazán, A. Gordo, A. Fornés, and E. Valveny (2014) Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Table 3.
  • [2] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven (2013) Photoocr: reading text in uncontrolled conditions. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: Table 3.
  • [3] F. L. Bookstein (1989) Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Network Architecture.
  • [4] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou (2017) Focusing attention: towards accurate text recognition in natural images. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: Related Work, Table 3.
  • [5] C. Dong, C. C. Loy, K. He, and X. Tang (2015) Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Related Work.
  • [6] C. Dong, C. C. Loy, and X. Tang (2016) Accelerating the super-resolution convolutional neural network. In Proc. Eur. Conf. Comp. Vis., Cited by: Related Work.
  • [7] G. Freedman and R. Fattal (2011) Image and video upscaling from local self-examples. ACM Transactions on Graphics. Cited by: Related Work.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proc. Adv. Neural Inf. Process. Syst., Cited by: Related Work.
  • [9] A. Gordo (2015) Supervised mid-level features for word image representation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: Table 3.
  • [10] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. Int. Conf. Mach. Learn., Cited by: Related Work.
  • [11] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber (2008) A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Network Architecture.
  • [12] S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang (2015) Convolutional sparse coding for image super-resolution. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: Related Work.
  • [13] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: Datasets.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: Network Architecture.
  • [15] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: Related Work.
  • [16] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227. Cited by: Datasets.
  • [17] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2015) Deep structured output learning for unconstrained text recognition. Proc. Int. Conf. Learn. Repren.. Cited by: Table 3.
  • [18] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2016) Reading text in the wild with convolutional neural networks. Int. J. Comp. Vis.. Cited by: Table 3.
  • [19] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Proc. Adv. Neural Inf. Process. Syst., Cited by: Related Work.
  • [20] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Deep features for text spotting. In Proc. Eur. Conf. Comp. Vis., Cited by: Related Work, Table 3.
  • [21] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In Proc. Eur. Conf. Comp. Vis., Cited by: SRGAN.
  • [22] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In Proc. IEEE Int. Conf. Doc. Anal. and Recogn., Cited by: Datasets.
  • [23] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras (2013) ICDAR 2013 robust reading competition. In Proc. IEEE Int. Conf. Doc. Anal. and Recogn., Cited by: Datasets.
  • [24] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: Introduction, Related Work, Network Architecture, Network Architecture, Ablation Study.
  • [25] C. Lee and S. Osindero (2016) Recursive recurrent nets with attention modeling for OCR in the wild. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: Table 3.
  • [26] W. Liu, C. Chen, K. K. Wong, Z. Su, and J. Han (2016) STAR-net: a spatial attention residue network for scene text recognition.. In Proc. Brit. Mach. Vis. Conf., Cited by: Related Work.
  • [27] Z. Liu, Y. Li, F. Ren, W. L. Goh, and H. Yu (2018) Squeezedtext: a real-time scene text recognition by binary convolutional encoder-decoder network. In Proc. AAAI Conf. on Arti. Intel., Cited by: Related Work.
  • [28] S. Long, X. He, and C. Yao (2018) Scene text detection and recognition: the deep learning era. arXiv preprint arXiv:1811.04256. Cited by: Introduction.
  • [29] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, et al. (2005) ICDAR 2003 robust reading competitions: entries, results, and future directions. Int. J. Document Analysis and Recognition. Cited by: Datasets.
  • [30] C. Luo, L. Jin, and Z. Sun (2019) Moran: a multi-object rectified attention network for scene text recognition. Pattern Recognition. Cited by: Related Work.
  • [31] A. Mishra, K. Alahari, and C. Jawahar (2012) Top-down and bottom-up cues for scene text recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: Datasets.
  • [32] A. Mishra, K. Alahari, and C. Jawahar (2012) Top-down and bottom-up cues for scene text recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: Table 3.
  • [33] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan (2013) Recognizing text with perspective distortion in natural scenes. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: Datasets.
  • [34] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan (2014) A robust arbitrary text detection system for natural scene images. Expert Systems with Applications. Cited by: Datasets.
  • [35] J. A. Rodriguez-Serrano, A. Gordo, and F. Perronnin (2015) Label embedding: a frugal baseline for text recognition. Int. J. Comp. Vis.. Cited by: Table 3.
  • [36] B. Shi, X. Bai, and C. Yao (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Table 3.
  • [37] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai (2016) Robust scene text recognition with automatic rectification. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: Table 3.
  • [38] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai (2018) Aster: an attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Related Work, Network Architecture, Ablation Study, Table 3.
  • [39] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. Proc. Int. Conf. Learn. Repren.. Cited by: Text Perceptual Loss.
  • [40] B. Su and S. Lu (2014) Accurate scene text recognition based on recurrent neural network. In Proc. Asian Conf. Comp. Vis., Cited by: Table 3.
  • [41] K. Wang, B. Babenko, and S. Belongie (2011) End-to-end scene text recognition. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: Datasets, Datasets.
  • [42] K. Wang, B. Babenko, and S. Belongie (2011) End-to-end scene text recognition. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: Table 3.
  • [43] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng (2012) End-to-end text recognition with convolutional neural networks. In Proc. IEEE Int. Conf. Patt. Recogn., Cited by: Table 3.
  • [44] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao (2019) Shape robust text detection with progressive scale expansion network. arXiv preprint arXiv:1903.12473. Cited by: Introduction.
  • [45] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, and C. Shen (2019) Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. arXiv preprint arXiv:1908.05900. Cited by: Introduction.
  • [46] E. Xie, Y. Zang, S. Shao, G. Yu, C. Yao, and G. Li (2018) Scene text detection with supervised pyramid context network. arXiv preprint arXiv:1811.08605. Cited by: Introduction.
  • [47] X. Yang, D. He, Z. Zhou, D. Kifer, and C. L. Giles (2017) Learning to read irregular text with attention mechanisms.. In Proc. Int. Joint Conf. Aritificial Intelligence, Cited by: Table 3.
  • [48] C. Yao, X. Bai, B. Shi, and W. Liu (2014) Strokelets: a learned multi-scale representation for scene text recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: Table 3.