Scene text recognition has witnessed rapid development with the advance of convolutional neural networks. Nonetheless, most previous methods may not work well in recognizing low-resolution text, which is often seen in natural scene images. An intuitive solution is to introduce super-resolution techniques as pre-processing. However, conventional super-resolution methods in the literature mainly focus on reconstructing the detailed texture of natural images, and typically do not work well for text because of its unique characteristics. To tackle these problems, we propose a content-aware text super-resolution network to generate the information desired for text recognition. In particular, we design an end-to-end network that can perform super-resolution and text recognition simultaneously. Different from previous super-resolution methods, we use the loss of text recognition as the Text Perceptual Loss to guide the training of the super-resolution network, so that it pays more attention to the text content rather than the irrelevant background area. Extensive experiments on several challenging benchmarks demonstrate the effectiveness of our proposed method in restoring a sharp high-resolution image from a small blurred one, and show that it clearly boosts the performance of the text recognizer. To our knowledge, this is the first work focusing on text super-resolution. Code will be released at https://github.com/xieenze/TextSR.
Scene text recognition is a fundamental and important task in computer vision, since it is usually a key step towards many downstream text-related applications, including document retrieval, card recognition, and many other Natural Language Processing (NLP) related applications. Text recognition has achieved remarkable success due to the development of Convolutional Neural Networks (CNNs) and text detection [46, 44, 45]. Many accurate and efficient methods have been proposed for most constrained scenarios (e.g., text in scanned copies or network images). Recent works focus on text in natural scenes, which is much more challenging due to the high diversity of text in blur, orientation, shape and resolution. A thorough survey on recent advances in text recognition can be found in .
Modern text recognizers have achieved impressive results on clear text images. However, their performance drops sharply when recognizing blurred text caused by low resolution or camera shake. The main difficulty in recognizing blurred text is the lack of detailed information about it. Super-resolution is a plausible method to tackle this problem. However, traditional super-resolution methods aim to reconstruct the detailed texture of natural images, which is not applicable to blurred text. Unlike the texture of natural images, scene text appears with arbitrary poses, illumination and blur, so super-resolution on scene text images is much more challenging. Therefore, we need a content-aware text super-resolution network to generate clear, sharp and identifiable text images for recognition.
To deal with these problems, we propose a content-aware text super-resolution network (TextSR), which combines a super-resolution network and a text recognition network. TextSR is an end-to-end network, in which the results of the text recognition network are fed back to guide the training of the super-resolution network. Under the guidance of the text recognition network, the super-resolution network focuses on refining the text region, and thus generates clear, sharp and identifiable text images. As shown in Fig. 1, there are three main components in our network: generator, discriminator and text recognizer. In the generator, a super-resolution network is used to up-sample the small blurred text to a fine scale for recognition. Compared with bilinear and bicubic interpolation, the generator can partly reduce artifacts and improve the quality of up-sampled images at large up-scaling factors. In the discriminator, a classification network is applied to distinguish the high-resolution image from the generated super-resolution image for adversarial training. Nevertheless, even with such a sophisticated generator and discriminator, the up-sampled images are unsatisfactory, usually blurry and lacking fine details, because the network is not aware of the text content. Therefore, we introduce a novel Text Perceptual Loss (TPL) to make the generator produce identifiable and clear text images. The TPL is provided by the text recognizer to guide the generator to produce clear text for easier recognition.
The contributions of this work are therefore three-fold: (1) We introduce a super-resolution method to facilitate scene text recognition, especially for small blurred text. (2) We propose a novel Text Perceptual Loss to make the generator aware of the text content and produce recognition-friendly information. (3) We demonstrate the effectiveness of our proposed method on several challenging public benchmarks, where it outperforms strong baseline approaches.
Super-Resolution. Super-resolution aims to output a plausible high-resolution image that is consistent with a given low-resolution image. Classical approaches, such as bilinear, bicubic or designed filtering, leverage the insight that neighbouring pixels usually exhibit similar colours and generate the output by interpolating between the colours of neighbouring pixels according to a predefined formula. Data-driven approaches make use of training data to assist with generating high-resolution predictions and directly copy pixels from the most similar patches in the training dataset. In contrast, optimization-based approaches formulate the problem as a sparse recovery problem, where the dictionary is learned from data. More recently, a deep neural net is trained on the input and target output pairs to minimize some distance metric between the prediction and the ground truth. Subsequently, it was shown that enabling the network to learn the up-scaling filters directly can further increase performance both in terms of accuracy and speed.
Generative Adversarial Network. The generative adversarial network (GAN) introduces the adversarial loss to generate realistic-looking images from random noise and achieves impressive results in many generation tasks, such as image generation, image editing, representation learning, image annotation and character transferring. GAN was first applied to super-resolution by SRGAN, which obtains promising results on natural images. However, the high-resolution images directly generated by SRGAN lack the fine details desired for the recognition task. Therefore, we put forward a content-aware super-resolution pipeline to recover the information that is friendly to text recognition.
Text Recognition. The task of text recognition is to recognize character sequences from cropped word image patches. With the rapid development of deep learning, a large number of effective frameworks have emerged in text recognition. Early work adopts either a bottom-up fashion, which first detects individual characters and then integrates them into a word, or a top-down manner, which treats the word image patch as a whole and recognizes it as a multi-class image classification problem. Considering that scene text generally appears as a character sequence, subsequent works regard it as a sequence recognition problem and employ Recurrent Neural Networks (RNNs) to model the sequential features. For instance, the Connectionist Temporal Classification (CTC) loss is often combined with the RNN outputs for calculating the conditional probability between the predicted sequences and the target [26, 27]. Recently, an increasing number of recognition approaches based on the attention mechanism have achieved significant improvements [4, 30]. Among them, ASTER rectifies oriented or curved text with a Spatial Transformer Network (STN) and then performs recognition using an attentional sequence-to-sequence model. In this work, we choose ASTER as our baseline.
In this section, we present our proposed method in detail. First, we give a brief description of SRGAN. Then, we present the overall architecture of our method, TextSR, as shown in Fig. 1, followed by our novel Text Perceptual Loss.
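The goal of super-resolution is to train a function that estimates, for a given low-resolution input image, its corresponding high-resolution counterpart. To achieve this, a generator network $G_{\theta_G}$, parametrized by $\theta_G$, is optimized by a super-resolution specific loss function $l^{SR}$ on training images $I^{LR}_n$ and corresponding $I^{HR}_n$, $n = 1, \dots, N$:
$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N} l^{SR}\left(G_{\theta_G}(I^{LR}_n),\, I^{HR}_n\right)$$
SRGAN defines $l^{SR}$ as the weighted sum of the content loss and the adversarial loss. Instead of employing the pixel-wise loss, it uses the perceptual loss as the content loss. The adversarial loss is implemented by a discriminator network $D_{\theta_D}$, parametrized by $\theta_D$, which is optimized in an alternating manner along with $G_{\theta_G}$ to solve the adversarial min-max problem:
$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I^{HR} \sim p_{\text{train}}(I^{HR})}\left[\log D_{\theta_D}(I^{HR})\right] + \mathbb{E}_{I^{LR} \sim p_G(I^{LR})}\left[\log\left(1 - D_{\theta_D}\left(G_{\theta_G}(I^{LR})\right)\right)\right]$$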
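To make the adversarial min-max game described above more concrete, the following minimal sketch evaluates both sides of it for single samples. It assumes the standard GAN form, in which the discriminator outputs the probability that its input is an HR image; the function names and the example probabilities are hypothetical and are not taken from the paper's code.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Negated discriminator objective for one HR/SR pair.

    d_real: discriminator output for a real HR image.
    d_fake: discriminator output for a generated SR image.
    The discriminator wants d_real -> 1 and d_fake -> 0.
    """
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))

def generator_adversarial_loss(d_fake, eps=1e-12):
    """Generator side of the min-max game: minimize log(1 - D(G(LR)))."""
    return math.log(1.0 - d_fake + eps)

# Hypothetical discriminator outputs, for illustration only.
confident = discriminator_loss(d_real=0.9, d_fake=0.1)  # discriminator separates HR from SR
confused = discriminator_loss(d_real=0.5, d_fake=0.5)   # discriminator cannot tell them apart
print(confident < confused)
```

The generator term decreases as the discriminator assigns a higher HR probability to generated images, which is what drives the generator towards realistic-looking outputs.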
The traditional way of leveraging SRGAN to help text recognition is to separately train a generator that transforms the low-resolution image into a high-resolution one under the guidance of the adversarial loss. However, such a conventional generator may focus on reconstructing the detailed texture of natural images, which is not applicable to text content. A more effective text super-resolution network needs a content-aware generator that produces clear, sharp and identifiable text images for recognition, rather than more details of the irrelevant background area. Therefore, we propose our method, TextSR.
First, our generator network up-samples a low-resolution image and outputs a super-resolution image. Second, a discriminator is used to distinguish whether an image is SR or HR. Furthermore, we add a text recognition branch to guide the generator to produce content-aware images.
Generator network. As shown in Fig. 1, we adopt a deep CNN architecture which has shown effectiveness for image super-resolution in . There are two deconvolutional layers in the network, and each layer consists of learned kernels that up-sample a low-resolution image to a 2× super-resolution image. We use batch normalization and rectified linear unit (ReLU) activation after each convolutional layer except the last layer. The generator network thus up-samples a low-resolution image and outputs a 4× super-resolution image.
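The way two 2× deconvolutional layers compose into a 4× up-sampler can be checked with simple size arithmetic. The sketch below uses the usual transposed-convolution output-size formula; the kernel size, stride and padding are assumed values chosen so that each layer exactly doubles the resolution, since the paper text does not state them.

```python
def deconv_output_size(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one axis.

    kernel, stride and padding are assumptions for illustration.
    """
    return (size - 1) * stride - 2 * padding + kernel

def generator_output_shape(height, width, num_deconv=2):
    """Spatial shape after num_deconv up-sampling layers."""
    for _ in range(num_deconv):
        height = deconv_output_size(height)
        width = deconv_output_size(width)
    return height, width

# A 32x8 (width x height) low-resolution crop becomes 128x32 after two layers.
print(generator_output_shape(8, 32))  # (32, 128)
```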
Discriminator network. We employ the same architecture as in  for our backbone network in the discriminator, as shown in Fig. 1. The input is the super-resolution image or HR image, and the output is the probability of the input being an HR image.
Recognition Network. To show the effectiveness of our method, we adopt ASTER as our base recognition network. ASTER is a state-of-the-art text recognizer composed of a text rectification network and a text recognition network. The text rectification network rectifies the character arrangement in the input image by using Thin-Plate-Spline, a non-rigid deformation transformation that rearranges irregular text into a horizontal layout.
Based on the rectified text image, the text recognition network directly predicts the sequence of characters through sequence-to-sequence translation. The text recognition network consists of two parts: encoder and decoder. The encoder is used to extract the features of the rectified text image. It consists of residual blocks . Following the residual blocks, the feature map is converted into a feature sequence by being split along its row axis. There are two layers of Bidirectional LSTM (BLSTM) to capture long-range dependencies in both directions. Each one consists of a pair of LSTMs. The outputs of the LSTMs are concatenated and linearly projected before entering the next layer. The decoders are attentional LSTMs that recognize 94 character classes, including digits, upper-case and lower-case letters, and 32 ASCII punctuation marks. At each time step, the decoder predicts either a character or an end-of-sequence symbol (EOS). The loss of text recognition is defined as:
$$l_{rec} = -\sum_{t=1}^{T} \log p\left(y_t \mid I^{SR}\right)$$
where $T$ is the length of the symbol sequence, $I^{SR}$ is the generated super-resolution image, $y = (y_1, \dots, y_T)$ is the ground-truth text represented by a character sequence, and $p$ is the corresponding output of the decoder.
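As a worked illustration of the recognition loss, the sketch below computes a negative log-likelihood over decoding steps. It assumes the loss is the sum, over time steps, of the negative log-probability that the decoder assigns to the ground-truth symbol; the per-step distributions and the `<EOS>` token name are made up for the example.

```python
import math

def recognition_loss(step_probs, target):
    """Negative log-likelihood of the ground-truth symbol sequence.

    step_probs: one dict per decoding step, mapping symbol -> probability.
    target: ground-truth symbols, ending with the end-of-sequence symbol.
    """
    if len(step_probs) != len(target):
        raise ValueError("one distribution per target symbol is required")
    return -sum(math.log(probs[symbol]) for probs, symbol in zip(step_probs, target))

# Hypothetical decoder outputs for the word "hi" followed by EOS.
step_probs = [
    {"h": 0.9, "n": 0.1},
    {"i": 0.8, "l": 0.2},
    {"<EOS>": 0.5, "t": 0.5},
]
loss = recognition_loss(step_probs, ["h", "i", "<EOS>"])
print(round(loss, 4))
```

A sharper super-resolved image that lets the decoder place more probability on the correct characters yields a smaller loss, which is the signal passed back to the generator.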
Training the generator network with only the adversarial loss from the discriminator leads the generator to focus on reconstructing the detailed texture of images without understanding the content of the text image. This works on natural images, but it is not practical for text images, where content is more important than texture. Here, we introduce the Text Perceptual Loss (TPL), inspired by the perceptual loss, which is widely used in super-resolution and other low-level vision tasks. The perceptual loss uses a pre-trained VGG network and calculates the similarity between the feature maps of the super-resolution images and the original images. The perceptual loss helps the network understand the general content of the image, since VGG is pre-trained on ImageNet, which contains 1000 kinds of objects.
To this end, we design TPL by back-propagating the loss of text recognition into the training of the generator network. Specifically, the super-resolution images produced by the generator are fed directly into the text recognizer. As a result, the generator is trained to minimize the text recognition loss on the training images and their corresponding ground-truth character sequences.
TPL helps the generator produce more content-realistic images, which are friendlier to the text recognizer. There are three approaches to supervising the generator with TPL:
End-to-end training ASTER and generator simultaneously from random initialization.
Training ASTER first and then end-to-end training ASTER and generator simultaneously.
Training ASTER first and then training generator while freezing the parameters of ASTER.
Our experiments show that the three approaches result in very similar performance, so we only report the results of the third approach.
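The third approach above (train the recognizer first, then train the generator while the recognizer is frozen) can be illustrated with a deliberately tiny stand-in. In this sketch the "generator" is a single scalar gain `a`, the "recognizer" is a fixed scalar gain `w`, and a squared error plays the role of the recognition loss; all of these are toy assumptions, not the networks used in the paper. The point is only that the gradient flows through the frozen recognizer into the generator parameter.

```python
def train_generator_with_frozen_recognizer(a, w, data, lr=0.01, steps=200):
    """Toy version of TPL training with a frozen recognizer.

    Generator stand-in:  G(x) = a * x   (a is trainable)
    Recognizer stand-in: R(z) = w * z   (w is frozen)
    Loss stand-in:       (R(G(x)) - y) ** 2
    """
    for _ in range(steps):
        grad_a = 0.0
        for x, y in data:
            error = w * (a * x) - y
            grad_a += 2.0 * error * w * x  # chain rule through the frozen recognizer
        a -= lr * grad_a / len(data)       # only the generator parameter is updated
    return a, w

# Hypothetical data generated so that the ideal generator gain is 3.0 when w = 2.0.
data = [(1.0, 6.0), (2.0, 12.0)]
a_trained, w_after = train_generator_with_frozen_recognizer(a=0.5, w=2.0, data=data)
print(a_trained, w_after)
```

The recognizer parameter is returned unchanged, while the generator parameter moves to whatever value makes the recognizer's output match the target.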
During training, we prepare the LR-HR image pairs and train the three sub-nets, as shown by the blue lines in Fig. 1. During inference, we select the images whose size is smaller than 128×32 and feed them into the generator. Finally, these restored images are recognized by ASTER. The inference steps are shown by the red lines in Fig. 1.
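The inference routing described above can be sketched as a small dispatch function. This is a minimal sketch under two assumptions: that "smaller than 128×32" means both the width is below 128 and the height is below 32, and that the generator and recognizer are available as plain callables (the stubs below are placeholders, not real models).

```python
def needs_super_resolution(width, height, max_width=128, max_height=32):
    """Assumed reading of the size rule: both sides fall below the threshold."""
    return width < max_width and height < max_height

def recognize(image, size, generator, recognizer):
    """Route small crops through the generator before recognition."""
    width, height = size
    if needs_super_resolution(width, height):
        image = generator(image)
    return recognizer(image)

# Placeholder callables that only tag the input, to show the control flow.
stub_generator = lambda img: "SR(" + img + ")"
stub_recognizer = lambda img: "text from " + img

print(recognize("crop_a", (40, 10), stub_generator, stub_recognizer))
print(recognize("crop_b", (200, 64), stub_generator, stub_recognizer))
```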
We evaluate the proposed method on several challenging public benchmarks and compare its performance with other state-of-the-art methods.
The proposed model is trained on a combination of a synthetic dataset and the training set of ICDAR, without fine-tuning on other datasets. The model is tested on 7 standard datasets to evaluate its general recognition performance.
Synth90k is the synthetic text dataset proposed in . The dataset contains 9 million images. Words are rendered onto natural images with random transformations and effects. All of the images in this dataset are taken for training.
SynthText is the synthetic text dataset proposed in . It is proposed for text detection. We crop the words using the groundtruth word bounding boxes.
IIIT5k-Words (IIIT5k) contains 3000 test images. Each image is associated with a short, 50-word lexicon and a long, 1000-word lexicon.
Street View Text (SVT)  is collected from the Google Street View. The test set contains 647 images of cropped words. Each image is associated with a 50-word lexicon.
ICDAR 2003 (IC03) contains 860 images of cropped words after filtering, which discards words that contain non-alphanumeric characters or have fewer than three characters. Each image has a 50-word lexicon defined in .
ICDAR 2013 (IC13)  inherits most images from IC03 and extends it with new images. The dataset is filtered by removing words that contain non-alphanumeric characters. The dataset contains 1015 images. No lexicon is provided.
ICDAR 2015 Incidental Text (IC15)  contains a lot of irregular text. Testing images are obtained by cropping the words using the groundtruth word bounding boxes.
SVT-Perspective (SVTP) is proposed in  for evaluating the performance of recognizing perspective text. Images in SVTP are picked from the side-view images in Google Street View. The dataset consists of 639 cropped images for testing.
CUTE80 (CUTE) is proposed in  for the curved text. It contains 80 high-resolution images taken in natural scenes. We crop the annotated words and obtain a test set of 288 images.
During training, we set the trade-off weights of all losses to 1. We use the Adam optimizer with momentum term 0.9. The generator network and discriminator network are trained from scratch; the weights in each layer are initialized with a zero-mean Gaussian distribution with standard deviation 0.02, and biases are initialized with 0. We use the ASTER recognizer from the source code at https://github.com/ayumiymk/aster.pytorch, whose performance is slightly different from that reported in the original paper.
The model is trained in batches of 256 examples for 50k iterations on 4 Tesla M40 GPUs. For super-resolution tasks, we use SynthText as our training data and filter the images whose size is smaller than 128×32. In this way we only use 1.29 million images from SynthText. The low-resolution images are 4× down-sampled from the original images. The learning rate is set to 1.0 initially and decayed to 0.1 and 0.01 at steps 30k and 40k respectively.
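The step-wise learning rate schedule above can be written as a small function. This is a sketch that assumes the decay takes effect exactly at the stated steps; the function and parameter names are illustrative.

```python
def learning_rate(step, base_lr=1.0, milestones=(30_000, 40_000), gamma=0.1):
    """Multiply the learning rate by gamma at each milestone that has been reached."""
    lr = base_lr
    for milestone in milestones:
        if step >= milestone:
            lr *= gamma
    return lr

for step in (0, 29_999, 30_000, 40_000, 49_999):
    print(step, learning_rate(step))
```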
We first show the importance of content awareness for super-resolution image quality by comparing with the direct use of SRGAN. Then, we compare our proposed method with the baseline text recognizer ASTER to demonstrate the effectiveness of applying our super-resolution strategy to text recognition. Moreover, we visualize activation heatmaps of the generator to show its response on text areas.
Effect on super-resolution. To demonstrate the impact on super-resolution image quality, we carry out experiments on 7 datasets to evaluate SRGAN and TextSR. Details are shown in Table 1. Our TextSR clearly surpasses SRGAN in PSNR and SSIM on all datasets. Moreover, as seen in Fig. 2, our TextSR generates sharper high-resolution text images than SRGAN. We attribute this to the supervision of TPL, which lets the generator understand the content of the text images.
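For readers unfamiliar with the PSNR metric used in this comparison, the following minimal sketch computes it from the mean squared error between a reference image and a restored one. It assumes 8-bit intensities (peak value 255) and images given as nested lists; the pixel values are invented for illustration and are unrelated to the reported results.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    squared_error = 0.0
    count = 0
    for ref_row, res_row in zip(reference, restored):
        for ref_pixel, res_pixel in zip(ref_row, res_row):
            squared_error += (ref_pixel - res_pixel) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Invented 2x2 images: a reference, a close restoration and a poor one.
hr = [[0, 255], [255, 0]]
close = [[0, 250], [250, 0]]
far = [[50, 200], [200, 50]]
print(psnr(hr, close) > psnr(hr, far))
```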
Effect on text recognition. To demonstrate the effect of the content-aware TPL, we set up three experiments, as shown in Table 2. We down-sample the original images to five different scales and then up-sample them in three ways. The first is the baseline without super-resolution, the second uses SRGAN, and the last is equipped with our content-aware TPL. Their performance is evaluated on the ICDAR2013 test dataset. Although SRGAN partly improves the performance compared to BICUBIC, our TextSR greatly boosts the performance on very tiny images. Specifically, for images of size 20×5, TextSR improves on SRGAN by a surprising 22.8%.
Moreover, we show the intermediate activation heatmap. The activations are taken from the penultimate layer of the generator. To obtain a more comprehensive comparison of the response on text areas and irrelevant backgrounds, we feed a street-view picture to the generator and visualize its activation heatmap, shown in Fig. 3. With the guidance of TPL, TextSR responds more strongly to text areas than SRGAN, which explains why TextSR generates clearer, sharper and more identifiable text images.
We compare the performance of our model with other state-of-the-art models. Some datasets provide lexicons for constraining recognition outputs. When a lexicon is given, we simply replace the predicted word with the nearest lexicon word under the metric of edit distance.
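The lexicon-constrained decoding described above can be sketched with a standard Levenshtein distance. This is a minimal sketch: the case-insensitive comparison and the example lexicon are assumptions made for illustration, and ties are resolved in favour of the first lexicon entry.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]

def constrain_to_lexicon(prediction, lexicon):
    """Replace the prediction with the nearest lexicon word under edit distance."""
    return min(lexicon, key=lambda word: edit_distance(prediction.lower(), word.lower()))

# Hypothetical recognizer output and lexicon.
print(constrain_to_lexicon("stre3t", ["street", "store", "start"]))
```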
| Method | ConvNet, Data | IIIT5k | SVT (50) | SVT | IC03 | IC13 | IC15 | SVTP | CUTE |
|---|---|---|---|---|---|---|---|---|---|
| Wang, Babenko, and Belongie 2011b | - | - | 57.0 | - | - | - | - | - | - |
| Mishra, Alahari, and Jawahar 2012b | - | - | 73.2 | - | - | - | - | - | - |
| Wang et al. 2012 | - | - | 70.0 | - | - | - | - | - | - |
| Bissacco et al. 2013 | - | - | - | - | - | 87.6 | - | - | - |
| Almazan et al. 2014 | - | - | 89.2 | - | - | - | - | - | - |
| Yao et al. 2014 | - | - | 75.9 | - | - | - | - | - | - |
| Rodriguez-Serrano, Gordo, and Perronnin 2015 | - | - | 70.0 | - | - | - | - | - | - |
| Jaderberg, Vedaldi, and Zisserman 2014 | - | - | 86.1 | - | - | - | - | - | - |
| Su and Lu 2014 | - | - | 83.0 | - | - | - | - | - | - |
| Gordo 2015 | - | - | 91.8 | - | - | - | - | - | - |
| Jaderberg et al. 2016 | VGG, 90k | - | 95.4 | 80.7 | 93.1 | 90.8 | - | - | - |
| Jaderberg et al. 2015a | VGG, 90k | - | 93.2 | 71.7 | 89.6 | 81.8 | - | - | - |
| Shi, Bai, and Yao 2016 | VGG, 90k | 81.2 | 97.5 | 82.7 | 91.9 | 89.6 | - | - | - |
| Shi et al. 2016 | VGG, 90k | 81.9 | 95.5 | 81.9 | 90.1 | 88.6 | - | 71.8 | 59.2 |
| Lee and Osindero 2016 | VGG, 90k | 78.4 | 96.3 | 80.7 | 88.7 | 90.0 | - | - | - |
| Yang et al. 2017 | VGG, Private | - | 95.2 | - | - | - | - | 75.8 | 69.3 |
| Cheng et al. 2017 | ResNet, (90k, ST) | 87.4 | 97.1 | 85.9 | 94.2 | 93.3 | 70.6 | - | - |
| ASTER | ResNet, (90k, ST) | 93.4 | 99.2 | 93.6 | 94.5 | 91.8 | 76.1 | 78.5 | 79.5 |
| ASTER (ReIm) | ResNet, (90k, ST) | 92.4 | 97.5 | 86.1 | 93.1 | 90.5 | 74.8 | 76.7 | 78.8 |
| ASTER (ReIm) + TextSR | ResNet, (90k, ST) | 92.5 | 98.0 | 87.2 | 93.2 | 91.3 | 75.6 | 77.4 | 78.9 |
| ASTER (ReIm) | ResNet, (90k, ST, FT) | 95.6 | 99.4 | 95.1 | 96.5 | 92.1 | 77.5 | 86.7 | 85.4 |
| ASTER (ReIm) + TextSR | ResNet, (90k, ST, FT) | 95.6 | 99.4 | 95.1 | 96.5 | 92.4 | 79.0 | 87.1 | 87.2 |
Table 3 compares the recognition accuracy across a number of methods. The results of our re-implementation of the ASTER text recognizer (ReIm) are slightly different from those in the original paper. Equipped with TextSR, it achieves better performance on all datasets, showing the effectiveness of our proposed method.
To further test whether TextSR can improve the performance of a stronger text recognizer, we fine-tune ASTER on real datasets and obtain a more advanced model. Surprisingly, TextSR still boosts its performance. In particular, the improvement on IC15 reaches 1.5%. We attribute this to the fact that IC15 contains a large number of small text images, a case that TextSR handles well.
We extend our TextSR to text detection images. This is more challenging because these images contain more irrelevant background. The visualization results are shown in Fig. 4. In the first row, we find that TextSR automatically super-resolves text areas without any detection.
To further evaluate the robustness of TextSR, we randomly choose a street image captured with our mobile phone. Interestingly, TextSR successfully restores the English text areas but fails to restore the Chinese ones. We attribute this to the absence of Chinese data in training; adding Chinese data to the training process would likely improve the restoration of Chinese text.
This work addresses the problem of small text recognition with content-aware super-resolution. To our knowledge, this is the first work attempting to recognize small text with super-resolution methods. We add a novel Text Perceptual Loss (TPL) to help the generator restore the content of text images. Compared to standard super-resolution methods, the proposed method pays more attention to generating information about the text itself, rather than the texture of the background area. It shows superior performance on various text recognition benchmarks. In the future, we will focus on reconstructing text from incomplete text images.
Lee, C.-Y., and Osindero, S. 2016. Recursive recurrent nets with attention modeling for OCR in the wild. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.